[00:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T0000). [00:02:42] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator, part 1 (duration: 00m 55s) [00:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:00] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator, part 2 (duration: 00m 55s) [00:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:17] (03PS6) 10Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 [00:11:23] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Factor out write of variant config into MWConfigCacheGenerator (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 (owner: 10Jforrester) [00:12:03] (03Merged) 10jenkins-bot: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 (owner: 10Jforrester) [00:12:10] (03CR) 10jenkins-bot: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 (owner: 10Jforrester) [00:19:58] (03PS2) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [00:21:03] (03CR) 10jerkins-bot: [V: 04-1] Variant configuration: Write to static (JSON) as well as serialised cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 
10Jforrester) [00:21:36] (03PS1) 10Jforrester: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 [00:22:29] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 (owner: 10Jforrester) [00:23:58] (03PS2) 10Jforrester: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 [00:25:36] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 (owner: 10Jforrester) [00:25:55] (03CR) 10Jforrester: [C: 03+2] "Parent patch was never deployed so this won't mess up production caches." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 (owner: 10Jforrester) [00:26:36] (03Merged) 10jenkins-bot: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 (owner: 10Jforrester) [00:27:30] (03CR) 10jenkins-bot: CommonSettings: Push back the mtime/globals muxing into CommonSettings (ahead of deployment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534549 (owner: 10Jforrester) [00:39:29] (03PS1) 10Jforrester: CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 [00:39:37] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:40:22] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings: Fix path vs. 
filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:40:33] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:43:23] (03PS2) 10Jforrester: CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 [00:44:14] (03CR) 10jerkins-bot: [V: 04-1] CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:45:31] (03PS3) 10Jforrester: CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 [00:48:36] (03CR) 10Jforrester: [C: 03+2] CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:49:39] (03Merged) 10jenkins-bot: CommonSettings: Fix path vs. filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:49:55] (03CR) 10jenkins-bot: CommonSettings: Fix path vs. 
filename difference [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534550 (owner: 10Jforrester) [00:51:35] (03CR) 10Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 (owner: 10Jforrester) [00:54:48] !log jforrester@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator, part 1 (duration: 00m 56s) [00:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:56] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator, part 2 (duration: 00m 53s) [00:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:03] Phew. [00:56:09] OK, conch returned. [01:04:05] (03PS3) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [01:04:30] (03CR) 10Jforrester: [C: 04-1] "Let's do another chat before this." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [01:24:05] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:25:11] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:10:31] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) restbase-dev1004 has been decommissioned and can come down for a re-image at any time. /cc @MoritzMuehl... [02:36:15] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 100544568 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:36:33] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 878137008 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:45:23] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48256 and 64 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:45:43] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4648 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:55:15] (03PS2) 10Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) 
[03:57:32] !log switching cp1076 from nginx to ats-tls - T231433 [03:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:57:35] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [03:57:53] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/532987 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [03:58:02] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/532987 (https://phabricator.wikimedia.org/T231433) [04:00:33] (03CR) 10Dzahn: [C: 03+2] "Arlolra, sorry for the delay with this. i was on vacation and forgot about this until i saw your mail again the other day." [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) (owner: 10Dzahn) [04:01:04] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/532988 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:01:09] (03PS3) 10Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) [04:01:21] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to 4443 on cp1076 [puppet] - 10https://gerrit.wikimedia.org/r/532988 (https://phabricator.wikimedia.org/T231433) [04:02:26] !log upgrading ATS to 8.0.5-1wm5 on cp1076 - T231433 [04:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:04:16] PROBLEM - HTTPS Unified RSA on cp1076 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:04:52] ^ that's expected [04:05:04] yep, saw the log line matching it :) [04:05:18] PROBLEM - HTTPS Unified ECDSA on cp1076 is CRITICAL: SSL CRITICAL - failed 
to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:05:20] (03PS4) 10Dzahn: ATS/varnish/parsoid-testing: remove directors for parsoid-vd/parsoid-rt [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) [04:05:42] RECOVERY - HTTPS Unified RSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345577 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:06:42] RECOVERY - HTTPS Unified ECDSA on cp1076 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345515 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:08:52] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp1076 is CRITICAL: connect to address 10.64.0.131 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:13:26] (03CR) 10Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534421 (owner: 10Dzahn) [04:13:55] (03PS2) 10Dzahn: add fake SSL key for releases.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/534408 [04:14:06] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake SSL key for releases.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/534408 (owner: 10Dzahn) [04:14:19] (03PS1) 10Dzahn: remove parsoid-vd/parsoid-rt.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/534554 (https://phabricator.wikimedia.org/T229356) [04:14:52] (03PS2) 10Dzahn: add fake SSL key for webperf.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/534409 [04:15:48] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake SSL key for 
webperf.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/534409 (owner: 10Dzahn) [04:17:24] the puppet-merge just took under 10 seconds. mentioning that because i saw comments how yesterday it was 1 minute [04:18:34] under 10 secs? [04:18:59] yea, really quick to me [04:19:01] last one for me felt way slower than that [04:19:08] I'll time the next one [04:19:14] my change was in labs/private though [04:20:11] !log switching cp3034 from nginx to ats-tls - T231433 [04:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:20:14] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:20:21] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/532989 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:20:30] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/532989 (https://phabricator.wikimedia.org/T231433) [04:22:49] mutante: real 1m8.243s [04:23:43] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/532990 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:23:46] (03CR) 10Dzahn: [C: 03+2] add certificate for webperf.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/534412 (owner: 10Dzahn) [04:23:56] (03PS2) 10Dzahn: add certificate for webperf.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/534412 [04:24:42] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/532990 (https://phabricator.wikimedia.org/T231433) [04:25:56] sigh... 
[04:26:12] (03PS3) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp3034 [puppet] - 10https://gerrit.wikimedia.org/r/532990 (https://phabricator.wikimedia.org/T231433) [04:26:22] real 1m4.667s [04:26:30] hmm.. ok.. because i said something :p [04:26:46] private VS puppet repo [04:27:06] PROBLEM - HTTPS Unified RSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:27:08] yea, it's smarter than i thought and only syncs what is needed i guess [04:27:16] PROBLEM - HTTPS Unified ECDSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:28:22] ^^ expected as well [04:31:06] RECOVERY - HTTPS Unified RSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345558 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:31:06] (03CR) 10Dzahn: [C: 03+2] add discovery CNAME for webperf [dns] - 10https://gerrit.wikimedia.org/r/534414 (owner: 10Dzahn) [04:31:12] (03PS2) 10Dzahn: add discovery CNAME for webperf [dns] - 10https://gerrit.wikimedia.org/r/534414 [04:31:16] RECOVERY - HTTPS Unified ECDSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345549 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:31:56] !log upgrading ATS to 8.0.5-1wm5 on cp3034 - T231433 [04:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:59] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:37:08] wtf.. 
i merged a DNS change and authdns-update shows stuff that is not in my gerrit change [04:37:10] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx to port 4443 on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/532991 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:37:15] not merging that but now stuck [04:37:19] (03PS2) 10Vgutierrez: hiera: Move nginx to port 4443 on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/532991 (https://phabricator.wikimedia.org/T231433) [04:37:43] !log switching cp4021 from nginx to ats-tls - T231433 [04:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:46] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:38:56] (03CR) 10Dzahn: "please don't merge without also running authdns-update. that caused my great confusion when i got unexpected changes after the next merge" [dns] - 10https://gerrit.wikimedia.org/r/534500 (owner: 10Papaul) [04:40:03] DNS merge does not have the "warning. type multiple" thing when there is more than one change [04:40:22] nope [04:40:48] that was pretty unexpected to see unrelated stuff but i confirmed it's the last merge before mine [04:41:15] somebody merged something and didn't run authdns-update? 
[04:41:21] yep [04:41:53] maybe we should have a check for unmerged changes, like in puppet [04:42:29] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/532992 (https://phabricator.wikimedia.org/T231433) (owner: 10Vgutierrez) [04:42:31] yea, that would have been safer [04:42:44] !log upgrading ATS to 8.0.5-1wm5 on cp4021 - T231433 [04:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:42:47] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [04:42:50] it's a permission issue :( [04:43:15] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/532992 (https://phabricator.wikimedia.org/T231433) [04:44:19] PROBLEM - HTTPS Unified RSA on cp4021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:44:42] ^^expected :) [04:45:25] PROBLEM - HTTPS Unified ECDSA on cp4021 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [04:50:45] RECOVERY - HTTPS Unified ECDSA on cp4021 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345551 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:52:08] RECOVERY - HTTPS Unified RSA on cp4021 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345470 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [04:53:06] (03PS2) 10Dzahn: add certificate for releases.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/534406 [04:56:42] (03CR) 
10Vgutierrez: "+1 to the acme-chief change but IMHO this should be split in two different commits" [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov) [04:57:19] !log rearming keyholder on cumin1001 [04:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:40] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [05:01:03] (03CR) 10Dzahn: [C: 03+2] add certificate for releases.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/534406 (owner: 10Dzahn) [05:12:01] (03CR) 10Marostegui: [C: 03+2] realm.pp: Remove filejournal table from private list [puppet] - 10https://gerrit.wikimedia.org/r/534392 (https://phabricator.wikimedia.org/T51195) (owner: 10Marostegui) [05:12:05] ganeti2005.mgmt is behaving weirdly [05:12:08] (03PS3) 10Marostegui: realm.pp: Remove filejournal table from private list [puppet] - 10https://gerrit.wikimedia.org/r/534392 (https://phabricator.wikimedia.org/T51195) [05:14:55] !log Restart MySQL on codfw sanitariums (db2094 and db2095) to pick up new filters - T51195 [05:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:58] T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 [05:19:37] !log ganeti2005 - reset DRAC via local IPMI since mgmt stopped responding [05:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:22] !log ganeti2005 - DRAC reset fails - ipmi_cmd_cold_reset: bad completion code [05:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:33] !log Restart wikibugs [05:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:58] !log Restart MySQL on codfw sanitariums (db1124 and db1125) to pick up new filters - T51195 [05:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:01] 
T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 [05:37:06] ACKNOWLEDGEMENT - SSH ganeti2005.mgmt on ganeti2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T232067 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:38:36] (03PS1) 10Vgutierrez: ATS: Disable SSL Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/534558 (https://phabricator.wikimedia.org/T231849) [05:41:06] (03PS1) 10Marostegui: report_users: Add dbproxy1017 IP [software] - 10https://gerrit.wikimedia.org/r/534559 [05:41:39] (03CR) 10Marostegui: [C: 03+2] report_users: Add dbproxy1017 IP [software] - 10https://gerrit.wikimedia.org/r/534559 (owner: 10Marostegui) [05:41:54] (03PS5) 10Marostegui: mariadb: Promote db1109 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/531189 (https://phabricator.wikimedia.org/T230762) [05:42:01] (03PS4) 10Marostegui: wmnet: Update s8-master record [dns] - 10https://gerrit.wikimedia.org/r/531455 (https://phabricator.wikimedia.org/T230762) [05:42:16] (03PS2) 10Vgutierrez: ATS: Disable SSL Session tickets [puppet] - 10https://gerrit.wikimedia.org/r/534558 (https://phabricator.wikimedia.org/T231849) [05:42:56] !log Remove grants for dbproxy1005 T231280 T231967 [05:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:01] T231967: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 [05:43:02] T231280: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 [05:44:16] (03CR) 10Vgutierrez: [C: 03+2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18181/" [puppet] - 10https://gerrit.wikimedia.org/r/534558 (https://phabricator.wikimedia.org/T231849) (owner: 10Vgutierrez) [05:44:55] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 
(10Dzahn) [05:45:50] (03CR) 10Urbanecm: [C: 03+1] "Lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) (owner: 10Zoranzoki21) [05:45:52] ACKNOWLEDGEMENT - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 1044 MB (0% inode=81%): daniel_zahn https://phabricator.wikimedia.org/T232068 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [05:52:00] 10Operations, 10Analytics, 10Analytics-Cluster: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (10Dzahn) [05:52:37] ACKNOWLEDGEMENT - Check systemd state on analytics1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T232069 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:37] ACKNOWLEDGEMENT - MegaRAID on analytics1045 is CRITICAL: CRITICAL: 12 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough daniel_zahn https://phabricator.wikimedia.org/T232069 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:53:27] ACKNOWLEDGEMENT - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn envoy fails to start - WIP (dzahn) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:58:07] (03PS1) 10Marostegui: report_users: Remove dbproxy1005 IP [software] - 10https://gerrit.wikimedia.org/r/534562 [05:59:16] (03CR) 10Marostegui: [C: 03+2] report_users: Remove dbproxy1005 IP [software] - 10https://gerrit.wikimedia.org/r/534562 (owner: 10Marostegui) [06:03:35] 10Operations, 10DBA: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) [06:05:06] (03PS2) 10Dzahn: add discovery CNAME for releases [dns] - 10https://gerrit.wikimedia.org/r/534402 [06:05:28] (03CR) 10jerkins-bot: [V: 04-1] add discovery CNAME for releases [dns] - 10https://gerrit.wikimedia.org/r/534402 (owner: 10Dzahn) [06:08:36] (03PS1) 10Marostegui: mariadb: Decommission dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534563 (https://phabricator.wikimedia.org/T231967) [06:09:02] (03PS3) 10Dzahn: add discovery CNAME for releases [dns] - 10https://gerrit.wikimedia.org/r/534402 [06:10:37] (03PS1) 10Dzahn: requesttracker: use unprivileged port 1443 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534564 [06:14:50] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 39 probes of 453 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [06:14:53] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission dbproxy1005 [puppet] - 10https://gerrit.wikimedia.org/r/534563 (https://phabricator.wikimedia.org/T231967) (owner: 10Marostegui) [06:18:15] (03PS1) 10Marostegui: mariadb: Fix typo on dbproxy1005 spare role [puppet] - 10https://gerrit.wikimedia.org/r/534565 [06:19:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Fix typo on dbproxy1005 spare role [puppet] - 10https://gerrit.wikimedia.org/r/534565 (owner: 10Marostegui) [06:19:52] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 24 probes of 
453 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [06:30:52] (03PS2) 10KartikMistry: Update cxserver to 2019-09-04-065911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/534427 (https://phabricator.wikimedia.org/T213255) [06:31:14] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Update cxserver to 2019-09-04-065911-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/534427 (https://phabricator.wikimedia.org/T213255) (owner: 10KartikMistry) [06:37:36] PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107908 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [06:38:00] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [06:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:15] 10Operations, 10DBA, 10Patch-For-Review: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) [06:39:35] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [06:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:11] 10Operations, 10ops-eqiad, 10decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) a:05Marostegui→03RobH [06:40:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (10Marostegui) This host is ready for #dc-ops to get it decommissioned [06:41:48] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[06:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:05] !log Updated cxserver to 2019-09-04-065911-production (T213255, T206310) [06:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:08] T213255: CX2: Doesn't handle correctly ISBN, should not put nowiki tags around them - https://phabricator.wikimedia.org/T213255 [06:44:09] T206310: CX2: Highlight references with a template that is missing mandatory parameters after being adapted - https://phabricator.wikimedia.org/T206310 [06:51:34] (03CR) 10Dzahn: [C: 03+2] add discovery CNAME for releases [dns] - 10https://gerrit.wikimedia.org/r/534402 (owner: 10Dzahn) [06:55:15] (03CR) 10Tim Starling: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/534405 (owner: 10Giuseppe Lavagetto) [06:58:43] (03CR) 10Dzahn: [C: 03+2] requesttracker: use unprivileged port 1443 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534564 (owner: 10Dzahn) [06:58:53] (03PS2) 10Dzahn: requesttracker: use unprivileged port 1443 for envoy [puppet] - 10https://gerrit.wikimedia.org/r/534564 [06:59:46] (03PS1) 10Muehlenhoff: Switch remaining restbase servers to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/534572 [07:01:29] (03PS2) 10Muehlenhoff: Switch remaining restbase servers to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/534572 [07:01:45] (03CR) 10Nikerabbit: Add Draft and Draft_talk aliases for wikis that define draft namespace (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: 10Petar.petkovic) [07:03:02] RECOVERY - Check systemd state on ununpentium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:03:27] (03CR) 10Muehlenhoff: [C: 03+2] Switch remaining restbase servers to Stretch [puppet] - 10https://gerrit.wikimedia.org/r/534572 (owner: 10Muehlenhoff) [07:03:46] (03PS3) 
10Muehlenhoff: Add library hint for nghttp2 [puppet] - 10https://gerrit.wikimedia.org/r/534449 [07:05:46] (03PS1) 10Marostegui: wmnet: Replace dbproxy1010 with dbproxy1018 [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) [07:07:14] PROBLEM - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:19] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for nghttp2 [puppet] - 10https://gerrit.wikimedia.org/r/534449 (owner: 10Muehlenhoff) [07:07:27] !log ununpentium - manually delete /etc/envoy/listeners.d/00-tls_terminator_443.yaml after changing port to 1443 - puppet does not remove it [07:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:23] ACKNOWLEDGEMENT - Check systemd state on ununpentium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn switching envoy config / port (dzahn) https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:39] (03PS2) 10Muehlenhoff: Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015) [07:11:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I don't think this is the solution to your problem, but this class will be used for tls termination between internal services as well. So " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534421 (owner: 10Dzahn) [07:17:51] (03CR) 10Jcrespo: "This is ok, but you are aware this is a noop, right?" 
[dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [07:19:44] PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107812 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [07:24:10] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) I have been talking to Jaime about this, and we might better wait for https://gerrit.wikimedia.org/r/#/c/operations/puppe... [07:24:36] (03CR) 10Marostegui: "> This is ok, but you are aware this is a noop, right?" [dns] - 10https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [07:31:36] !log ununpentium - removed /etc/envoy/envoy.yaml; ran /usr/local/sbin/build-envoy-config -c /etc/envoy to regenarate config without 443 listener; ran puppet; envoy now running on jessie [07:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:13] !log upgrading mw1293-mw1296, mw1299-mw1306 to PHP 7.2.22 [07:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:15] (03CR) 10Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534421 (owner: 10Dzahn) [07:35:30] 10Operations, 10Traffic: varnishreqstats sends truncated statsd traffic - https://phabricator.wikimedia.org/T212310 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Varnish doesn't send statsd any longer per {T220116}, resolving [07:39:38] (03PS1) 10Marostegui: production-m1: Remove puppet grants [puppet] - 10https://gerrit.wikimedia.org/r/534576 (https://phabricator.wikimedia.org/T231539) [07:40:04] PROBLEM - Disk space on 
restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107368 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [07:40:10] 10Operations, 10ops-codfw: ganeti2005 - mgmt interface stopped responding and reset fails - https://phabricator.wikimedia.org/T232067 (10Dzahn) p:05Triage→03Normal [07:41:46] RECOVERY - Check systemd state on ununpentium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:41:57] (03CR) 10Marostegui: [C: 03+2] production-m1: Remove puppet grants [puppet] - 10https://gerrit.wikimedia.org/r/534576 (https://phabricator.wikimedia.org/T231539) (owner: 10Marostegui) [07:45:10] !log Remove puppet grants from m1 for the following IPs: 10.64.0.165 10.64.16.159 10.64.16.18 T231539 [07:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:14] T231539: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 [07:47:16] 10Operations, 10DBA, 10Patch-For-Review: Drop puppet database from m1 - https://phabricator.wikimedia.org/T231539 (10Marostegui) The following grants have been dropped: ` root@db1135.eqiad.wmnet[(none)]> drop user if exists 'puppet'@'10.64.0.165'; Query OK, 0 rows affected (0.00 sec) root@db1135.eqiad.wmnet... 
[07:53:50] !log Remove old backups for db2037 and db2042 from dbprov2001 [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:13] !log Switching "wikidatawiki" on mwdebug1001 to 1.34.0-wmf.21 by editing /srv/mediawiki/wikiversions.php # T232035 [07:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:21] T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035 [07:56:29] 10Operations, 10vm-requests: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn) [07:56:56] <_joe_> !log uploading scap 3.12.1 to reprepro on all distros 224857 [07:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:44] <_joe_> !log upgrading scap on mwdebug1001 [07:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:51] (03PS1) 10Marostegui: wikireplica_dns: Replace dbproxy1010 with dbproxy1018 [puppet] - 10https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) [07:58:31] (03CR) 10Marostegui: [C: 04-2] "Wait for T231520#5467304" [puppet] - 10https://gerrit.wikimedia.org/r/534577 (https://phabricator.wikimedia.org/T231520) (owner: 10Marostegui) [08:02:58] 10Operations, 10vm-requests: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10MoritzMuehlenhoff) Looks good [08:07:09] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) a:05Marostegui→03None [08:09:40] !log depooling cp3034 due to intermittent network issues [08:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:17] PROBLEM - Disk space on restbase-dev1005 is CRITICAL: DISK CRITICAL - free space: /srv 107544 MB (3% inode=99%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [08:13:32] <_joe_> !log upgrading scap on deploy1001 [08:13:33] (03PS1) 10Hashar: Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) [08:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:44] going to upgrade wikidatawiki to 1.34.0-wmf.21 for termbox issue [08:13:46] which is a blocker [08:14:05] <_joe_> hashar: can you please hold? see what I wrote elsewhere [08:14:20] * hashar looks "elsewhere" [08:14:22] :D [08:15:29] _joe_: just let me sync wikiversions ;) that is fast enough!? [08:15:46] <_joe_> hashar: actually [08:15:54] <_joe_> wait 1 minute, we can test scap with that [08:16:00] sure [08:16:14] <_joe_> the only thing that could fail, it can fail on the mwdebugs and has no impact on the deploy [08:16:17] we tried to reproduce the termbox / wikidata api query timeout using mwdebug1001 but that is a dead end :-\ [08:16:45] !log reimage restbase-dev1004 to Stretch T224554 [08:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:48] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 [08:17:01] <_joe_> hashar: I'll tell you when to proceed [08:17:29] <_joe_> hashar: go on! 
[08:17:30] ok [08:17:32] merging merging [08:17:35] (03CR) 10Hashar: [C: 03+2] Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [08:18:41] (03Merged) 10jenkins-bot: Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [08:18:57] (03CR) 10jenkins-bot: Promote wikidata to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534578 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [08:19:50] _joe_: tarrow: and I am now promoting wikidatawiki again [08:20:00] <_joe_> ok [08:20:03] right ho [08:20:08] <_joe_> it seems like the right test to do [08:20:24] <_joe_> tarrow: did you time your requests to the api before the promotion? [08:20:37] <_joe_> having some benchmark can be interesting [08:20:58] we timed with curl (but from the deployment hosts) [08:21:35] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746 [08:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:38] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [08:21:39] T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035 [08:22:05] <_joe_> hashar: did scap complete? [08:22:25] yeah [08:22:43] <_joe_> any errors? [08:22:49] <_joe_> if not, that's great [08:23:01] (03CR) 10Gehel: [C: 04-1] Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [08:23:25] seem to be some 500's again [08:23:49] <_joe_> tarrow: can you see what's the url called on the api? 
[08:23:49] !log repooling cp3034 [08:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:08] <_joe_> that times out [08:24:34] yep [08:24:38] it's in each logstash entry [08:24:44] <_joe_> heh ok [08:24:50] (03PS2) 10Gehel: Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff) [08:24:54] <_joe_> did you try to call it now from deploy2001? [08:25:20] /termbox?preferredLanguages=de%7Cen&entity=Q1&editLink=%2Fedit%2FQ1347&language=de&revision=103 [08:26:21] <_joe_> this is the termbox url or the mw api url? [08:26:44] so yeah that is slow again from time to time [08:26:50] https://phabricator.wikimedia.org/T232035#5467373 [08:26:57] 750ms - 1000 ms [08:27:10] and sometime there is a timeout of some sort and that service checker is flagged critical with 3600ms run time [08:28:06] yeah [08:28:07] (03CR) 10Gehel: [C: 04-1] Pick a new canary for elastic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff) [08:28:14] <_joe_> hashar: I would see what changed in either [08:28:19] https://www.irccloud.com/pastebin/tzv3Qlcg/ [08:28:20] <_joe_> the mediawiki api response times [08:28:26] 6s to respond there [08:28:27] <_joe_> or in the content of the response [08:28:42] <_joe_> given termbox didn't change [08:28:54] <_joe_> something must have changed in terms of the backend calls it makes [08:29:11] <_joe_> the only way to properly debug the problem is to go look at the mw api :) [08:29:13] odd how it isn't consistent though [08:29:16] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10MoritzMuehlenhoff) 05Declined→03Open @ayounsi ; Could you doublecheck whether we have any remaining router/firewall rules for the IPs used by iron.wikimedia.... 
[08:29:35] I suspect it depends which server it is hitting [08:29:49] 0.08s or >6s [08:29:51] pure speculation, termbox does the query to the mediawiki API with a 3 seconds time out isn't it? [08:29:52] <_joe_> tarrow: could be for a number of reasons. Can you give me the exact url that is called on the mediawiki api? [08:29:55] so eventually it dies out [08:30:05] <_joe_> hashar: yes, that's what's happening [08:30:05] curl -w "@curl-format.txt" -H 'Host: www.wikidata.org' 'http://api-ro.discovery.wmnet/w/index.php?title=Special:EntityData&format=json&id=Q1&revision=103' [08:30:08] but maybe the query is still going on on the mediawiki API servers and we might see some timeout there [08:30:24] ignore the curl-format bit for logging the times if you like [08:30:24] <_joe_> tarrow: what's that curl-format.txt file? [08:30:27] tarrow: do you get logstash access? [08:30:31] <_joe_> ahah ok [08:30:36] !log rebooting cp3034 [08:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:50] https://www.irccloud.com/pastebin/jvGcwqwu/ [08:30:57] (03PS3) 10Gehel: Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff) [08:30:58] https://logstash.wikimedia.org/goto/af5df8a95de6a72c184ac1072a4fdb78 is errors for wikidatawiki [08:31:01] is the contents if you want to breakdown the slow bits [08:31:29] but that does not have much clues :-\ [08:31:35] (03PS3) 10Muehlenhoff: Decom iron [puppet] - 10https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505) [08:32:06] _joe_: is there some easy way to aim at different api servers? [08:32:07] !log depool restbase1022 T232007 [08:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:10] T232007: Restbase: significant increase of outbound dropped packets - https://phabricator.wikimedia.org/T232007 [08:32:15] e.g. 
see if it is the php7 ones that are slow [08:32:53] <_joe_> tarrow: responses to that request are super fast [08:32:57] <_joe_> on any server [08:33:15] well not from deploy2002 [08:33:21] <_joe_> tarrow: yes, you can s/api-ro.discovery/mw1347.eqiad/ [08:33:28] <_joe_> tarrow: uh interesting [08:33:38] yeah, sometime >6s [08:33:42] (03PS4) 10Muehlenhoff: Decom iron [puppet] - 10https://gerrit.wikimedia.org/r/531867 (https://phabricator.wikimedia.org/T220505) [08:33:44] other times 0.08 [08:34:10] <_joe_> tarrow: uhm never happened to me [08:34:12] <_joe_> but lemme check [08:34:41] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100% [08:34:46] I would say between 10 and 20% of requests [08:34:50] are slow [08:35:00] <_joe_> tarrow: not my experience, which is weird [08:35:06] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime [08:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:10] https://www.irccloud.com/pastebin/Nl7tAnFf/ [08:35:20] just run now [08:35:25] <_joe_> so it's clearly slower on php7 [08:35:29] <_joe_> takes 0.8 seconds [08:35:57] <_joe_> tarrow: can you please log the Server header too? [08:36:27] (03CR) 10Muehlenhoff: "Ack, sounds good." [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff) [08:36:28] <_joe_> uh wait a sec [08:36:36] <_joe_> why is api-ro going to codfw FFS [08:36:37] (03CR) 10Muehlenhoff: [C: 03+2] Pick a new canary for elastic [puppet] - 10https://gerrit.wikimedia.org/r/534461 (owner: 10Muehlenhoff) [08:36:53] 10Operations, 10observability, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105 (10fgiunchedi) I was re-reading this task and {T175087} in the context of progressively moving away from statsd/graphite (T205870), some technical thoughts on ho... 
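Editorial note: the figures quoted above (responses at either about 0.08 s or more than 6 s, with roughly 10 to 20% of requests slow, against a client timeout of about 3 seconds) can be checked with a few lines of code. This is a minimal sketch, not anything run in the channel; the function name `slow_fraction`, the sample values, and the 3 s default are illustrative assumptions.

```python
"""Sketch: what share of timed requests would exceed a client-side timeout?"""


def slow_fraction(durations_s, timeout_s=3.0):
    """Return the fraction of samples slower than timeout_s (0.0 if empty)."""
    if not durations_s:
        return 0.0
    slow = sum(1 for d in durations_s if d > timeout_s)
    return slow / len(durations_s)


# Hypothetical samples shaped like the figures quoted in the channel:
# most responses around 0.08 s, a few above 6 s. These are not real data.
samples = [0.08, 0.08, 0.09, 6.2, 0.08, 0.08, 0.1, 0.08, 6.4, 0.08]

if __name__ == "__main__":
    print(f"{slow_fraction(samples):.0%} of requests exceed the timeout")
```

A bimodal result like this (very fast or very slow, nothing in between) tends to point at routing or cache state rather than at uniformly slow code.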
[08:36:54] <_joe_> tarrow: argh, I might know what's up [08:36:54] https://www.irccloud.com/pastebin/QM790zsh/ [08:37:01] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:11] <_joe_> indeed... [08:37:32] :) [08:37:43] * tarrow is not sure what he's looking at [08:37:53] are we active active for api-ro? [08:37:58] * hashar watches black magic going on [08:38:03] <_joe_> tarrow: we shouldn't be [08:38:12] <_joe_> so wait a couple minutes [08:38:47] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=a.*-ro,name=codfw [08:38:48] okay :) [08:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:57] how have you managed to add extra logs (time_connect, time_appconnect etc) to the query output? [08:39:33] https://stackoverflow.com/questions/18215389/how-do-i-measure-request-and-response-times-at-once-using-curl [08:39:41] using that [08:40:07] <_joe_> ok tarrow try now? [08:40:14] <_joe_> I bet you won't see issues now [08:40:35] <_joe_> in terms of response times [08:40:55] <_joe_> but still, why is termbox consistently alerting is beyond explanation with "some requests are slow" [08:41:00] tarrow: nice hack :] [08:41:22] <_joe_> anyways if you can confirm you don't see those timeouts anymore [08:41:55] seems it is constantly taking 750ms now [08:42:06] (well 750 - 820ms) [08:42:10] <_joe_> hashar: what are you doing? [08:42:14] I cant tell how fast it was before [08:42:20] deploy1001:~$ time service-checker-swagger -t 15 termbox.svc.codfw.wmnet http://termbox.svc.codfw.wmnet:3030 [08:42:26] <_joe_> oh ok [08:42:31] <_joe_> so not the single curl [08:42:51] so what black magic happened? [08:43:01] seems better to me :) [08:43:09] <_joe_> I made requests to api-ro go to the live dc [08:43:18] <_joe_> instead than to the inactive one [08:43:27] so what black magic happened? 
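Editorial note: the per-phase timings asked about above (time_connect, time_appconnect and so on) come from curl's `--write-out` option, as in the linked Stack Overflow answer. The contents of the pastebinned curl-format.txt are not shown in this log, so the variable list below is an assumption; the variable names themselves are real curl write-out variables. A minimal sketch that builds such a template and parses the result:

```python
"""Sketch: time one HTTP request phase by phase with curl's --write-out."""
import subprocess

# Real curl --write-out variables. Which of them the curl-format.txt
# from the channel contained is not shown in the log (assumption).
TIMING_VARS = [
    "time_namelookup",
    "time_connect",
    "time_appconnect",
    "time_pretransfer",
    "time_starttransfer",
    "time_total",
]


def build_format(variables=TIMING_VARS):
    """Return a --write-out template with one 'name: value' line per variable.

    The trailing backslash-n is left literal; curl expands it itself.
    """
    return "".join(f"{name}: %{{{name}}}\\n" for name in variables)


def parse_timings(output):
    """Parse 'name: value' lines into a dict of floats (seconds)."""
    timings = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in TIMING_VARS:
            timings[name.strip()] = float(value.strip())
    return timings


def time_request(url, host_header=None):
    """Run curl once, discard the body, and return the parsed timings."""
    cmd = ["curl", "-s", "-o", "/dev/null", "-w", build_format()]
    if host_header:
        cmd += ["-H", f"Host: {host_header}"]
    cmd.append(url)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_timings(result.stdout)
```

Calling `time_request(url, host_header="www.wikidata.org")` with the api-ro URL pasted earlier would mirror the command used in the channel; it needs curl installed and network access to that internal host.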
[08:43:29] sorry [08:43:31] <_joe_> how they were incorrectly configured right now [08:43:40] so potentially we got timeout because some kind of caches are cold on the inactive DC? [08:43:55] <_joe_> the php caches and the db caches and memcached, yes [08:43:57] <_joe_> all of them [08:43:59] not sure why it worked fine with 1.34.0-wmf.20 though!:^/ [08:44:13] <_joe_> that is something I don't know either [08:44:32] RECOVERY - Disk space on restbase-dev1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase-dev1005&var-datasource=eqiad+prometheus/ops [08:44:32] did we slowly warm the cache for that request [08:44:34] maybe due to some changes to the mediawiki/core cache system [08:44:35] <_joe_> and I'd urge you to rollback and retest the speed of that specific endpoint if you want to verify if there was an issue [08:44:41] yeah [08:44:47] that is what I had in mind [08:44:48] and then bumping meant it was cooled [08:44:52] <_joe_> of perf degradation [08:45:06] <_joe_> tarrow: nope, what happened is I made you do your requests cross-dc [08:45:28] I mean why 20 worked fine [08:45:33] <_joe_> instead of going to the damn same dc, which shouldn't have happened, but we screwed up [08:45:51] <_joe_> tarrow: it would've been a problem when we switched to .20 as well though [08:45:56] <_joe_> the cold caches [08:46:02] <_joe_> so something definitely changed [08:46:06] maybe it was and no-one noticed?? [08:46:12] <_joe_> if it's not a perf degradation, it's ok [08:46:18] <_joe_> tarrow: oh that is very possible too [08:46:22] RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 0%, RTA = 83.41 ms [08:46:22] so rollback to 1.34.0-wmf.20 , measure [08:46:29] and then back to .21 right? 
[08:46:51] (possibly we can also look at the Icinga probe for termbox last week and see whether it alarmed) [08:47:42] (03PS1) 10Hashar: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) [08:47:46] tarrow: _joe_ ^^? [08:48:06] <_joe_> hashar: +1 from me [08:48:09] Er... yes... I guess [08:48:23] but just be clear *what* exactly are we measuring? [08:48:36] the time for that response? [08:48:44] the service response using service-checker-swagger [08:48:49] got it :) [08:48:52] that is the only thing that I noticed slowing down [08:49:00] (since the queries to index.php worked just fine apparently) [08:49:41] hmm [08:49:54] https://logstash.wikimedia.org/goto/336ca058ad55c63297324fd699fd7b83 [08:50:02] Icinga alerts for termbox over 15 days [08:50:52] looks like it times with the train [08:51:00] <_joe_> yeah looks like it [08:51:05] <_joe_> but please confirm [08:51:19] <_joe_> if that's the case, we have a satisfying explanation of the problems we've seen [08:52:37] (03CR) 10Hashar: [C: 03+2] "Move fingers doing black magic." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [08:52:48] _joe_: tarrow: huge thank you to both you for the help :] [08:53:31] after that I will have to escape [08:53:32] (03Merged) 10jenkins-bot: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [08:53:36] (03PS1) 10Giuseppe Lavagetto: scap: restart php-fpm if needed when doing a full deploy [puppet] - 10https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857) [08:53:43] Thank you! 
[08:53:48] (03CR) 10jenkins-bot: Rollback wikidata to 1.34.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534583 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [08:54:33] (03CR) 10jerkins-bot: [V: 04-1] scap: restart php-fpm if needed when doing a full deploy [puppet] - 10https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [08:55:07] and not sure whether it is useful, but https://grafana.wikimedia.org/d/AJf0z_7Wz/termbox?refresh=1m&orgId=1 might need some metrics about the queries latency [08:55:25] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Rollback wikidatawiki to 1.34.0-wmf.20 for T232035 [08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:29] also [08:55:36] T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035 [08:55:38] I don't get why there are like 140 requests per seconds [08:55:41] but barely any errors reported [08:55:52] <_joe_> eqiad doesn't have errors? [08:55:54] hashar: latency in more detail than what is there? 
[08:56:02] <_joe_> and termbox is called by mediawiki which is active in eqiad [08:56:07] OH [08:56:13] <_joe_> so all requests are going to termbox/eqiad [08:56:25] <_joe_> tarrow: he means telemetry for calls to the backend [08:56:42] <_joe_> hashar: telemetry for backend calls and full tracing are coming(TM) [08:56:43] there is the Datacenter selector at the top of the graph bah [08:56:48] so yeah on codfw that shows some errors [08:58:21] ok back to 1.34.0-wmf.21 [08:59:27] (03PS1) 10Hashar: Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) [08:59:48] sometimes I have the feeling I am just wasting my (and everyone else's) time :-\ [08:59:55] 10Operations, 10vm-requests: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn) [08:59:56] Did you collect the stats you wanted? [08:59:57] 10Operations: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10Dzahn) [09:00:00] (03CR) 10Hashar: [C: 03+2] Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [09:00:07] hashar: that is 100% not true [09:00:20] doesn't change my feeling about it hehe [09:00:21] ;-] [09:00:29] or rather you might have that feeling but you aren't wasting anyone's time [09:00:38] :P [09:00:43] I never know whether i am just too obsessive / too strict [09:00:47] cool [09:00:49] thank you ! 
[09:00:56] so claiming it as fixed [09:00:57] (03Merged) 10jenkins-bot: Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [09:01:01] woo [09:01:02] blaming cold caches in codfw [09:01:14] (03CR) 10jenkins-bot: Wikidata back to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534585 (https://phabricator.wikimedia.org/T232035) (owner: 10Hashar) [09:01:16] with the root cause undetermined but blaming mediawiki/core change of behavior [09:01:47] <_joe_> hashar: I think it happened at each train release on wikidata [09:01:53] <_joe_> you were just the first one to notice [09:02:24] ah [09:04:01] !log rolling back from ats-tls to nginx on cp3034 - T231433 [09:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:04] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [09:04:37] hashar: sorry, were you talking about mw changetags before? [09:04:59] I think I read someone mentioning them, but cannot remember who [09:05:08] <_joe_> vgutierrez: I see issues on upload in esams, transient. Was that you doing things on cp3034? [09:05:22] <_joe_> jynus: no I think it was tim [09:05:23] (03PS1) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to port 443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534586 [09:05:26] <_joe_> related to php7? [09:05:27] tarrow: _joe_ I have marked the termbox issue fixed. 
Thank you very much [09:05:39] jynus: wasn't me sorry :-^ [09:05:52] _joe_: yeah, I've migrated this morning from nginx to ats-tls, I'm about to rollback [09:05:54] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Promote wikidatawiki to 1.34.0-wmf.21 for T232035 - T220746 [09:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:58] T220746: 1.34.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T220746 [09:05:58] T232035: 1.34.0-wmf.21 cause termbox to emit: Test get rendered termbox returned the unexpected status 500 - https://phabricator.wikimedia.org/T232035 [09:07:11] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move ats-tls from port 8443 to port 443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534586 (owner: 10Vgutierrez) [09:07:24] (03PS2) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to port 443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534586 [09:08:38] (03PS2) 10Giuseppe Lavagetto: scap: restart php-fpm if needed when doing a full deploy [puppet] - 10https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857) [09:08:52] I am off for now be back this afternoon [09:11:37] (03PS1) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to port 4443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534587 [09:11:55] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move nginx from port 443 to port 4443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534587 (owner: 10Vgutierrez) [09:12:37] (03PS2) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to port 4443 on cp3034" [puppet] - 10https://gerrit.wikimedia.org/r/534587 [09:14:36] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp3034 is CRITICAL: connect to address 10.20.0.169 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:14:47] that's expected [09:14:54] PROBLEM - Ensure traffic_exporter binds 
on port 9322 and responds to HTTP requests on cp3034 is CRITICAL: connect to address 10.20.0.169 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:02] PROBLEM - Check systemd state on cp3034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:15:12] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:15:34] PROBLEM - HTTPS Unified RSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:15:34] PROBLEM - HTTPS Unified ECDSA on cp3034 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [09:16:08] (03PS1) 10Dzahn: add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077) [09:16:30] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3034 is OK: HTTP OK: HTTP/1.0 200 OK - 19048 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:16:33] (03CR) 10jerkins-bot: [V: 04-1] add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn) [09:16:38] RECOVERY - Check systemd state on cp3034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:48] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3034 is OK: PROCS OK: 1 process with args /usr/bin/python3 
/usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:17:00] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:17:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:33] RECOVERY - HTTPS Unified ECDSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345098 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:19:33] RECOVERY - HTTPS Unified RSA on cp3034 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345098 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 77 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:23:29] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10MoritzMuehlenhoff) restbase-dev1004 has been reinstalled as Stretch. @Eevans, you can bootstrap 1004 in Cassandr... 
[09:25:28] (03PS2) 10Dzahn: add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077) [09:25:30] (03PS1) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534589 [09:25:44] (03PS2) 10Vgutierrez: Revert "hiera: Move ats-tls from port 8443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534589 [09:25:52] 10Operations, 10vm-requests, 10Patch-For-Review: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn) p:05Triage→03High [09:25:53] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move ats-tls from port 8443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534589 (owner: 10Vgutierrez) [09:25:57] 10Operations, 10vm-requests, 10Patch-For-Review: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 (10Dzahn) a:03Dzahn [09:26:31] !log rolling back from ats-tls to nginx on cp1076 - T231433 [09:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:34] T231433: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 [09:29:21] check the Phabricator and Gerrit contributions of a user - new tool https://tools.wmflabs.org/wikicontrib/ [09:30:02] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [09:32:15] (03PS1) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534590 [09:32:25] (03PS2) 10Vgutierrez: Revert "hiera: Move nginx from port 443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534590 [09:32:34] (03CR) 10Vgutierrez: [C: 03+2] Revert "hiera: Move nginx from port 443 to 4443 on cp1076" [puppet] - 10https://gerrit.wikimedia.org/r/534590 (owner: 10Vgutierrez) 
[09:33:24] (03CR) 10Dzahn: [C: 03+2] add moscovium.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/534588 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn) [09:39:25] !log ganeti1001 - creating VM moscovium (T232077) [09:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:28] T232077: vm request for RT to replace ununpentium - https://phabricator.wikimedia.org/T232077 [09:40:08] 10Operations: Conffile handling for PHP 7.2 packages - https://phabricator.wikimedia.org/T231881 (10MoritzMuehlenhoff) I tracked this down: Our way of integrating the puppetised php.ini files is working fine and as expected. The current round of conffile prompts is triggered by an upstream change between 7.2.16... [09:43:19] (03CR) 10Dzahn: "thanks !:)" [labs/private] - 10https://gerrit.wikimedia.org/r/534275 (owner: 10Dzahn) [09:44:16] 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10Marostegui) Is this good to be closed? 
[09:45:58] (03PS1) 10Muehlenhoff: Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503) [09:47:49] (03CR) 10Marostegui: [C: 03+1] Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff) [09:54:04] (03PS1) 10Dzahn: releases: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411) [09:59:29] (03PS2) 10Muehlenhoff: Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503) [10:01:00] (03CR) 10Muehlenhoff: [C: 03+2] Remove neodymium/sarin from MySQL root clients [puppet] - 10https://gerrit.wikimedia.org/r/534591 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff) [10:03:18] (03PS2) 10Dzahn: releases: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411) [10:03:20] (03PS1) 10Dzahn: install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077) [10:07:36] (03PS2) 10Dzahn: install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077) [10:07:52] (03CR) 10Dzahn: [C: 04-2] install_server: add moscovium to DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/534595 (https://phabricator.wikimedia.org/T232077) (owner: 10Dzahn) [10:14:32] (03PS1) 10Dzahn: webperf: add envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/534597 (https://phabricator.wikimedia.org/T210411) [10:18:03] (03PS1) 10Muehlenhoff: Decomission sarin [puppet] - 10https://gerrit.wikimedia.org/r/534598 (https://phabricator.wikimedia.org/T220504) [10:20:57] (03CR) 10Muehlenhoff: [C: 03+2] 
Decomission sarin [puppet] - 10https://gerrit.wikimedia.org/r/534598 (https://phabricator.wikimedia.org/T220504) (owner: 10Muehlenhoff) [10:21:50] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) [10:22:48] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH This is ready to be decommissioned now. [10:25:36] !log upgrading mw1238-mw1258 to PHP 7.2.22 [10:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:04] <_joe_> !log upgrading scap across the fleet T224857 [10:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:07] T224857: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 [10:31:08] !log upgrading mw1319-mw1333 to PHP 7.2.22 [10:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:02] <_joe_> moritzm: can you make debdeploy use a special script to perform a service restart? [10:32:13] <_joe_> or it defaults to systemctl restart? [10:33:39] it doesn't handle restarts (apart from the restarts triggered by postinst scripts automatically), all restarts need to be done with Cumin or cook books [10:33:48] <_joe_> ok [10:34:11] <_joe_> I was confused by the "non-daemon update, no service restart needed" [10:34:39] <_joe_> can I update scap across the clusters now? [10:34:42] the initial debdeploy version based on salt had restart support, but it turned out to be better handled outside of debdeploy, so wasn't ported over to the cumin version [10:34:54] <_joe_> ack! [10:35:08] <_joe_> we need to amend the service restarts page btw [10:35:31] for envoy? 
[10:35:38] <_joe_> not only [10:35:49] <_joe_> but for the work I've done on safe restarts [10:36:00] ah yes [10:37:43] BTW, one thing we still might miss in the cook books for mw restarts is to wait for ffmpeg on video scalers, some of the video scaler jobs can take quite a bit, so for reboots I've always doublechecked with cumin that no remaining ffmpeg processes are around [10:38:13] OTOH if the tmh code has retry logic we can probably also simply rely on that, not sure [10:38:14] <_joe_> heh that's going to be a huge problem with php7 btw [10:38:30] <_joe_> given it is restarted regularly for opcache reasons [10:38:38] <_joe_> but yeah, it has a retry logic [10:38:38] yeah, I had been wondering about that :-) [10:38:51] <_joe_> and we should *really* have a separate service for videoscaling [10:39:06] agreed [10:40:35] <_joe_> does any FLOSS video encoding service exist that we could use to this end? [10:44:12] not sure if anything existing exists, Brion probably knows best [10:46:52] !log upgrading mw1221-mw1335 to PHP 7.2.22 [10:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:36] FYI I'm going to delete a bunch of puppetdb spammy metrics from prometheus eqiad, T228395 [10:48:37] T228395: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 [10:49:25] !log filippo@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=prometheus1004.eqiad.wmnet [10:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:42] <_joe_> heh about spammy metrics, I guess we'll need to create filters when we start collecting envoy metrics [10:50:38] does envoy spam metrics by default? [10:50:45] or rather, emit spammy metrics? 
[10:53:34] !log temporarily enable prometheus admin web api in prometheus@ops in eqiad to delete spammy metrics - T228395 [10:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1100). [11:00:05] kostajh: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:26] \o [11:03:14] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:03:22] expected ^ [11:06:43] (03PS3) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [11:11:56] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:12:08] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:12:18] Anyone around for SWAT? 
The two patches will take a while to get through CI [11:16:55] Amir1 / Urbanecm ^ [11:18:29] kostajh: I can SWAT [11:18:38] Thx dcausse [11:31:22] (03PS1) 10Muehlenhoff: Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 [11:32:16] (03CR) 10jerkins-bot: [V: 04-1] Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 (owner: 10Muehlenhoff) [11:38:54] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:39:06] (03PS2) 10Muehlenhoff: Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 (https://phabricator.wikimedia.org/T220503) [11:39:16] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:48:31] !log dcausse@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/CirrusSearch/: T159321: Add morelikethis a non-greedy version of the morelike keyword (duration: 00m 59s) [11:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:35] T159321: [Bug] Unpredictable behavior with the order of Special:Search parameters - https://phabricator.wikimedia.org/T159321 [11:48:49] kostajh: it's live ^ [11:49:25] kostajh: sorry, at meetings:( [11:49:27] dcausse: lovely, thanks [11:50:28] !log EU swat done [11:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:20] (03CR) 10Muehlenhoff: [C: 03+2] Decommission neodymium [puppet] - 10https://gerrit.wikimedia.org/r/534600 (https://phabricator.wikimedia.org/T220503) (owner: 10Muehlenhoff) [11:52:39] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) [11:53:00] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH This is ready for decom [11:56:38] (03PS2) 10Mathew.onipe: elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398 [11:56:41] (03PS5) 10Mathew.onipe: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) [11:56:43] (03PS3) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) [11:57:07] (03CR) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline (032 comments) [puppet] - 
10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [11:57:29] !log upgrading remaining job runners to PHP 7.2.22 [11:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:52] (03CR) 10Mathew.onipe: elasticsearch: logging.yml template is ensure=absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe) [11:58:52] (03CR) 10Mathew.onipe: elasticsearch: add syslog logging option (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:02:05] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [12:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:48] (03CR) 10Mathew.onipe: "PCC output is expected: https://puppet-compiler.wmflabs.org/compiler1002/18186/" [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [12:13:56] !log upgrading mw1284-mw1290 to PHP 7.2.22 [12:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] lvs: add restbase-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [12:37:23] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: only enable tideways/mongodb where needed [puppet] - 10https://gerrit.wikimedia.org/r/534405 [12:42:14] (03CR) 10Marostegui: "Moritzm, keep in mind that the proxies are accessed by all the tools that want to connect to the services labsdb services (web and analyti" [puppet] - 10https://gerrit.wikimedia.org/r/531670 (owner: 10Muehlenhoff) [12:47:37] !log @ helmfile [STAGING] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . 
[12:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: only enable tideways/mongodb where needed [puppet] - 10https://gerrit.wikimedia.org/r/534405 (owner: 10Giuseppe Lavagetto) [12:59:42] o/ [13:00:04] hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - European version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1300). [13:00:56] (03PS1) 10Hashar: all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 [13:00:58] (03CR) 10Hashar: [C: 03+2] all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 (owner: 10Hashar) [13:00:58] wish me luck [13:01:12] good luck… [13:02:37] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 (owner: 10Hashar) [13:02:55] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534604 (owner: 10Hashar) [13:03:43] apaches syncing [13:04:25] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.21 [13:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:36] hmm uneventful [13:15:32] (03PS3) 10Muehlenhoff: Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015) [13:15:36] 10Operations, 10Puppet, 10User-fgiunchedi: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (10fgiunchedi) prometheus1004 completed, with this process: ` # depool # stop puppet # add --web.enable-admin-api to /lib/systemd/system/prometheus@ops.service systemctl daemo... 
[13:15:49] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) ping @Groceryheist I don't know ryanmax's phab id, so I will email him. [13:17:41] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1004.eqiad.wmnet [13:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:00] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:19:20] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [13:21:12] that's me ^ [13:21:20] there will others for prometheus1003 shortly [13:21:47] !log filippo@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=prometheus1003.eqiad.wmnet [13:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:51] <_joe_> and those alerts are better not acknowledged AIUI [13:22:24] (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entries for new Buster-based LDAP/corp replicas [dns] - 10https://gerrit.wikimedia.org/r/534432 (https://phabricator.wikimedia.org/T231015) (owner: 10Muehlenhoff) [13:23:29] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) Give that this is likely to impact other users can we temporarily compress that directory ( (/home/ryanmax) to make up space? 
[13:23:36] yeah, also they'll auto resolve [13:26:51] (03PS1) 10Muehlenhoff: Add partman config for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/534609 [13:30:17] (03PS1) 10Giuseppe Lavagetto: Make one user out of 3 use php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) [13:30:58] <_joe_> Reedy: ^^ :P [13:31:15] :D [13:31:24] (03CR) 10Reedy: [C: 03+1] Make one user out of 3 use php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [13:33:06] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [13:34:00] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:34:22] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) I deleted a little bit from my home dir, so we have a little bit of room for a bit. I'll give them a little time to respond. [13:35:05] (03CR) 10Arlolra: "> Patch Set 2: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/534271 (https://phabricator.wikimedia.org/T229356) (owner: 10Dzahn) [13:35:51] train looks fine to me so far [13:37:09] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . 
[13:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:04] RECOVERY - Disk space on notebook1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1004&var-datasource=eqiad+prometheus/ops [13:39:52] !log upgrading remaining API servers to PHP 7.2.22 [13:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:58] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:41:28] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [13:45:07] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10RyanSteinberg) I just deleted some files and I'm compressing others. I didn't realize space was so tight ... my apologies. [13:49:15] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Ottomata) The Jupyter Notebook servers are meant mostly to be an GUI/Cli interface to Hadoop based systems. If you can, please consider storing data in HDFS. [13:51:27] 10Operations, 10DBA, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Agusbou2015) Will enwiki the only wiki affected to this failover? 
[13:52:18] (03PS1) 10Alexandros Kosiaris: sessionstore: Bump again memory limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/534613 (https://phabricator.wikimedia.org/T229697) [13:52:37] 10Operations, 10DBA, 10Patch-For-Review, 10User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (10Marostegui) >>! In T231403#5468025, @Agusbou2015 wrote: > Will enwiki the only wiki affected to this failover? enwiki w... [13:52:46] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] sessionstore: Bump again memory limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/534613 (https://phabricator.wikimedia.org/T229697) (owner: 10Alexandros Kosiaris) [13:54:31] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [13:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:38] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.85% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:54:40] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:55:24] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [14:01:48] 10Operations, 10Analytics, 10Analytics-Cluster: notebook1004 - /srv is full - https://phabricator.wikimedia.org/T232068 (10Nuria) @RyanSteinberg + 1 to andrew's suggestion. data should not be kept on notebook servers, rather you can keep it on your user database in hadoop. This is due to space concerns in no... [14:11:02] !log restarted swiftrepl on ms-fe1005 T231110 [14:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:06] T231110: bring swiftrepl back to life - https://phabricator.wikimedia.org/T231110 [14:13:48] (03PS1) 10CDanis: swiftrepl: fix missing local variable assignment [software] - 10https://gerrit.wikimedia.org/r/534621 [14:14:00] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [14:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:12] (03PS4) 10Ema: lvs: add restbase-ssl [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) [14:15:08] !log @ helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [14:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:15] (03CR) 10Ayounsi: "I'd recommend putting `include ::profile::base::firewall` in the role instead of in the profile." [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov) [14:16:35] (03CR) 10Ema: lvs: add restbase-ssl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:19:56] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) >>! 
In T224554#5467470, @MoritzMuehlenhoff wrote: > restbase-dev1004 has been reinstalled as Stretch. @E... [14:21:29] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:21:29] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [14:24:53] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [14:25:01] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:43] (03PS2) 10Giuseppe Lavagetto: restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - 10https://gerrit.wikimedia.org/r/534445 [14:30:59] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1003.eqiad.wmnet [14:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:06] 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10mark) Hi Anusha, Greg, Looking into this. Unfortunately it seems the way this is being implemented, we would effectively be signing away complete control of our email security settings/policy for //wikimedi... 
[14:32:25] 10Operations, 10WMF-Communications: Updating DNS records - https://phabricator.wikimedia.org/T231387 (10mark) 05Stalled→03Open [14:32:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:35:08] (03CR) 10CDanis: [C: 03+2] swiftrepl: fix missing local variable assignment [software] - 10https://gerrit.wikimedia.org/r/534621 (owner: 10CDanis) [14:35:15] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:37] (03Merged) 10jenkins-bot: swiftrepl: fix missing local variable assignment [software] - 10https://gerrit.wikimedia.org/r/534621 (owner: 10CDanis) [14:39:39] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:55] !log remove iron from mr* routers - T231811 [14:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:58] T231811: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 [14:43:36] (03PS5) 10Ema: restbase: TLS termination with envoy on port 7443 [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) [14:44:03] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [14:45:43] 10Operations, 10netops: Check router ACLs for early install SSH access from puppet masters/cumin hosts - https://phabricator.wikimedia.org/T231811 (10ayounsi) 05Open→03Resolved Done! [14:46:00] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10Eevans) >>! In T224554#5467470, @MoritzMuehlenhoff wrote: > restbase-dev1004 has been reinstalled as Stretch. @E... [14:48:10] (03CR) 10Ema: [C: 03+2] restbase: TLS termination with envoy on port 7443 [puppet] - 10https://gerrit.wikimedia.org/r/533028 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [14:50:50] !log restbase2009: depool and add TLS termination w/ envoy -- https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/533028/ T210411 [14:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:57] (03PS1) 10Herron: kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625 [14:50:57] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 [14:50:59] (03PS3) 10CRusnov: netbox: Add netbox* hosts to acmechief. [puppet] - 10https://gerrit.wikimedia.org/r/534490 [14:52:55] (03CR) 10CRusnov: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov) [14:53:03] (03PS4) 10CRusnov: netbox: Add netbox* hosts to acmechief. [puppet] - 10https://gerrit.wikimedia.org/r/534490 [14:53:05] (03CR) 10Ayounsi: [C: 03+1] "lgtm, don't forget to cleanup the old ones when not necessary anymore." 
[puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov) [14:53:52] (03PS2) 10Herron: kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625 [14:53:54] (03CR) 10CRusnov: [C: 03+2] netbox: Add netbox* hosts to acmechief. [puppet] - 10https://gerrit.wikimedia.org/r/534490 (owner: 10CRusnov) [14:54:27] !log restbase2009: repool after successful envoy deployment T210411 [14:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:02] (03PS3) 10Herron: kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625 [14:55:59] (03CR) 10Herron: [C: 03+2] kafka-main1001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/534625 (owner: 10Herron) [14:57:30] akosiaris: restbase1022 has puppet disabled since a few hours, is that intentional or can we re-enable? [14:57:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:02:36] 10Operations, 10Puppet, 10User-fgiunchedi: puppetdb prometheus metrics per-host metrics - https://phabricator.wikimedia.org/T228395 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Both prometheus1003 and prometheus1004 have been cleaned and repooled, resolving. @EBernhardson please give the web ui anot... 
[15:20:27] 10Operations, 10Traffic: Track TLS related ATS metrics in prometheus - https://phabricator.wikimedia.org/T231286 (10ema) p:05Triage→03Normal [15:20:44] 10Operations, 10Acme-chief, 10Traffic: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users - https://phabricator.wikimedia.org/T230687 (10ema) p:05Triage→03Normal [15:22:41] 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10RobH) 05Open→03Resolved I went ahead and pulled the CyrusOne report for this month, and humidity seems to be in the 50% range. It started high, but seems CyrusOne rebalanced and now its back to normal. {F302... [15:23:13] !log beginning replacement of kafka1001 with kafka-main1001 T225005 [15:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:17] T225005: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 [15:25:32] (03PS2) 10Herron: kafka-main: replace kafka1001 hardware with kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/528271 (https://phabricator.wikimedia.org/T225005) [15:25:32] ACKNOWLEDGEMENT - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans T224554 - The acknowledgement expires at: 2019-09-09 15:24:59. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:40] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:26:02] (03PS2) 10Ottomata: Switch all events to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [15:26:56] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:27:00] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka1001 hardware with kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/528271 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:54:33] !log jynus@deploy1001 Synchronized private/PrivateSettings.php: updating cli password (duration: 00m 47s) [15:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:21] !log restarting batch processes on mwmaint1002 T232106 [15:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] godog and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:01:09] PROBLEM - traffic_server tls process restarted on cp5001 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls [16:02:40] (03PS1) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005) [16:04:03] (03CR) 10Ottomata: [C: 03+2] Switch all events to eventgate. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [16:04:34] !log switching remaining job queue events (and all remaining events) to eventgate - T228705 [16:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:42] T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705 [16:05:45] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch all events to eventgate - T228705 (duration: 00m 48s) [16:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:43] (03CR) 10jenkins-bot: Switch all events to eventgate. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534506 (https://phabricator.wikimedia.org/T228705) (owner: 10Ppchelko) [16:07:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:09:17] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) Next steps? [16:22:34] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch all events to eventgate - T228705 - take 2 (duration: 00m 49s) [16:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:46] T228705: Migrate JobQueue to eventgate - https://phabricator.wikimedia.org/T228705 [16:28:21] (03PS1) 10Ppchelko: Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T211248) [16:29:30] (03PS2) 10Ppchelko: Remove EventBusRCFeedEngine eventServiceName. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/534236 (https://phabricator.wikimedia.org/T229863) [16:33:08] (03PS2) 10Ppchelko: Remove references to eventlogging-service. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534637 (https://phabricator.wikimedia.org/T232122) [16:33:24] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:38:49] (03PS1) 10CRusnov: netbox: Add dhparam [puppet] - 10https://gerrit.wikimedia.org/r/534640 [16:42:48] (03CR) 10CRusnov: [C: 03+2] "Uncontroversial change." [puppet] - 10https://gerrit.wikimedia.org/r/534640 (owner: 10CRusnov) [16:47:07] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:53:26] (03PS3) 10Gehel: elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe) [16:54:16] (03CR) 10Gehel: [C: 03+2] elasticsearch: logging.yml template is ensure=absent [puppet] - 10https://gerrit.wikimedia.org/r/534398 (owner: 10Mathew.onipe) [16:57:46] (03PS6) 10Gehel: elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1700). 
[17:00:07] (03CR) 10Gehel: [C: 03+2] elasticsearch: add syslog logging option [puppet] - 10https://gerrit.wikimedia.org/r/533928 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [17:12:29] (03PS3) 10Bstorm: toolforge: add CORS header to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [17:12:44] 10Operations, 10ops-codfw: Decommission old mw2231/WMF6435 - https://phabricator.wikimedia.org/T232126 (10Papaul) [17:14:47] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 replaced with WMF6403 - https://phabricator.wikimedia.org/T200209 (10Papaul) [17:15:48] 10Operations, 10ops-codfw, 10decommission, 10observability, 10Patch-For-Review: Decom graphite2001/WMF6160 - https://phabricator.wikimedia.org/T200209 (10Papaul) [17:16:07] 10Operations, 10ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (10Papaul) p:05Triage→03Normal [17:16:37] (03PS4) 10Mathew.onipe: elasticsearch: switch relforge to new logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) [17:16:39] (03PS1) 10Mathew.onipe: elasticsearch: fix syntax error in logging config [puppet] - 10https://gerrit.wikimedia.org/r/534645 [17:16:46] (03CR) 10Bstorm: [C: 03+2] toolforge: add CORS header to docker-registry [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [17:18:00] (03PS2) 10Gehel: elasticsearch: fix syntax error in logging config [puppet] - 10https://gerrit.wikimedia.org/r/534645 (owner: 10Mathew.onipe) [17:19:12] (03CR) 10Herron: [C: 03+2] kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [17:19:22] (03PS2) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 
(https://phabricator.wikimedia.org/T225005) [17:19:26] (03CR) 10Gehel: [C: 03+2] elasticsearch: fix syntax error in logging config [puppet] - 10https://gerrit.wikimedia.org/r/534645 (owner: 10Mathew.onipe) [17:20:33] (03PS3) 10Herron: kafka-main: move kafka1001 to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/534634 (https://phabricator.wikimedia.org/T225005) [17:21:10] (03CR) 10Mathew.onipe: "change is only applied on relforge: https://puppet-compiler.wmflabs.org/compiler1002/18189/" [puppet] - 10https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [17:21:16] 10Operations, 10ops-codfw: ganeti2005 - mgmt interface stopped responding and reset fails - https://phabricator.wikimedia.org/T232067 (10Papaul) a:03Papaul [17:29:45] (03PS1) 10Herron: Revert "kafka-main1001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/534646 [17:30:02] (03PS2) 10Herron: Revert "kafka-main1001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/534646 [17:31:29] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: test deploy for netbox split [17:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:28] (03CR) 10Herron: [C: 03+2] Revert "kafka-main1001: disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/534646 (owner: 10Herron) [17:33:27] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [18:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:10:08] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: test deploy for netbox split (duration: 38m 39s) [18:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:36] (03PS17) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [18:16:39] (03PS1) 10Bstorm: Revert "toolforge: add CORS header to docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/534648 [18:17:52] (03PS2) 10Bstorm: Revert "toolforge: add CORS header to docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/534648 [18:18:04] (03CR) 10Bstorm: [V: 03+2 C: 03+2] Revert "toolforge: add CORS header to docker-registry" [puppet] - 10https://gerrit.wikimedia.org/r/534648 (owner: 10Bstorm) [18:21:46] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install new eqiad netsec server - https://phabricator.wikimedia.org/T232137 (10RobH) p:05Triage→03Normal [18:21:59] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install new eqiad netsec server - https://phabricator.wikimedia.org/T232137 (10RobH) [18:22:51] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001 - https://phabricator.wikimedia.org/T232137 (10RobH) [18:24:54] 10Operations, 10observability, 10Discovery-Search (Current work): Alert when a jvm hits more than 100 old gc ops/hour - https://phabricator.wikimedia.org/T231516 (10debt) 05Open→03Resolved [18:26:31] 10Operations, 10Discovery-Search (Current work): Run jstack / jmap / etc... 
with PrivateTmp=true - https://phabricator.wikimedia.org/T230774 (10debt) 05Open→03Resolved [18:32:58] 10Operations, 10Elasticsearch, 10Discovery-Search (Current work): Icinga reports read time out error for some checks on cloudelastic cluster - https://phabricator.wikimedia.org/T230366 (10debt) 05Open→03Resolved a:03debt [18:33:32] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:34:56] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:36:17] 10Operations, 10Analytics, 10Discovery, 10Research-Backlog, 10Patch-For-Review: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (10debt) [18:37:15] (03PS1) 10Andrew Bogott: codf1dev: move the puppetmaster enc database to cloudb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/534657 (https://phabricator.wikimedia.org/T229441) [18:39:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10debt) 05Open→03Resolved >>! In T214283#5451622, @RobH wrote: > Also, in the future, please open a new task for hardware trou... 
[18:40:49] (03CR) 10Andrew Bogott: [C: 03+2] codf1dev: move the puppetmaster enc database to cloudb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/534657 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott) [18:49:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Cmjohnson) [18:49:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew this is ready for you to re-image [18:50:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Cmjohnson) [18:50:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew this is ready for you to re-image [18:53:54] (03PS1) 10CRusnov: netbox: fix includes of ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/534658 [18:59:29] (03PS5) 10Krinkle: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [18:59:53] (03CR) 10Ayounsi: [C: 03+1] netbox: fix includes of ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/534658 (owner: 10CRusnov) [19:00:22] (03CR) 10CRusnov: [C: 03+2] netbox: fix includes of ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/534658 (owner: 10CRusnov) [19:00:27] * Krinkle deploying https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMaintenance/+/534660/ [19:01:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team 
(Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson) B0:26:28:29:6A:E0 [19:06:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:15:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson) [19:16:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (10Cmjohnson) a:05Cmjohnson→03Andrew @andrew the new mac is in an earlier update. The server is moved, connected to the new port... [19:21:53] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.21/extensions/WikimediaMaintenance/blameStartupRegistry.php: 7adf466614d (duration: 00m 48s) [19:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:08] (03CR) 10Krinkle: [C: 03+2] Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [19:23:33] (03Merged) 10jenkins-bot: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [19:23:51] (03CR) 10jenkins-bot: Remove $wgSiteStatsAsyncFactor setting which had the same effect as the default (disabled) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521004 (owner: 10Aaron Schulz) [19:24:41] * Krinkle staging on mwdebug1002 [19:28:29] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: c7678f0e3d638 (duration: 00m 47s) [19:28:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:33] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Cmjohnson) @ottomata the on-site work is done, They will need updated production DNS but all are moved and c... [19:32:05] (03PS1) 10Andrew Bogott: Cloudvirt1023: move to 10G nic [puppet] - 10https://gerrit.wikimedia.org/r/534663 (https://phabricator.wikimedia.org/T229871) [19:32:40] (03PS2) 10Andrew Bogott: Cloudvirt1023: move to 10G nic [puppet] - 10https://gerrit.wikimedia.org/r/534663 (https://phabricator.wikimedia.org/T229871) [19:33:39] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1023: move to 10G nic [puppet] - 10https://gerrit.wikimedia.org/r/534663 (https://phabricator.wikimedia.org/T229871) (owner: 10Andrew Bogott) [19:36:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:46:46] (03CR) 10Jforrester: [C: 03+1] Make one user out of 3 use php7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534610 (https://phabricator.wikimedia.org/T219150) (owner: 10Giuseppe Lavagetto) [19:53:24] (03PS1) 10Andrew Bogott: update nic names for cloudvirt1021 and cloudvirt1022 [puppet] - 10https://gerrit.wikimedia.org/r/534669 (https://phabricator.wikimedia.org/T229873) [19:54:02] (03CR) 10Andrew Bogott: [C: 03+2] update nic names for cloudvirt1021 and cloudvirt1022 [puppet] - 10https://gerrit.wikimedia.org/r/534669 (https://phabricator.wikimedia.org/T229873) (owner: 10Andrew Bogott) [20:05:58] PROBLEM - Nginx local proxy to apache on mw1266 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [20:07:22] RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:11:44] (03PS1) 10Jhedden: openstack: Add codfw1dev glance API to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) [20:18:48] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/18191/" [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [20:21:35] (03PS1) 10Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873) [20:26:58] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) We're mulling this over still. [20:28:15] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) [20:29:39] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) [20:29:50] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) Tagging Multimedia for possible CR of . 
[20:47:18] (03PS2) 10Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873) [20:47:20] (03PS1) 10Andrew Bogott: cloudvirt1023: rename network interfaces [puppet] - 10https://gerrit.wikimedia.org/r/534682 (https://phabricator.wikimedia.org/T229871) [20:48:14] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1023: rename network interfaces [puppet] - 10https://gerrit.wikimedia.org/r/534682 (https://phabricator.wikimedia.org/T229871) (owner: 10Andrew Bogott) [21:08:16] (03PS3) 10Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873) [21:08:18] (03PS1) 10Andrew Bogott: cloudvirt1023: rename interfaces, again [puppet] - 10https://gerrit.wikimedia.org/r/534684 (https://phabricator.wikimedia.org/T229871) [21:09:20] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1023: rename interfaces, again [puppet] - 10https://gerrit.wikimedia.org/r/534684 (https://phabricator.wikimedia.org/T229871) (owner: 10Andrew Bogott) [21:12:01] jouncebot: next [21:12:01] In 1 hour(s) and 47 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T2300) [21:12:14] (03PS1) 10CRusnov: netbox: fix role includes for really reals [puppet] - 10https://gerrit.wikimedia.org/r/534685 [21:12:38] Krinkle: How do you feel about me pushing the write-JSON change out to prod? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/533592 [21:21:33] James_F: checking [21:21:44] Thanks [21:25:28] (03CR) 10Ayounsi: [C: 03+1] "change and compiler look good." 
[puppet] - 10https://gerrit.wikimedia.org/r/534685 (owner: 10CRusnov) [21:25:53] (03CR) 10CRusnov: [C: 03+2] netbox: fix role includes for really reals [puppet] - 10https://gerrit.wikimedia.org/r/534685 (owner: 10CRusnov) [21:26:07] (03PS2) 10CRusnov: netbox: fix role includes for really reals [puppet] - 10https://gerrit.wikimedia.org/r/534685 [21:29:40] (03CR) 10Jforrester: Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [21:30:07] (03CR) 10Krinkle: Variant configuration: Write to static (JSON) as well as serialised cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [21:33:23] James_F: Wanna to IS => array, as first step? [21:33:26] do* [21:34:46] also curious whether we'd be able to actually disuse '+foo'. Seems doable, but I don't know if there's cases where we really need it within wgConf vs doing it in CommonSettings.php afterwards. [21:34:55] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: test deploy for netbox split - again [21:35:07] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: test deploy for netbox split - again (duration: 00m 12s) [21:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:49] Krinkle: I worry about converting IS to an array too early. 
[21:36:01] (03CR) 10Jforrester: [C: 04-1] "> Patch Set 3:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [21:36:26] Krinkle: Sadly we're running HHVM and PHP72 not PHP73 so we can't use JSON_THROW_ON_ERROR [21:37:50] James_F: yeah, I tested that on 3v4l.org before I submitted my comment to see if it would make a difference. It means it'll throw instead of returning false for invalid utf8, but it still doesn't communicate in any way about invalid values like functions or non-std class instances [21:38:08] Yes, but we don't use functions or class instances. [21:38:22] And in the medium term it'll be impossible to try, as it'll be configured in YAML. [21:38:32] Right [21:38:49] So authoring in YAML or JSON would be great. But I conflated those with what the expanded format is. [21:38:58] I mixed them up in my mind. [21:39:04] Why switch to .json for the compiled format? [21:39:08] Authoring in YAML, converting to JSON. [21:39:33] Because the compiled format will be committed in git, and this way it'll (a) not vary by PHP run time and (b) be manually inspectable as to the outcome. [21:40:07] Essentially, a poor man's T220775. [21:40:07] T220775: Consider creating a puppet-compiler equivalent for mediawiki-config.git - https://phabricator.wikimedia.org/T220775 [21:40:21] serialised php doesn't vary by PHP run time. We only do that now because we allow changing config itself by HHVM e.g. in Setup.php and extension hooks. [21:40:29] But yes, human readable expansion matters. [21:40:41] we can use static arrays for that, like we're doing for interwiki, wikiversions and (soon) localisation cache. [21:40:48] Yes. [21:41:04] would parse quicker than json, and forgoes the need for APC [21:41:08] because it'll be in opcache [21:41:13] Eh. [21:41:23] "Quicker" in terms of nanoseconds.
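The two properties asked of the compiled format above (output that does not vary between runs, and a file a human can inspect in git) can be sketched roughly as follows. This is a Python stand-in, not the mediawiki-config code: `compile_config` and the sample settings are hypothetical names. Unlike the PHP behaviour described in the discussion, Python's `json.dumps` refuses functions and arbitrary class instances outright, which is the failure mode one would want here.

```python
import json


def compile_config(settings: dict) -> str:
    """Serialise settings to deterministic, human-readable JSON.

    Hypothetical helper for illustration only.
    """
    # sort_keys makes the output independent of insertion order, so the
    # committed file only changes when a value actually changes.
    # json.dumps raises TypeError for values JSON cannot represent
    # (functions, arbitrary objects) and, with allow_nan=False,
    # ValueError for NaN/Infinity, instead of emitting something lossy.
    return json.dumps(
        settings,
        sort_keys=True,
        indent=2,
        ensure_ascii=False,
        allow_nan=False,
    ) + "\n"


def is_compilable(settings: dict) -> bool:
    """Return True if settings survive compilation, False otherwise."""
    try:
        compile_config(settings)
    except (TypeError, ValueError):
        return False
    return True
```

For example, `is_compilable({"wgExample": lambda: 1})` returns False, while two dicts with the same keys in a different order compile to byte-identical text.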
[21:42:17] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 [21:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:20] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (duration: 00m 03s) [21:42:20] T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291 [21:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:24] Essentially, this is the reverse of wikiversions.json vs. wikiversions.php. [21:43:18] difference between unserialize and json_decode was 0.1 ms, not much indeed. The file read is about 1 ms, which would also be skipped. [21:43:23] but yeah, not much either way. [21:43:33] But we already do the file read, right? [21:43:40] It's not currently opcached. [21:43:53] HHVM has an elaborate file stat cache, which we will soon lose. [21:43:57] So the marginal difference for now is small. [21:44:01] Oh, yes, true. [21:44:19] There is a task about fixing ExtensionProcessor to not do JSON reads and mtime stat calls as much. [21:44:22] I need to fix that. [21:44:33] it doesn't scale currently for N extensions. [21:44:37] 1 config file is fine though. [21:45:02] (03PS4) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [21:45:14] Oh, the "enable wgStoreMTime" or whatever task? [21:45:38] but yeah, we could go from ~ 0 file reads on HHVM with stat cache to 1 file read on PHP 72 + json parse (+ 1 ms, + 0.1ms), or go to 0 file reads and also skip the ~ 0.1ms for unserialize/json_decode with a static array file in opcache.
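The kind of measurement quoted above (decode cost of a native serialised format versus JSON) can be reproduced in miniature. The sketch below is a Python analogue only: `pickle` stands in for PHP's `unserialize`, `json.loads` for `json_decode`, and the generated `config` dict is made-up data, so the absolute numbers will not match the 0.1 ms and 1 ms figures from the discussion.

```python
import json
import pickle
import timeit

# Made-up stand-in for an expanded per-wiki configuration array.
config = {
    f"wgSetting{i}": {"default": i, "testwiki": [i, i + 1]}
    for i in range(2000)
}

as_json = json.dumps(config).encode("utf-8")
as_pickle = pickle.dumps(config)


def time_decode(decode, payload, runs=50):
    """Return the mean decode time in milliseconds over `runs` calls."""
    total = timeit.timeit(lambda: decode(payload), number=runs)
    return total / runs * 1000


if __name__ == "__main__":
    print(f"json.loads:   {time_decode(json.loads, as_json):.3f} ms")
    print(f"pickle.loads: {time_decode(pickle.loads, as_pickle):.3f} ms")
```

Either decode step is separate from the cost of reading the file from disk, which is the part an opcached static array or an in-memory cache would avoid.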
[21:45:54] or we can go to PHP72 with no json_decode or file read if we use APCu and an mtime check only [21:46:16] T187154 [21:46:17] T187154: Consider enabling wgExtensionInfoMTime in wmf-production - https://phabricator.wikimedia.org/T187154 [21:46:22] Yeah, that one. [21:46:39] I can see that portion growing in the flame graph when we got more PHP72 traffic [21:46:53] initially a bit random on excimer due to small sampling [21:46:55] more obvious now [21:46:57] Well, we're about to go to 1/3rd PHP72. [21:47:01] So… [21:47:09] It's going to get worse quite quickly. [21:49:00] (03PS5) 10Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [21:52:35] (03PS3) 10Zoranzoki21: Set noindex for user and user_talk on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534471 (https://phabricator.wikimedia.org/T231982) [21:52:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:40] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 312 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Netbox [22:05:00] expected, downtiming [22:05:14] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:07:10] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:10:28] (03PS1) 10CRusnov: netbox: add netbox hosts to dsh host list.
[puppet] - 10https://gerrit.wikimedia.org/r/534692 [22:10:54] James_F: oh my [22:10:54] https://performance.wikimedia.org/arclamp/svgs/daily/2019-09-04.excimer.load.svgz [22:11:06] 13% (!) is spent in ExtensionRegistry::loadFromQueue [22:11:06] (03CR) 10jerkins-bot: [V: 04-1] netbox: add netbox hosts to dsh host list. [puppet] - 10https://gerrit.wikimedia.org/r/534692 (owner: 10CRusnov) [22:11:14] That's php72 only [22:11:43] Krinkle: That is definitely not great. [22:11:43] It's ~0% on HHVM (not sampled at all over 24 hours, so very tiny) [22:12:33] Right. [22:22:40] PROBLEM - Check systemd state on netboxdb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:37] (03PS2) 10CRusnov: netbox: add netbox hosts to dsh host list. [puppet] - 10https://gerrit.wikimedia.org/r/534692 [22:38:11] (03CR) 10Ayounsi: [C: 03+1] "code and PCC looks good." [puppet] - 10https://gerrit.wikimedia.org/r/534692 (owner: 10CRusnov) [22:38:15] (03PS1) 10Jforrester: Stop setting wgCookieSetOnAutoBlock and wgCookieSetOnIpBlock to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/534698 (https://phabricator.wikimedia.org/T191922) [22:38:34] (03CR) 10CRusnov: [C: 03+2] netbox: add netbox hosts to dsh host list. [puppet] - 10https://gerrit.wikimedia.org/r/534692 (owner: 10CRusnov) [22:57:36] (03PS5) 10Reedy: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [22:59:43] (03PS1) 10CRusnov: netbox: Undo some mistakes in the netbox user [puppet] - 10https://gerrit.wikimedia.org/r/534703 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190905T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:41] (03CR) 10CRusnov: [C: 03+2] netbox: Undo some mistakes in the netbox user [puppet] - 10https://gerrit.wikimedia.org/r/534703 (owner: 10CRusnov) [23:00:49] (03CR) 10Jforrester: "Oh, right, we said we'd do this today. Let's roll?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [23:01:08] James_F: If you want to we can [23:01:09] * Reedy grins [23:01:19] Might as well get it done when we said we would [23:01:46] (03CR) 10Jforrester: [C: 03+2] "Service, with a smile." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [23:02:45] (03Merged) 10jenkins-bot: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [23:03:06] lol ^ [23:03:57] (03CR) 10jenkins-bot: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [23:04:34] Reedy: Live on mwdebug1002 if you want to test? [23:05:00] I don't see much point testing it... [23:07:53] Well, I can definitely log out and log back in both of my prod accounts. [23:08:29] Let's go.
[23:09:13] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T151425 Require that passwords are not in the most common 100k list for all users (duration: 00m 48s) [23:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:32] T151425: Enlarge Popular Password File to 100,000 entries and enforce the new minimum in the config - https://phabricator.wikimedia.org/T151425 [23:09:54] <3 [23:12:47] !log ayounsi@deploy1001 Started deploy [netbox/deploy@367ca84]: test [23:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:29] !log ayounsi@deploy1001 Finished deploy [netbox/deploy@367ca84]: test (duration: 00m 42s) [23:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:53] (PS1) Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) [23:15:31] (CR) jerkins-bot: [V: -1] sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) (owner: Bstorm) [23:18:12] (PS6) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) [23:21:22] Reedy: Do we want to set MinimumPasswordLengthToLogin to 10 for priv groups (right now it's just +staff)? [23:21:55] I think we do at some point, for sure [23:22:06] Do we need some communications for that first? Likely [23:22:07] (PS1) Jforrester: Drop PasswordCannotBePopular compatibility hack, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/534706 [23:22:08] But not right now? [23:22:18] Eh. You're such a goody-goody. 
[23:22:22] Heh [23:22:25] I mean [23:22:34] Part of me would love to see how many people it affected... [23:23:04] It depends how you read https://meta.wikimedia.org/wiki/Password_policy for example [23:23:09] Password requirements for privileged users: [23:23:09] Must be at least 10 characters [23:23:20] I'd see must... As in, MW will make you [23:23:35] So, in some regards, it's literally following the policy... So nothing to actually announce? [23:23:51] (PS1) Jforrester: Set MinimumPasswordLengthToLogin to 10 for all prived groups, not just +staff [mediawiki-config] - https://gerrit.wikimedia.org/r/534707 [23:24:08] We can announce and wait another week? [23:24:20] MinimumPasswordLengthToLogin is a bit aggressive. [23:24:39] I'm not sure how well it works on API login, e.g. the apps. [23:24:40] (CR) Reedy: [C: +1] "As per https://meta.wikimedia.org/wiki/Password_policy" [mediawiki-config] - https://gerrit.wikimedia.org/r/534707 (owner: Jforrester) [23:24:45] That seems reasonable [23:25:03] Helps remove more cruft and edge cases from CS [23:28:04] https://meta.wikimedia.org/w/index.php?diff=19355540&oldid=19355349&title=Tech/News/2019/37&diffmode=visual [23:50:16] (PS2) Bstorm: sssd: Add some new images to test sssd in containers [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/534704 (https://phabricator.wikimedia.org/T229058) [23:50:35] (CR) Krinkle: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester) [23:51:59] (CR) Jforrester: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)