[00:41:12] Operations, Phabricator, hardware-requests, serviceops, Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (Dzahn) a:faidon→Dzahn
[00:44:24] !log tstarling@deploy1001 Synchronized php-1.35.0-wmf.5/includes/Rest/Handler/PageHistoryCountHandler.php: fix extremely slow query T238378 (duration: 00m 59s)
[00:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:30] T238378: Bot edit count endpoint is timing out - https://phabricator.wikimedia.org/T238378
[00:49:30] Operations, Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (Krenair)
[01:02:34] Operations, Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (Krenair)
[03:25:11] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Vgutierrez) >>! In T238285#5666584, @Urbanecm wrote: > @Vgutierrez I guess what you quoted wouldn't be valid, b...
[03:38:49] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2007 [puppet] - https://gerrit.wikimedia.org/r/551358 (https://phabricator.wikimedia.org/T231627)
[03:38:51] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2007 [puppet] - https://gerrit.wikimedia.org/r/551359 (https://phabricator.wikimedia.org/T231627)
[03:40:08] !log Move cp2007 from nginx to ats-tls - T231627
[03:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:14] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[03:41:00] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp2007 [puppet] - https://gerrit.wikimedia.org/r/551358 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[03:42:31] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp2007 [puppet] - https://gerrit.wikimedia.org/r/551359 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[03:51:02] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[04:01:44] (PS1) Vgutierrez: ncredir: Increase HSTS max-age to 3.37 years [puppet] - https://gerrit.wikimedia.org/r/551362 (https://phabricator.wikimedia.org/T231514)
[04:06:00] (CR) Vgutierrez: [C: +2] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/19436/" [puppet] - https://gerrit.wikimedia.org/r/551362 (https://phabricator.wikimedia.org/T231514) (owner: Vgutierrez)
[04:14:12] Operations, Traffic: Enable HSTS for non canonical domains using the ncredir service - https://phabricator.wikimedia.org/T231514 (Vgutierrez) Open→Resolved
[04:15:53] Operations, Traffic: Submit non-canonical domains to the HSTS preload list - https://phabricator.wikimedia.org/T238513 (Vgutierrez)
[04:16:47] Operations, Traffic, Patch-For-Review: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez) Open→Resolved a:Vgutierrez Solved with 8.0.5-1wm11
[04:17:57] Operations, Traffic: Submit non-canonical domains to the HSTS preload list - https://phabricator.wikimedia.org/T238513 (Vgutierrez) p:Triage→Normal
[05:42:31] Operations, DBA: Related to scheduled renaming - https://phabricator.wikimedia.org/T238512 (Marostegui) p:Triage→Normal Thank you! Normally there is no need to tag #dba for such operations, but given how big these two renames are, I am also going to tag #Operations and ping a few people from Wi...
[05:42:56] Operations, DBA, GlobalRename, MediaWiki-extensions-CentralAuth: Related to scheduled renaming - https://phabricator.wikimedia.org/T238512 (Marostegui)
[05:52:27] Operations, DBA, SRE-Access-Requests, Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Marostegui) The user @Dzahn suggests to use I believe it is `phstats`, which only has access...
[05:53:57] !log Deploy schema change on s5 primary master db1100 - T233135 T234066
[05:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:04] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:54:04] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:55:57] Operations, DBA, User-notice: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 (Marostegui) Thank you both for pointing it out and for fixing it! :)
[06:01:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2072, db2088:3311, db2087:3316, db2086:3317 after maintenances and schema changes', diff saved to https://phabricator.wikimedia.org/P9650 and previous config saved to /var/cache/conftool/dbconfig/20191118-060114-marostegui.json
[06:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105:3312 for compression', diff saved to https://phabricator.wikimedia.org/P9651 and previous config saved to /var/cache/conftool/dbconfig/20191118-060207-marostegui.json
[06:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for compression', diff saved to https://phabricator.wikimedia.org/P9652 and previous config saved to /var/cache/conftool/dbconfig/20191118-060508-marostegui.json
[06:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:23] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp2012 [puppet] - https://gerrit.wikimedia.org/r/551365 (https://phabricator.wikimedia.org/T231627)
[06:10:25] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp2012 [puppet] - https://gerrit.wikimedia.org/r/551366 (https://phabricator.wikimedia.org/T231627)
[06:12:31] !log Move cp2012 from nginx to ats-tls - T231627
[06:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:36] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[06:12:45] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp2012 [puppet] - https://gerrit.wikimedia.org/r/551365 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[06:14:23] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp2012 [puppet] - https://gerrit.wikimedia.org/r/551366 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[06:17:23] Operations, DBA, GlobalRename, MediaWiki-extensions-CentralAuth: Related to scheduled renaming - https://phabricator.wikimedia.org/T238512 (Sotiale) Yes, I also remember that this is the first time such a big rename has happened. You mean, you want me to deal with another account after one accou...
[06:19:36] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp2012 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[06:19:45] ^^ expected
[06:20:55] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp2012 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345295 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-11-22 07:59:59 +0000 (expires in 370 days) https://wikitech.wikimedia.org/wiki/HTTPS
[06:24:27] Operations, DBA, GlobalRename, MediaWiki-extensions-CentralAuth: Related to scheduled renaming - https://phabricator.wikimedia.org/T238512 (Marostegui) >>! In T238512#5670210, @Sotiale wrote: > Yes, I also remember that this is the first time such a big rename has happened. > > You mean, you wan...
[06:25:51] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[06:29:59] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1081 [puppet] - https://gerrit.wikimedia.org/r/551367 (https://phabricator.wikimedia.org/T231627)
[06:30:01] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1081 [puppet] - https://gerrit.wikimedia.org/r/551368 (https://phabricator.wikimedia.org/T231627)
[06:30:43] !log Restart tendril mysql - T231769
[06:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:49] T231769: Investigate possible memory leak on db1115 - https://phabricator.wikimedia.org/T231769
[06:32:44] !log Move cp1081 from nginx to ats-tls - T231627
[06:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:49] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[06:33:28] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp1081 [puppet] - https://gerrit.wikimedia.org/r/551367 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[06:33:39] PROBLEM - HTTP-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 280 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[06:33:46] ^ expected
[06:34:53] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp1081 [puppet] - https://gerrit.wikimedia.org/r/551368 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[06:34:59] RECOVERY - HTTP-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 85885 bytes in 0.459 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[06:42:40] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[06:48:58] (PS1) Vgutierrez: hiera: Move nginx from port 443 to port 4443 on cp1083 [puppet] - https://gerrit.wikimedia.org/r/551369 (https://phabricator.wikimedia.org/T231627)
[06:49:00] (PS1) Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp1083 [puppet] - https://gerrit.wikimedia.org/r/551370 (https://phabricator.wikimedia.org/T231627)
[06:52:43] !log Move cp1083 from nginx to ats-tls - T231627
[06:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:49] T231627: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627
[06:53:20] (CR) Vgutierrez: [C: +2] hiera: Move nginx from port 443 to port 4443 on cp1083 [puppet] - https://gerrit.wikimedia.org/r/551369 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[06:54:52] (CR) Vgutierrez: [C: +2] hiera: Move ats-tls from port 8443 to port 443 on cp1083 [puppet] - https://gerrit.wikimedia.org/r/551370 (https://phabricator.wikimedia.org/T231627) (owner: Vgutierrez)
[07:06:21] (PS1) Marostegui: mariadb: Provision db2135 into m5 [puppet] - https://gerrit.wikimedia.org/r/551371 (https://phabricator.wikimedia.org/T238183)
[07:08:57] Operations, Traffic, Patch-For-Review: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[07:09:18] !log Stop MySQL on db2070 to clone db2135 - T238183
[07:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:24] T238183: Productionize db213[2-5} - https://phabricator.wikimedia.org/T238183
[07:11:03] (CR) Marostegui: [C: +2] mariadb: Provision db2135 into m5 [puppet] - https://gerrit.wikimedia.org/r/551371 (https://phabricator.wikimedia.org/T238183) (owner: Marostegui)
[07:14:55] (PS2) Marostegui: realm.pp: Add oauth2_access_tokens as a private table [puppet] - https://gerrit.wikimedia.org/r/551140 (https://phabricator.wikimedia.org/T238370)
[07:16:43] (CR) Marostegui: [C: +2] realm.pp: Add oauth2_access_tokens as a private table [puppet] - https://gerrit.wikimedia.org/r/551140 (https://phabricator.wikimedia.org/T238370) (owner: Marostegui)
[07:17:52] !log Upgrade and restart mysql on sanitarium hosts on codfw to pick up new replication filters: db2094 and db2095 - T238370
[07:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:57] T238370: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370
[07:18:57] (PS2) KartikMistry: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697)
[07:31:18] Operations, Traffic: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (Vgutierrez)
[07:33:08] (CR) jerkins-bot: [V: -1] New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: KartikMistry)
[07:34:41] (PS3) KartikMistry: New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697)
[07:42:51] (PS2) Marostegui: mariadb: Promote db1131 to s6 master [puppet] - https://gerrit.wikimedia.org/r/551040 (https://phabricator.wikimedia.org/T235469)
[07:49:20] (CR) jerkins-bot: [V: -1] New upstream release [debs/contenttranslation/hfst] - https://gerrit.wikimedia.org/r/550092 (https://phabricator.wikimedia.org/T233697) (owner: KartikMistry)
[07:54:03] (PS1) Elukey: hadoop: add analytics kerberos keytab for test/prod master host [puppet] - https://gerrit.wikimedia.org/r/551392 (https://phabricator.wikimedia.org/T237269)
[07:56:53] (CR) Elukey: [C: +2] hadoop: add analytics kerberos keytab for test/prod master host [puppet] - https://gerrit.wikimedia.org/r/551392 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[07:57:16] (CR) Muehlenhoff: [C: +2] Document database setup [software/debmonitor] - https://gerrit.wikimedia.org/r/545811 (owner: Muehlenhoff)
[08:00:03] Operations, Traffic: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (Vgutierrez)
[08:00:18] Operations, Traffic: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (Vgutierrez) p:Triage→Normal
[08:01:16] (PS1) Elukey: hadoop: correct filename of kerberos analytics keytab for master nodes [puppet] - https://gerrit.wikimedia.org/r/551393
[08:02:59] good morning Luca
[08:03:12] (CR) Elukey: [C: +2] hadoop: correct filename of kerberos analytics keytab for master nodes [puppet] - https://gerrit.wikimedia.org/r/551393 (owner: Elukey)
[08:08:20] (PS1) Vgutierrez: ssl_ciphersuite: Allow TLSv1/TLSv1.1 in compat mode only [puppet] - https://gerrit.wikimedia.org/r/551396 (https://phabricator.wikimedia.org/T238518)
[08:18:46] (PS1) Elukey: profile::analytics::cluster::client: fix nagios' sudo permissions [puppet] - https://gerrit.wikimedia.org/r/551398 (https://phabricator.wikimedia.org/T237269)
[08:20:37] (PS1) Vgutierrez: idp: Set SSL compatibilty mode to strong [puppet] - https://gerrit.wikimedia.org/r/551413 (https://phabricator.wikimedia.org/T238518)
[08:21:57] (CR) Elukey: [C: +2] profile::analytics::cluster::client: fix nagios' sudo permissions [puppet] - https://gerrit.wikimedia.org/r/551398 (https://phabricator.wikimedia.org/T237269) (owner: Elukey)
[08:23:36] (CR) Vgutierrez: [C: +1] "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1003/19439/" [puppet] - https://gerrit.wikimedia.org/r/551413 (https://phabricator.wikimedia.org/T238518) (owner: Vgutierrez)
[08:26:34] (CR) Muehlenhoff: [C: +1] "Looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/551413 (https://phabricator.wikimedia.org/T238518) (owner: Vgutierrez)
[08:28:43] Operations: Integrate Buster 10.1 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff)
[08:28:52] Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff) p:Triage→Normal
[08:32:45] (CR) Vgutierrez: [C: +2] idp: Set SSL compatibilty mode to strong [puppet] - https://gerrit.wikimedia.org/r/551413 (https://phabricator.wikimedia.org/T238518) (owner: Vgutierrez)
[08:33:05] (PS2) Vgutierrez: idp: Set SSL compatibilty mode to strong [puppet] - https://gerrit.wikimedia.org/r/551413 (https://phabricator.wikimedia.org/T238518)
[08:36:19] Operations, Traffic, Patch-For-Review: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (Vgutierrez)
[08:55:18] (PS1) Elukey: profile::analytics::cluster::client: rework nagios' sudo permissions [puppet] - https://gerrit.wikimedia.org/r/551494
[08:57:08] Operations, Traffic, Patch-For-Review: Disable TLSv1/TLSv1.1 on sites without caching layer - https://phabricator.wikimedia.org/T238518 (MoritzMuehlenhoff)
[09:02:16] (CR) Elukey: [C: +2] profile::analytics::cluster::client: rework nagios' sudo permissions [puppet] - https://gerrit.wikimedia.org/r/551494 (owner: Elukey)
[09:03:20] !log Restart MySQL on db1124 and db1125 to apply new replication filters T238370
[09:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:25] T238370: Apply schema changes for OAuth 2.0 - https://phabricator.wikimedia.org/T238370
[09:04:43] (CR) Filippo Giunchedi: [C: +1] logstash: parse DPT and SPT from ulogd events [puppet] - https://gerrit.wikimedia.org/r/551270 (https://phabricator.wikimedia.org/T238416) (owner: Herron)
[09:11:12] !log Deploy schema change on s8 codfw, this will generate lag on s8 codfw - T233135 T234066
[09:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:18] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[09:11:18] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[09:11:49] !log Remove ar_comment from triggers on db2094:3318 - T234704
[09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:54] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[09:12:56] (PS2) Muehlenhoff: Bump system user UID range in enforce-users-groups.sh [puppet] - https://gerrit.wikimedia.org/r/548269 (https://phabricator.wikimedia.org/T235162)
[09:19:52] (CR) Filippo Giunchedi: [C: +1] "+1 modulo Andrew's question" [puppet] - https://gerrit.wikimedia.org/r/551247 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata)
[09:20:45] (CR) Filippo Giunchedi: [C: +1] Add LVS for eventgate-logging-external using TLS port [puppet] - https://gerrit.wikimedia.org/r/550922 (https://phabricator.wikimedia.org/T236386) (owner: Ottomata)
[09:30:18] (CR) Filippo Giunchedi: [C: +2] hieradata: remove wezen from service [puppet] - https://gerrit.wikimedia.org/r/547246 (https://phabricator.wikimedia.org/T224564) (owner: Filippo Giunchedi)
[09:30:28] (PS3) Filippo Giunchedi: hieradata: remove wezen from service [puppet] - https://gerrit.wikimedia.org/r/547246 (https://phabricator.wikimedia.org/T224564)
[09:30:48] (PS2) Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269)
[09:33:28] !log remove wezen from service, pending reimage
[09:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:42] there will be some rsyslog delivery errors alerts, expected
[09:39:43] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:11] (PS1) Ema: dbtree: add https VirtualHost [puppet] - https://gerrit.wikimedia.org/r/551496 (https://phabricator.wikimedia.org/T210411)
[09:53:26] (PS3) Muehlenhoff: Bump system user UID range in enforce-users-groups.sh [puppet] - https://gerrit.wikimedia.org/r/548269 (https://phabricator.wikimedia.org/T235162)
[09:53:37] (PS2) Ema: dbtree: add https VirtualHost [puppet] - https://gerrit.wikimedia.org/r/551496 (https://phabricator.wikimedia.org/T210411)
[09:58:48] Operations, DBA, Data-Services, User-Urbanecm, cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (jhsoby)
[09:59:38] (CR) Ema: [C: +2] dbtree: add https VirtualHost [puppet] - https://gerrit.wikimedia.org/r/551496 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[09:59:47] Operations, DBA, Data-Services, User-Urbanecm, cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (Marostegui) p:Triage→Normal Let us know when the database is created so we can sanitize it before sending it to t...
[10:00:45] (CR) Muehlenhoff: [C: +2] Bump system user UID range in enforce-users-groups.sh [puppet] - https://gerrit.wikimedia.org/r/548269 (https://phabricator.wikimedia.org/T235162) (owner: Muehlenhoff)
[10:01:46] marostegui, jynus: dbtree.wikimedia.org should be working fine with ATS too now, no need to go via codfw anymore
[10:03:14] ema: indeed :)
[10:03:17] thank you!
[10:03:59] you're welcome! Sorry for breaking it in the first place :)
[10:05:06] ema: vgutierrez confirms it also works well from eqsin! \o/
[10:06:26] (CR) ArielGlenn: "ok by me for dumps, adding brooke since the public facing dumps servers 'belong' to wmcs" [puppet] - https://gerrit.wikimedia.org/r/551396 (https://phabricator.wikimedia.org/T238518) (owner: Vgutierrez)
[10:07:01] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:25] (PS1) Muehlenhoff: Enable managed adduser.conf for IDP role [puppet] - https://gerrit.wikimedia.org/r/551500
[10:10:11] (PS1) Jon Harald Søby: RESTRouter: Add minwiktionary [deployment-charts] - https://gerrit.wikimedia.org/r/551501 (https://phabricator.wikimedia.org/T238523)
[10:16:45] !log Upgrade MySQL on labsdb1012
[10:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:06] (CR) Muehlenhoff: [C: +2] Enable managed adduser.conf for IDP role [puppet] - https://gerrit.wikimedia.org/r/551500 (owner: Muehlenhoff)
[10:23:52] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[10:25:11] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[10:26:38] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (Urbanecm) a:Urbanecm→None Unassigning, there's nothing I can do here :-).
[10:28:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:28:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:30:04] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T1030).
[10:31:35] Telia link down, schedule maintenance --^
[10:40:25] (CR) Vgutierrez: [C: +1] ATS: enable mwdebug routes for noc.wm.org [puppet] - https://gerrit.wikimedia.org/r/551154 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[10:44:18] Operations: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (jbond) This was discussed in [[ https://phabricator.wikimedia.org/T116011#4927275 | T116011 ]] and the code to log to a separate file exists. the Reason for choosing to log to syslog was to simplify shipping lo...
[10:45:20] Operations, observability: Write ulogd logs to a dedicated logfile - https://phabricator.wikimedia.org/T238414 (jbond)
[10:45:37] !log updated buster netinst image for 10.2 T238519
[10:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:42] T238519: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519
[10:49:18] !log installing python-cryptography bugfix updates from buster point release
[10:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:14] (CR) Ema: [C: +2] ATS: enable mwdebug routes for noc.wm.org [puppet] - https://gerrit.wikimedia.org/r/551154 (https://phabricator.wikimedia.org/T233768) (owner: Ema)
[10:53:25] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-11-18 06:38:42 from db2098.codfw.wmnet:3313 (810 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[10:53:57] !log installing gdb updates from buster point release
[10:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:28] !log Deploy schema change on db2078 (codfw master for wikidatawiki), this will create lag on s8 codfw - T237120
[10:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:34] T237120: Schema change on production for increase the size of wbt_text_in_lang.wbxl_language - https://phabricator.wikimedia.org/T237120
[10:56:55] !log installing python-werkzeug security updates
[10:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:55] (PS1) ArielGlenn: add new partman recipe that skips format of /data partition for dumps servers [puppet] - https://gerrit.wikimedia.org/r/551503 (https://phabricator.wikimedia.org/T224563)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T1100).
[11:00:04] awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:16] I can do my patches :-)
[11:00:26] awight: great!
[11:01:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:33] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/551270 (https://phabricator.wikimedia.org/T238416) (owner: Herron)
[11:02:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:03:23] Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff)
[11:03:37] Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (MoritzMuehlenhoff)
[11:04:08] !log installing postgresql-common security updates
[11:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:17] which mwdebug should I be using now..?
[11:04:38] I though I saw the recommendation bumped to mwdebug1003 the other day?
[11:05:04] mwdebug1003 doesn't exist :-)
[11:06:01] Effie reimaged mwdebug1002 last week, but don't know if it's fully working again
[11:06:21] thank you, that helps narrow it down quite a bit ;-) 1001 it shall be
[11:06:22] it should be good to go
[11:06:46] awight: please use mwdebug1002 and let me know if it behaves as it should
[11:06:57] effie: Great ty!
[11:07:05] :)
[11:08:55] (CR) Volans: "Slightly -1 post-merge, being an RST needs some specific syntax to have it being well formatted. I'll send a patch with the fixes." [software/debmonitor] - https://gerrit.wikimedia.org/r/545811 (owner: Muehlenhoff)
[11:12:29] (PS1) Volans: README: fix reST format [software/debmonitor] - https://gerrit.wikimedia.org/r/551507
[11:14:26] (CR) Muehlenhoff: [C: +1] "Looks good, thanks" [software/debmonitor] - https://gerrit.wikimedia.org/r/551507 (owner: Volans)
[11:14:46] (CR) Volans: [C: +2] README: fix reST format [software/debmonitor] - https://gerrit.wikimedia.org/r/551507 (owner: Volans)
[11:16:54] (Merged) jenkins-bot: README: fix reST format [software/debmonitor] - https://gerrit.wikimedia.org/r/551507 (owner: Volans)
[11:18:47] effie: I'm not sure it would be related to your reimaging, but I ran into this error when using the FireFox mwdebug extension on 1002: x-wikimedia-debug-routing: no match found for the backend specified in X-Wikimedia-Debug
[11:19:09] No such error encountered if using mwdebug1001.
[11:21:42] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for minwiktionary - https://phabricator.wikimedia.org/T238522 (jhsoby) Right, sorry – forgot to remove the defaults when creating the task. :-)
[11:26:53] !log awight@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Popups: SWAT: [[gerrit:551397|Don't record Popups actions on non-content pages (T214493)]] (duration: 00m 51s)
[11:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:59] T214493: Track interaction with ReferencePreviews - https://phabricator.wikimedia.org/T214493
[11:28:16] !log awight@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Cite: SWAT: [[gerrit:551389|Track pageviews only on content page views, not edits (T214493)]] (duration: 00m 51s)
[11:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:33] !log EU SWAT complete
[11:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:01] awight: I think I know what is wrong
[11:35:01] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Create in-cloud, cloud-vps-wide cumin masters - https://phabricator.wikimedia.org/T232429 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:35:03] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): cloud-puppetmasters: move some hiera settings from Horizon to git/gerrit - https://phabricator.wikimedia.org/T232509 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:35:15] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): Resolve local commits on cloud-puppetmaster-01.cloudinfra.eqiad.wmflabs and cloud-puppetmaster-02.cloudinfra.eqiad.wmflabs - https://phabricator.wikimedia.org/T232428 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:35:17] Operations, Puppet, Cloud-Services, cloud-services-team (Kanban): cloud-vps puppet cert cleaner not working properly - https://phabricator.wikimedia.org/T232427 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:37:09] Operations, Traffic: cergen fails signing CSR - https://phabricator.wikimedia.org/T231423 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:37:29] Operations, ops-codfw, DC-Ops, decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:37:36] Operations, ops-eqiad, DC-Ops, decommission: Decommission db1063.eqiad.wmnet - https://phabricator.wikimedia.org/T232564 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:37:38] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:37:47] Operations, Traffic: Investigate segfaults on ats-tls running on cp5001 - https://phabricator.wikimedia.org/T232298 (DannyS712) [batch] remove patch for review tag from resolved tasks
[11:37:51] effie: FYI I'm not blocked, I finished the deployment testing on 1001. But let me know if I can help test a fix, later.
[11:37:59] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2047.codfw.wmnet - https://phabricator.wikimedia.org/T231852 (10DannyS712) [batch] remove patch for review tag from resolved tasks [11:38:01] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (10DannyS712) [batch] remove patch for review tag from resolved tasks [11:38:03] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10DannyS712) [batch] remove patch for review tag from resolved tasks [11:38:14] awight: oh sure, tx [11:39:01] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Rename wezen to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/547247 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [11:39:04] (03PS1) 10Effie Mouzeli: trafficserver: re-enable mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/551515 (https://phabricator.wikimedia.org/T214734) [11:40:58] (03CR) 10Volans: cables: detect duplicate cable names, and blank cable names (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [11:42:00] (03PS2) 10Filippo Giunchedi: Rename wezen to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/547241 (https://phabricator.wikimedia.org/T224564) [11:42:23] (03CR) 10jerkins-bot: [V: 04-1] Rename wezen to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/547241 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [11:44:15] (03PS3) 10Filippo Giunchedi: Rename wezen to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/547241 (https://phabricator.wikimedia.org/T224564) [11:46:38] 10Operations, 10DBA, 10SRE-Access-Requests, 10Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - 
https://phabricator.wikimedia.org/T238425 (10Aklapper) That's also my understanding of the minimum "core databases" I'd need. So let's go... [11:46:48] volans: quick question re: dns validator, when renaming a host is it expected that all duplicated mgmt entries need to be ignored for TOO_MANY_MGMT_NAMES rather than only the offending record? works either way for me but I was curious if feature or bug [11:47:00] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Aklapper) [11:47:07] https://gerrit.wikimedia.org/r/c/operations/dns/+/547241 that is [11:47:34] (03CR) 10Filippo Giunchedi: [C: 03+2] Rename wezen to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/547241 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [11:47:36] godog: looking [11:47:38] (03PS4) 10Filippo Giunchedi: Rename wezen to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/547241 (https://phabricator.wikimedia.org/T224564) [11:49:07] godog: that patch is missing something [11:49:43] the A record for centrallog2001 mgmt [11:49:56] in wmnet zonefile [11:50:17] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [11:50:43] ah indeed you are right, thanks!
fixing [11:52:13] (03PS1) 10Filippo Giunchedi: wmnet: rename wezen mgmt to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/551518 (https://phabricator.wikimedia.org/T224564) [11:52:52] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: rename wezen mgmt to centrallog2001 [dns] - 10https://gerrit.wikimedia.org/r/551518 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [11:53:09] (03PS2) 10Muehlenhoff: Extend debmonitor config with option to add links to images [puppet] - 10https://gerrit.wikimedia.org/r/550245 (https://phabricator.wikimedia.org/T237978) [11:55:22] (03CR) 10Mobrovac: [C: 03+2] RESTRouter: Add minwiktionary [deployment-charts] - 10https://gerrit.wikimedia.org/r/551501 (https://phabricator.wikimedia.org/T238523) (owner: 10Jon Harald Søby) [11:55:34] (03Merged) 10jenkins-bot: RESTRouter: Add minwiktionary [deployment-charts] - 10https://gerrit.wikimedia.org/r/551501 (https://phabricator.wikimedia.org/T238523) (owner: 10Jon Harald Søby) [11:55:44] * godog wears the brown paperbag [11:56:50] godog: and to reply to your question, yes with the current implementation is expected because all three lines would be reported as error (the script cannot know which one is correct and which not) and at logging time them ignore is evaluated. The ignores cannot be evaluated at parsing time because you can ignore some specific errors but not others [11:56:52] (03PS1) 10Filippo Giunchedi: wmnet: fix centrallog2001.mgmt A [dns] - 10https://gerrit.wikimedia.org/r/551519 [11:57:05] s/them ignore/the ignore/ [11:57:42] (03PS2) 10Filippo Giunchedi: wmnet: fix centrallog2001.mgmt A [dns] - 10https://gerrit.wikimedia.org/r/551519 [11:57:59] volans: that makes sense, thanks for the explanation! 
[11:58:10] third time's a charm [11:58:13] I guess you're following https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging [11:58:33] yeah exactly [11:58:36] please have a look at inconsistencies as we don't do renames very often [11:58:46] so it's good to check if that's all still valid and working :) [11:59:51] heheh I got one mistake just now actually, I put centrallog2001 A entry in mgmt.eqiad.wmnet rather than mgmt.codfw.wmnet [12:00:20] though CI was fine with it, not sure if we're WARNING on that tho [12:00:33] and expect to have to add the ignore for the correct one (IIRC) [12:01:05] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: fix centrallog2001.mgmt A [dns] - 10https://gerrit.wikimedia.org/r/551519 (owner: 10Filippo Giunchedi) [12:02:11] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` wezen.codfw.wmnet... [12:02:20] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Reimage wezen to Stretch or Buster (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wezen.codfw.wmnet'] ` Of which those **FAILED**: ` ['wezen.codfw.w...
[12:02:56] godog: mmmh, it was in eqiad.wmnet, not mgmt.eqiad.wmnet the wrong one, so I'm wondering what should fail there [12:03:52] (03CR) 10Muehlenhoff: [C: 03+2] Extend debmonitor config with option to add links to images [puppet] - 10https://gerrit.wikimedia.org/r/550245 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [12:03:55] yep it added a warning for MISSING_IP_FOR_NAME_AND_PTR [12:04:09] and a MISSING_PTR_FOR_NAME_AND_IP [12:06:04] volans: I'm wondering if it'd make sense to check subnets too in case we're adding internal records that point to a non-local site ? [12:07:16] 10Operations, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) >>! In T213089#5655528, @elukey wrote: > Not all new slab metrics are rendered, opened an issue upstream: https://gith... [12:07:35] the zone validator has no specific knowledge of subnets <-> naming convention at the moment [12:08:10] IIRC of course, going by memory [12:10:20] ack, will followup with a wishlist task [12:10:25] sure! [12:10:56] * volans about to go for lunch [12:18:48] (03PS1) 10Filippo Giunchedi: hieradata: cleanup wezen [puppet] - 10https://gerrit.wikimedia.org/r/551522 (https://phabricator.wikimedia.org/T224564) [12:21:10] (03CR) 10Awight: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/551515 (https://phabricator.wikimedia.org/T214734) (owner: 10Effie Mouzeli) [12:22:08] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: cleanup wezen [puppet] - 10https://gerrit.wikimedia.org/r/551522 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [12:24:20] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:26:09] (03PS11) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [12:30:01] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [12:40:17] (03PS3) 10Ema: ATS: network settings for ats-be [puppet] - 10https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432) [12:44:28] (03CR) 10Ema: [C: 03+2] ATS: network settings for ats-be [puppet] - 10https://gerrit.wikimedia.org/r/550866 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [12:45:02] (03CR) 10Faidon Liambotis: Automatically cast network strings to ipaddress objects (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [12:45:42] (03PS1) 10Effie Mouzeli: logstash: remove HHVM references [puppet] - 10https://gerrit.wikimedia.org/r/551524 (https://phabricator.wikimedia.org/T229792) [12:47:04] !log Run mwscript recountCategories.php --wiki=dewiki --mode=subcats (T238500) [12:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:13] T238500: Inconsistencies in Database dewiki.p - https://phabricator.wikimedia.org/T238500 [12:48:03] !log Run mwscript recountCategories.php --wiki=dewiki --mode=pages (T238500) [12:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:17] !log Run mwscript recountCategories.php --wiki=dewiki --mode=files (T238500) [12:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:25] (03CR) 10Ema: [C: 03+2] trafficserver: re-enable mwdebug1002 [puppet] - 10https://gerrit.wikimedia.org/r/551515 (https://phabricator.wikimedia.org/T214734) (owner: 10Effie Mouzeli) [12:48:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [12:50:23] (03PS1) 10Effie Mouzeli: hhvm: Remove hhvm module from puppet [puppet] - 10https://gerrit.wikimedia.org/r/551526 (https://phabricator.wikimedia.org/T229792) [12:51:15] !log Run mwscript recountCategories.php --wiki=cswiki --mode={subcats,pages,files} (T228585) [12:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:20] T228585: Counter for pages in category is below zero - https://phabricator.wikimedia.org/T228585 [13:00:12] (03PS1) 10Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [13:00:24] (03CR) 10Ema: [C: 03+2] ATS: further increase log_buffer_size and max_line_size [puppet] - 10https://gerrit.wikimedia.org/r/550825 (https://phabricator.wikimedia.org/T237608) (owner: 10Ema) [13:06:53] 10Operations, 10DBA, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Scheduled renaming of two Wikimedia user accounts - https://phabricator.wikimedia.org/T238512 (10Ladsgroup) As I said in the global renamers mailing list, the only thing I'm worried about is FlaggedRevs and AbuseFilter tables that mig... 
[13:10:54] !log cp-ats: rolling ats-{tls,backend} restart to apply log_buffer_size config changes T237608 [13:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:59] T237608: ATS skipping certain logs due to lack of buffer space - https://phabricator.wikimedia.org/T237608 [13:17:03] (03PS1) 10Mathew.onipe: prometheus: Update prometheus target config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/551529 [13:23:21] (03PS1) 10Ema: vcl: move XWD pass logic to wm_common [puppet] - 10https://gerrit.wikimedia.org/r/551531 (https://phabricator.wikimedia.org/T233768) [13:25:14] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add openstack ocata release [puppet] - 10https://gerrit.wikimedia.org/r/551533 (https://phabricator.wikimedia.org/T238338) [13:26:48] (03PS2) 10Arturo Borrero Gonzalez: aptrepo: add openstack ocata release [puppet] - 10https://gerrit.wikimedia.org/r/551533 (https://phabricator.wikimedia.org/T238338) [13:28:13] (03PS12) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [13:28:15] (03PS1) 10Jbond: CI - Python3: Fix minor flake8 issues in python3 files [puppet] - 10https://gerrit.wikimedia.org/r/551534 [13:28:22] (03PS2) 10Gehel: prometheus: Update prometheus target config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/551529 (owner: 10Mathew.onipe) [13:28:56] (03PS4) 10Ema: bot_blocked_nets: also block blank/unset UA [puppet] - 10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) (owner: 10CDanis) [13:32:38] (03CR) 10jerkins-bot: [V: 04-1] CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [13:32:52] (03CR) 10Gehel: [C: 03+2] prometheus: Update prometheus target config for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/551529 (owner: 10Mathew.onipe) [13:36:26] (03CR) 10Ema: [C: 03+2] bot_blocked_nets: also block blank/unset UA [puppet] - 
10https://gerrit.wikimedia.org/r/547792 (https://phabricator.wikimedia.org/T237134) (owner: 10CDanis) [13:46:08] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/551533 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [13:49:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add openstack ocata release [puppet] - 10https://gerrit.wikimedia.org/r/551533 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [13:51:43] (03CR) 10Volans: CI - Python3: Fix minor flake8 issues in python3 files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551534 (owner: 10Jbond) [13:53:07] (03PS2) 10Jbond: CI - Python3: Fix minor flake8 issues in python3 files [puppet] - 10https://gerrit.wikimedia.org/r/551534 [13:53:42] (03CR) 10Jbond: CI - Python3: Fix minor flake8 issues in python3 files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551534 (owner: 10Jbond) [13:54:29] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:54:35] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:55:06] (03CR) 10Jbond: [C: 03+2] apereo_cas: update systemd to run as a system user [puppet] - 10https://gerrit.wikimedia.org/r/550872 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [14:03:43] (03PS1) 10Muehlenhoff: Enable cas-server-support-saml-idp in Gradle config [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/551540 [14:04:07] (03CR) 10Jbond: [C: 03+2] apereo_cas: improve systemd security [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) (owner: 10Jbond) [14:05:57] (03PS3) 10Jbond: apereo_cas: improve systemd security [puppet] - 10https://gerrit.wikimedia.org/r/551159 (https://phabricator.wikimedia.org/T233951) [14:09:31] RECOVERY - Check the Netbox 
report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:10:53] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: openstack-ocata-stretch: remove unused upstream backports component [puppet] - 10https://gerrit.wikimedia.org/r/551543 (https://phabricator.wikimedia.org/T238338) [14:14:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: openstack-ocata-stretch: remove unused upstream backports component [puppet] - 10https://gerrit.wikimedia.org/r/551543 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [14:18:35] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:19:37] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:20:12] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: actually install file reprepro-update-filter-wmcs-openstack-ocata.sh [puppet] - 10https://gerrit.wikimedia.org/r/551544 (https://phabricator.wikimedia.org/T238338) [14:22:17] !log Deploy schema change on dbstore1005:3318 [14:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:31] 10Operations, 10Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (10MSantos) >>! 
In T237228#5644468, @Arjunaraoc wrote: > As per https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&var-cluster=maps1 the failure is now 10 days old. An update on the is... [14:23:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: actually install file reprepro-update-filter-wmcs-openstack-ocata.sh [puppet] - 10https://gerrit.wikimedia.org/r/551544 (https://phabricator.wikimedia.org/T238338) (owner: 10Arturo Borrero Gonzalez) [14:27:47] !log imported openstack ocata deb packages into stretch-wikimedia/thirdpartdy/openstack-ocata-stretch (T238338) [14:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:53] T238338: Import packages for Openstack Ocata - https://phabricator.wikimedia.org/T238338 [14:28:01] 10Operations, 10Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (10MSantos) [14:28:08] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b288c] (dev-cluster): Parsoid: mirror traffic in split mode; add minwiktionary [14:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:53] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b288c] (dev-cluster): Parsoid: mirror traffic in split mode; add minwiktionary (duration: 02m 45s) [14:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:41] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b288c] (dev-cluster): Parsoid: mirror traffic in split mode; add minwiktionary - T229015 T238523 [14:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:47] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [14:31:48] T238523: Add minwiktionary to restbase - https://phabricator.wikimedia.org/T238523 [14:32:50] anybody working on centrallog2001? 
[14:33:31] godog is reimaging it [14:33:53] ah okok, I saw the BFD alerts for codfw and show bgp summary pointed to that [14:34:08] just wanted to make sure that nothing was exploing :D [14:34:11] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b288c] (dev-cluster): Parsoid: mirror traffic in split mode; add minwiktionary - T229015 T238523 (duration: 02m 30s) [14:34:11] *exploding [14:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:25] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b3b288c]: Parsoid: mirror traffic in split mode; add minwiktionary - T229015 T238523 [14:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:25] just checked the puppet config, wasn't aware it was anycast [14:36:27] nice [14:36:44] godog: probably ok to ack those BFD alerts? [14:38:13] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) There's been a decrease in local backend hitrate on ats-be compared to varnish-be. While on 2019-11-11 (before reimages to ats) the local hitrate... 
[14:39:55] (03PS3) 10Filippo Giunchedi: rsyslog: setup temporary secure rsync for logs transfer [puppet] - 10https://gerrit.wikimedia.org/r/547245 (https://phabricator.wikimedia.org/T224564) [14:39:57] (03PS4) 10Filippo Giunchedi: hieradata: pool centrallog2001 in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/547248 (https://phabricator.wikimedia.org/T224564) [14:39:59] (03PS1) 10Filippo Giunchedi: install_server: add raid1+lvm standard recipe for GPT [puppet] - 10https://gerrit.wikimedia.org/r/551547 (https://phabricator.wikimedia.org/T224564) [14:40:01] (03PS1) 10Filippo Giunchedi: install_server: use raid1-gpt-lvm-ext4-srv for centrallog [puppet] - 10https://gerrit.wikimedia.org/r/551548 (https://phabricator.wikimedia.org/T224564) [14:40:18] (03PS1) 10DCausse: [wdqs] add logging config for exporting updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551549 (https://phabricator.wikimedia.org/T231411) [14:40:41] elukey: oh yeah totally forgot about the anycast there, will downtime [14:41:28] moritzm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/551547 [14:48:22] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b3b288c]: Parsoid: mirror traffic in split mode; add minwiktionary - T229015 T238523 (duration: 13m 58s) [14:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:30] T229015: Tracking: Direct live production traffic at Parsoid/PHP - https://phabricator.wikimedia.org/T229015 [14:48:31] T238523: Add minwiktionary to restbase - https://phabricator.wikimedia.org/T238523 [14:50:08] (03CR) 10Filippo Giunchedi: "Another way to do achieve sth similar would be to force "partman-partitioning/choose_label gpt" in the standard recipe and add the biosgru" [puppet] - 10https://gerrit.wikimedia.org/r/551547 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [14:53:18] (03CR) 10Mathew.onipe: [C: 03+1] [wdqs] add logging config for exporting updated entities [puppet] - 
10https://gerrit.wikimedia.org/r/551549 (https://phabricator.wikimedia.org/T231411) (owner: 10DCausse) [14:56:34] (03PS13) 10Jbond: CI - python3: first attempt at adding python3 CI [puppet] - 10https://gerrit.wikimedia.org/r/510613 [14:57:03] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [14:58:47] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:07:37] (03PS1) 10Anomie: Set MCR migration stage to NEW on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551551 (https://phabricator.wikimedia.org/T198312) [15:08:38] (03PS1) 10Ema: vtc: 127.0.0.1 is not a valid value for Host [puppet] - 10https://gerrit.wikimedia.org/r/551552 [15:08:53] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551551 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [15:09:42] (03Merged) 10jenkins-bot: Set MCR migration stage to NEW on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551551 (https://phabricator.wikimedia.org/T198312) (owner: 10Anomie) [15:10:18] o/ are the prometheus endpoints already included anywhere in mediawiki-config / in mediawiki ? 
[15:12:05] (03CR) 10Ema: [C: 03+2] vtc: 127.0.0.1 is not a valid value for Host [puppet] - 10https://gerrit.wikimedia.org/r/551552 (owner: 10Ema) [15:13:42] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set MCR migration stage to NEW on remaining wikis for T198312 (duration: 00m 53s) [15:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:48] T198312: Set the WMF cluster to use the new MCR-only schema - https://phabricator.wikimedia.org/T198312 [15:15:40] awight: can you please try mwdebug1002 again? [15:16:58] (03PS1) 10Mathew.onipe: prometheus: Use the correct parameters to build jmx wdqs-* targets [puppet] - 10https://gerrit.wikimedia.org/r/551553 (https://phabricator.wikimedia.org/T238408) [15:20:38] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: add raid1+lvm standard recipe for GPT [puppet] - 10https://gerrit.wikimedia.org/r/551547 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:20:40] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: use raid1-gpt-lvm-ext4-srv for centrallog [puppet] - 10https://gerrit.wikimedia.org/r/551548 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:20:52] (03PS2) 10Filippo Giunchedi: install_server: add raid1+lvm standard recipe for GPT [puppet] - 10https://gerrit.wikimedia.org/r/551547 (https://phabricator.wikimedia.org/T224564) [15:21:41] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] install_server: add raid1+lvm standard recipe for GPT [puppet] - 10https://gerrit.wikimedia.org/r/551547 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:22:04] (03PS2) 10Filippo Giunchedi: install_server: use raid1-gpt-lvm-ext4-srv for centrallog [puppet] - 10https://gerrit.wikimedia.org/r/551548 (https://phabricator.wikimedia.org/T224564) [15:22:16] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] install_server: use raid1-gpt-lvm-ext4-srv for centrallog [puppet] - 
10https://gerrit.wikimedia.org/r/551548 (https://phabricator.wikimedia.org/T224564) (owner: 10Filippo Giunchedi) [15:30:09] (03Abandoned) 10Anomie: Set all sites to use the new MCR-only schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/549449 (https://phabricator.wikimedia.org/T198312) (owner: 10Daniel Kinzler) [15:38:31] (03CR) 10BBlack: [C: 03+1] vcl: move XWD pass logic to wm_common [puppet] - 10https://gerrit.wikimedia.org/r/551531 (https://phabricator.wikimedia.org/T233768) (owner: 10Ema) [15:40:09] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) a:03ema [15:40:49] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10Gilles) Assigning this to you, as it sounds like a very likely root cause to the sizeable performance regression [15:50:43] 10Operations, 10Performance-Team, 10Traffic: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10BBlack) @Gilles This could also be related to TLS certificate changes that were happening around the same dates, and could be inflating the bytes transf... 
[15:51:53] (03CR) 10Gehel: "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551553 (https://phabricator.wikimedia.org/T238408) (owner: 10Mathew.onipe) [15:52:50] (03PS1) 10Filippo Giunchedi: install_server: fix raid1-gpt partition numbering [puppet] - 10https://gerrit.wikimedia.org/r/551560 [15:59:59] (03PS2) 10Mathew.onipe: prometheus: Use the correct parameters to build jmx wdqs-* targets [puppet] - 10https://gerrit.wikimedia.org/r/551553 (https://phabricator.wikimedia.org/T238408) [16:01:59] (03CR) 10Mathew.onipe: prometheus: Use the correct parameters to build jmx wdqs-* targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551553 (https://phabricator.wikimedia.org/T238408) (owner: 10Mathew.onipe) [16:03:42] (03PS2) 10Effie Mouzeli: (WIP) mediawiki: remove all hhvm related files and hieradata [puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) [16:03:48] (03PS3) 10Gehel: prometheus: Use the correct parameters to build jmx wdqs-* targets [puppet] - 10https://gerrit.wikimedia.org/r/551553 (https://phabricator.wikimedia.org/T238408) (owner: 10Mathew.onipe) [16:06:27] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [16:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:49] (03PS2) 10DCausse: [wdqs] add logging config for exporting updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551549 (https://phabricator.wikimedia.org/T238557) [16:08:01] (03CR) 10Gehel: [C: 03+2] prometheus: Use the correct parameters to build jmx wdqs-* targets [puppet] - 10https://gerrit.wikimedia.org/r/551553 (https://phabricator.wikimedia.org/T238408) (owner: 10Mathew.onipe) [16:08:29] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:23] (03PS4) 10CRusnov: cables: detect duplicate cable names, and blank cable names 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) [16:11:42] (03CR) 10jerkins-bot: [V: 04-1] cables: detect duplicate cable names, and blank cable names [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [16:11:59] (03CR) 10Jbond: "Tox is not intergrated correctly [i think] so ready for another set of reviews" [puppet] - 10https://gerrit.wikimedia.org/r/510613 (owner: 10Jbond) [16:14:13] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:14:14] (03CR) 10CRusnov: "Thanks!" (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: 10CRusnov) [16:14:37] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:15:53] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:16:06] 10Operations, 10ops-eqiad: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Cmjohnson) a:05RobH→03Jclark-ctr @Jclark-ctr the mgmt password for ms-be1057 is still not working, can you try again. [16:17:18] PROBLEM - Check the Netbox report cables for fail status. 
on netbox1001 is CRITICAL: cables.Cables CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:19:56] (03PS1) 10Cmjohnson: Adding mac address for ms-be1058-59 to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) [16:21:59] (03CR) 10jerkins-bot: [V: 04-1] Adding mac address for ms-be1058-59 to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) (owner: 10Cmjohnson) [16:22:04] RECOVERY - Check the Netbox report cables for fail status. on netbox1001 is OK: cables.Cables OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [16:22:37] (03PS1) 10Andrew Bogott: Revert "Remove 'globalblocks' table from maintain-views" [puppet] - 10https://gerrit.wikimedia.org/r/551566 [16:23:10] (03PS2) 10Cmjohnson: Adding mac address for ms-be1058-59 to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) [16:24:57] (03Abandoned) 10CRusnov: netbox: Setup automated DNS generation [puppet] - 10https://gerrit.wikimedia.org/r/539182 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:25:11] (03CR) 10jerkins-bot: [V: 04-1] Adding mac address for ms-be1058-59 to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) (owner: 10Cmjohnson) [16:29:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/551540 (owner: 10Muehlenhoff) [16:29:50] (03PS3) 10Cmjohnson: Adding mac address for ms-be1058-59 to dhcp file [puppet] - 10https://gerrit.wikimedia.org/r/551563 (https://phabricator.wikimedia.org/T237438) [16:30:56] (03PS1) 10Marostegui: Revert "packages_wmf: Install 10.1 with buster" [puppet] - 10https://gerrit.wikimedia.org/r/551568 [16:31:04] (03PS2) 10Marostegui: Revert "packages_wmf: Install 10.1 with buster" [puppet] - 10https://gerrit.wikimedia.org/r/551568 [16:33:07] (03CR) 10Marostegui: [C: 03+2] Revert 
"packages_wmf: Install 10.1 with buster" [puppet] - 10https://gerrit.wikimedia.org/r/551568 (owner: 10Marostegui) [16:34:09] (03PS2) 10Andrew Bogott: Revert "Remove 'globalblocks' table from maintain-views" [puppet] - 10https://gerrit.wikimedia.org/r/551566 [16:36:04] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Enable cas-server-support-saml-idp in Gradle config [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/551540 (owner: 10Muehlenhoff) [16:40:22] !log ✔️ cdanis@install1002.wikimedia.org ~ 🕦 sudo -E reprepro --restrict grafana update buster-wikimedia [16:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:00] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Cmjohnson) a:05Jgreen→03Cmjohnson I will work on network setup [16:47:41] (03PS8) 10CRusnov: Add script to generate DNS records from Netbox [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) [16:49:12] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/551572 [16:49:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1096:3316 after compression', diff saved to https://phabricator.wikimedia.org/P9656 and previous config saved to /var/cache/conftool/dbconfig/20191118-164923-marostegui.json [16:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:59] (03CR) 10CRusnov: "Latest PS implements the script side of the strategy we've discussed for deployment." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:50:32] (03CR) 10CRusnov: "I have also tested this on netbox1001 for functionality." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [16:51:00] (03CR) 10Jhedden: [C: 03+1] Revert "Remove 'globalblocks' table from maintain-views" [puppet] - 10https://gerrit.wikimedia.org/r/551566 (owner: 10Andrew Bogott) [16:51:16] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Remove 'globalblocks' table from maintain-views" [puppet] - 10https://gerrit.wikimedia.org/r/551566 (owner: 10Andrew Bogott) [16:51:47] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10crusnov) p:05Triage→03Normal [16:51:55] 10Operations, 10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (10crusnov) p:05Triage→03Normal [16:52:23] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/551572 [16:52:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add system user analytics-privatedata to the anaytics-privatedata-users group - https://phabricator.wikimedia.org/T238306 (10crusnov) p:05Triage→03Normal [16:53:11] (03PS1) 10Muehlenhoff: Reduce to ProtectSystem=true [puppet] - 10https://gerrit.wikimedia.org/r/551573 [16:53:17] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.546 ge 1 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:54:02] (03PS2) 10Jbond: Reduce to ProtectSystem=true [puppet] - 10https://gerrit.wikimedia.org/r/551573 (owner: 10Muehlenhoff) [16:54:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316 for compression', diff saved to https://phabricator.wikimedia.org/P9658 and previous config saved to 
/var/cache/conftool/dbconfig/20191118-165410-marostegui.json [16:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/551573 (owner: 10Muehlenhoff) [16:56:14] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/551572 [16:56:29] (03CR) 10Muehlenhoff: [C: 03+2] Reduce to ProtectSystem=true [puppet] - 10https://gerrit.wikimedia.org/r/551573 (owner: 10Muehlenhoff) [16:56:47] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)1 ge (W)0.2 ge 0.008333 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [16:57:33] (03PS1) 10Paladox: Revert "Gerrit: Wait 5s before starting" [puppet] - 10https://gerrit.wikimedia.org/r/551576 [16:58:09] jouncebot now [16:58:09] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [16:58:12] jouncebot next [16:58:13] In 0 hour(s) and 1 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T1700) [16:58:17] (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/551572 [16:58:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "Gerrit: Wait 5s before starting" [puppet] - 10https://gerrit.wikimedia.org/r/551576 (owner: 10Paladox) [16:58:41] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: fix raid1-gpt partition numbering [puppet] - 10https://gerrit.wikimedia.org/r/551560 (owner: 10Filippo Giunchedi) [16:58:51] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10Vgutierrez) @ema this could be caused by ats-be? ats-tls is setting would set `Proxy-Connection: close` but we are actually prevent... 
[16:59:44] (03PS2) 10Paladox: Revert "Gerrit: Wait 5s before starting" [puppet] - 10https://gerrit.wikimedia.org/r/551576 [16:59:59] (03PS1) 10Andrew Bogott: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/551577 [17:00:04] gehel and onimisionipe: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T1700). [17:00:04] (03PS3) 10Paladox: Revert "Gerrit: Wait 5s before starting" [puppet] - 10https://gerrit.wikimedia.org/r/551576 [17:00:31] (03PS5) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/551572 [17:01:00] (03CR) 10CDanis: [C: 03+1] Revert "Gerrit: Wait 5s before starting" [puppet] - 10https://gerrit.wikimedia.org/r/551576 (owner: 10Paladox) [17:02:06] (03CR) 10jerkins-bot: [V: 04-1] Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/551577 (owner: 10Andrew Bogott) [17:02:16] (03PS1) 10Andrew Bogott: Revert "Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/551579 [17:04:34] (03PS1) 10Andrew Bogott: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/551580 (https://phabricator.wikimedia.org/T237509) [17:04:36] (03PS1) 10Andrew Bogott: Revert "Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/551581 [17:06:11] (03PS6) 10CDanis: grafana1002: disable HTML sanitization in panels [puppet] - 10https://gerrit.wikimedia.org/r/551572 (https://phabricator.wikimedia.org/T220838) [17:06:22] (03CR) 10CDanis: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/19452/" [puppet] - 10https://gerrit.wikimedia.org/r/551572 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [17:07:00] (03PS1) 10Addshore: DNM: mediawiki/wikidata maint cron for updateQueryServiceLag [puppet] - 10https://gerrit.wikimedia.org/r/551582 (https://phabricator.wikimedia.org/T221774) [17:07:49] (03PS1) 10Andrew Bogott: Depool labsdb1011 [puppet] - 
10https://gerrit.wikimedia.org/r/551583 (https://phabricator.wikimedia.org/T237509) [17:07:53] (03PS1) 10Andrew Bogott: Revert "Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/551584 [17:08:19] !log Deploy schema change on db1116:3318 [17:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:39] (03CR) 10CDanis: [C: 03+2] Revert "Gerrit: Wait 5s before starting" [puppet] - 10https://gerrit.wikimedia.org/r/551576 (owner: 10Paladox) [17:09:01] (03PS2) 10Andrew Bogott: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/551577 (https://phabricator.wikimedia.org/T237509) [17:09:03] (03PS2) 10Andrew Bogott: Revert "Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/551579 [17:16:57] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog, 10Release Pipeline, and 3 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) 05Open→03Invalid This was implemented as a MediaWiki extension, and there is no plan to add an externa... [17:19:21] (03PS3) 10Physikerwelt: Enable links from math formulae on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) [17:27:02] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) considering we no have T237259 and the fact that the sooner we install the new certificate the more chance difficult services i.e. mysql servers will natural... 
[17:30:13] (03PS3) 10Mathew.onipe: query_service: add updater mode option [puppet] - 10https://gerrit.wikimedia.org/r/551167 (https://phabricator.wikimedia.org/T231411) [17:30:15] (03PS3) 10Mathew.onipe: Switch wdqs1004 to merging updater mode [puppet] - 10https://gerrit.wikimedia.org/r/551169 (https://phabricator.wikimedia.org/T231411) [17:34:12] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10Vgutierrez) From what I'm seeing on https://github.com/apache/trafficserver/blob/8.0.x/proxy/http/HttpTransact.cc#L6859-L6882 I don... [17:34:57] (03PS3) 10Andrew Bogott: Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/551577 (https://phabricator.wikimedia.org/T238480) [17:34:59] (03PS3) 10Andrew Bogott: Revert "Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/551579 (https://phabricator.wikimedia.org/T238480) [17:37:32] (03CR) 10Andrew Bogott: [C: 03+2] Depool labsdb1009 [puppet] - 10https://gerrit.wikimedia.org/r/551577 (https://phabricator.wikimedia.org/T238480) (owner: 10Andrew Bogott) [17:39:23] (03PS3) 10Gehel: [wdqs] add logging config for exporting updated entities [puppet] - 10https://gerrit.wikimedia.org/r/551549 (https://phabricator.wikimedia.org/T238557) (owner: 10DCausse) [17:39:44] (03CR) 10CDanis: [C: 03+2] grafana1002: disable HTML sanitization in panels [puppet] - 10https://gerrit.wikimedia.org/r/551572 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [17:42:05] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10Vgutierrez) as a PoC, you can force ATS to send `proxy-connection: keep-alive` using HTTP/1.0 like this: ` $ curl --http1.0 -H 'Pro... 
[17:44:12] !log rebooting grafana1002 (currently test host not used in prod)
[17:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:47] (CR) Ayounsi: "recheck" [software/netbox-reports] - https://gerrit.wikimedia.org/r/550052 (https://phabricator.wikimedia.org/T237007) (owner: CRusnov)
[18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T1800). Please do the needful.
[18:00:04] No GERRIT patches in the queue for this window AFAICS.
[18:01:38] jouncebot: next
[18:01:39] In 1 hour(s) and 58 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T2000)
[18:01:57] jouncebot: now
[18:01:58] For the next 0 hour(s) and 58 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T1800)
[18:02:22] strange..I will proceed with WDQS deployment
[18:02:41] onimisionipe: What's strange?
[18:03:23] jouncebot didnt announce WDQS deployment
[18:03:30] maybe I did something wrong in the calendar
[18:03:47] onimisionipe: It did announce it, going by the channel log.
[18:03:56] Maybe you got the time wrong?
[18:04:22] Though nothing to deploy in this swat window so you should be fine to deploy.
[18:04:31] Oh. no.. strange.
[18:07:50] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@582d394]: New WDQS build with merging updater
[18:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:27] (CR) ArielGlenn: "The dumps-related changes look fine to me."
[puppet] - 10https://gerrit.wikimedia.org/r/551527 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [18:10:38] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10Vgutierrez) so we have been able to reproduce the issue, apparently this happens when nginx is being used as the TLS termination l... [18:11:23] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Depool labsdb1009" [puppet] - 10https://gerrit.wikimedia.org/r/551579 (https://phabricator.wikimedia.org/T238480) (owner: 10Andrew Bogott) [18:11:36] (03CR) 10Andrew Bogott: [C: 03+2] Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/551580 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [18:21:17] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@582d394]: New WDQS build with merging updater (duration: 13m 27s) [18:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:35] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/551581 (owner: 10Andrew Bogott) [18:41:46] (03CR) 10Andrew Bogott: [C: 03+2] Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/551583 (https://phabricator.wikimedia.org/T237509) (owner: 10Andrew Bogott) [18:48:12] (03PS1) 10Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [18:48:28] (03CR) 10jerkins-bot: [V: 04-1] Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [18:49:39] (03PS2) 10Ottomata: Kafka producer TLS support for eventgate charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/551610 (https://phabricator.wikimedia.org/T236386) [18:50:41] 10Operations, 
10MediaWiki-extensions-Translate, 10Traffic, 10Wikidata, 10User-DannyS712: Bug: 502 error when marking page for translation - https://phabricator.wikimedia.org/T237319 (10Vgutierrez) For `2019-11-14 18:33:15 GMT`, ats-be in cp4030 is reporting `20191114.18h33m15s CONNECT: could not connect... [19:03:03] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [19:04:15] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10Ottomata) [19:04:34] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10Ottomata) [19:06:56] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10Ottomata) [19:07:48] 10Operations, 10Puppet, 10DBA, 10Patch-For-Review, 10User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (10Ottomata) [19:09:49] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) @Joe @akosiaris @ema I'd like to move forward with these patches this week, hopefully sooner rather than later. C... 
[19:14:29] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [19:15:12] 10Operations, 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, and 5 others: Public schema.wikimedia.org endpoint for schema.svc - https://phabricator.wikimedia.org/T233630 (10Ottomata) FYI for my own reference, an example of enabling envoyproxy in puppet: https://gerrit.wikimedia.org/r/c/operati... [19:16:05] (03PS1) 10Mholloway: Update wikifeeds to 2019-11-12-001916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/551615 (https://phabricator.wikimedia.org/T237790) [19:16:47] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [19:16:58] (03CR) 10Ayounsi: coherence: Alert on ACTIVE devices with names future- or spare. (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464) (owner: 10CRusnov) [19:17:31] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [19:20:09] (03CR) 10Ayounsi: "Couple questions otherwise LGTM." 
(032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [19:20:47] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [19:22:02] (03PS1) 10CDanis: grafana1002: security.cookie_secure per release notes [puppet] - 10https://gerrit.wikimedia.org/r/551616 [19:22:07] (03CR) 10CRusnov: coherence: Check device names for correct formatting (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [19:22:30] (03CR) 10Ayounsi: [C: 03+1] coherence: Check device names for correct formatting [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [19:23:58] (03CR) 10jerkins-bot: [V: 04-1] grafana1002: security.cookie_secure per release notes [puppet] - 10https://gerrit.wikimedia.org/r/551616 (owner: 10CDanis) [19:24:37] 10Operations, 10ops-codfw, 10ops-eqiad, 10netbox: Document PDU models - https://phabricator.wikimedia.org/T227632 (10RobH) I'm not sure where the documentation needs to be updated, did you mean in netbox or in puppet? If netbox do you mean the items with the generic 'Smart CDU' & 'Switched CDU' entries li... [19:26:41] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/551584 (owner: 10Andrew Bogott) [19:27:22] 10Operations, 10ops-codfw, 10ops-eqiad, 10netbox: Document PDU models - https://phabricator.wikimedia.org/T227632 (10ayounsi) In Netbox, `Smart CDU & Switched CDU` are generic Types. They should be replaced by the exact model. 
[19:31:27] 10Operations, 10Traffic, 10Inuka-Team (Kanban), 10Patch-For-Review, 10Performance-Team (Radar): Code for InukaPageView instrumentation - https://phabricator.wikimedia.org/T238029 (10Krinkle) >>! @Krinkle wrote on Gerrit (Patch Set 13): > This seems to vary server-response for anon users by aspects (user-... [19:33:11] (03PS2) 10CDanis: grafana1002: security.cookie_secure per release notes [puppet] - 10https://gerrit.wikimedia.org/r/551616 (https://phabricator.wikimedia.org/T220838) [19:34:33] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19453/" [puppet] - 10https://gerrit.wikimedia.org/r/551616 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [19:35:44] 10Operations, 10DC-Ops, 10hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (10Andrew) a:03wiki_willy Are these racked? Or is there a task for racking? [19:38:59] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) [19:40:07] 10Operations, 10DC-Ops, 10hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (10wiki_willy) Hey @Andrew - I think this is the racking task for these - https://phabricator.wikimedia.org/T235685. Servers arrived on November 5. Thanks, Willy [19:41:36] can someone merge https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/551180/ so that we can test the new feature on beta? 
[19:44:49] (03CR) 10Dzahn: [C: 03+2] install_server: switch phab2001 to buster [puppet] - 10https://gerrit.wikimedia.org/r/551287 (https://phabricator.wikimedia.org/T190568) (owner: 10Dzahn) [19:44:56] (03PS2) 10Dzahn: install_server: switch phab2001 to buster [puppet] - 10https://gerrit.wikimedia.org/r/551287 (https://phabricator.wikimedia.org/T190568) [19:45:28] 10Operations, 10DC-Ops, 10hardware-requests: eqiad: three clouvirt-wdqs servers for WDQS testing - https://phabricator.wikimedia.org/T232654 (10Andrew) [19:45:30] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) [19:45:49] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) [19:46:29] (03CR) 10CRusnov: [C: 03+2] coherence: Check device names for correct formatting [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550917 (https://phabricator.wikimedia.org/T237469) (owner: 10CRusnov) [19:51:25] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:53:47] (03Abandoned) 10Hashar: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [19:53:51] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 22.18 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:56:40] (03PS1) 10CDanis: graphite: add grafana-beta to cors_origins [puppet] - 10https://gerrit.wikimedia.org/r/551624 (https://phabricator.wikimedia.org/T220838) [19:57:15] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 93.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:57:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 56.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:58:36] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/compiler1002/19454/" [puppet] - 10https://gerrit.wikimedia.org/r/551624 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:00:05] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T2000). 
[20:00:47] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 83.03 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:14:06] (03CR) 10Dzahn: [C: 03+1] graphite: add grafana-beta to cors_origins [puppet] - 10https://gerrit.wikimedia.org/r/551624 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:14:46] (03CR) 10CDanis: [C: 03+2] graphite: add grafana-beta to cors_origins [puppet] - 10https://gerrit.wikimedia.org/r/551624 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:19:32] (03PS1) 10CDanis: grafana-beta: rename to grafana-next [dns] - 10https://gerrit.wikimedia.org/r/551633 (https://phabricator.wikimedia.org/T220838) [20:20:12] (03CR) 10Dzahn: [C: 03+1] "not labs and not cloud :)" [dns] - 10https://gerrit.wikimedia.org/r/551633 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:20:33] (03CR) 10CDanis: [C: 03+2] grafana-beta: rename to grafana-next [dns] - 10https://gerrit.wikimedia.org/r/551633 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:23:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1098:3316 after some compression', diff saved to https://phabricator.wikimedia.org/P9659 and previous config saved to /var/cache/conftool/dbconfig/20191118-202259-marostegui.json [20:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:58] (03PS1) 10CDanis: grafana-beta: rename to grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/551636 (https://phabricator.wikimedia.org/T220838) [20:26:04] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/551636 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:27:30] (03CR) 10CDanis: [C: 03+2] grafana-beta: rename to grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/551636 
(https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [20:31:57] (03CR) 10Dzahn: [C: 03+1] "@Giuseppe I checked about the proxy config in Apache and actually puppet code is smart enough. In the .erb template for Apache config it i" [puppet] - 10https://gerrit.wikimedia.org/r/551271 (owner: 10Dzahn) [20:32:06] (03PS2) 10Dzahn: phabricator: disable aphlict [puppet] - 10https://gerrit.wikimedia.org/r/551271 [20:33:59] (03PS3) 10Dzahn: phabricator: disable aphlict [puppet] - 10https://gerrit.wikimedia.org/r/551271 [20:34:34] (03CR) 10Dzahn: "As Mukunda pointed out aphlict does not have to run on the same server. We could also move it somewhere else." [puppet] - 10https://gerrit.wikimedia.org/r/551271 (owner: 10Dzahn) [20:41:02] (03PS15) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [20:41:42] (03PS16) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [20:47:48] (03PS4) 10Dzahn: phabricator: disable aphlict [puppet] - 10https://gerrit.wikimedia.org/r/551271 (https://phabricator.wikimedia.org/T238593) [20:49:45] 10Operations, 10Traffic: Proxy-connection HTTP response header being sent to some users in some cases causing HTTP/2 protocol errors - https://phabricator.wikimedia.org/T238509 (10Krenair) a:03Vgutierrez Thanks Valentin. [20:57:26] (03CR) 10Dzahn: [C: 03+2] phabricator: disable aphlict [puppet] - 10https://gerrit.wikimedia.org/r/551271 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [20:59:10] !log phab1003 - re-enabling puppet after merging gerrit::551271 - making sure aphlict stays disabled incl. 
the apache config ProxyPass lines using mod_proxy_wstunnel (T238593) [20:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:26] T238593: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 [21:00:04] Reedy and sbassett: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T2100). [21:04:51] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) I'm getting the following ` Warning: Downgrading to PSON for future requests Info: Using configured environment '... [21:07:58] (03CR) 10Dzahn: "Would it be possible to add a runbook to https://www.mediawiki.org/wiki/Continuous_integration/Zuul for this check? Even if that is just " [puppet] - 10https://gerrit.wikimedia.org/r/551347 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [21:08:02] (03CR) 10Dzahn: [C: 03+2] Revert "zuul: Remove zuul Gearman queue alert" [puppet] - 10https://gerrit.wikimedia.org/r/551347 (https://phabricator.wikimedia.org/T70113) (owner: 10Hashar) [21:08:49] (03PS3) 10Dzahn: Gerrit: Symlink lib/mysql-connector to gerrit deployment repo [puppet] - 10https://gerrit.wikimedia.org/r/548552 (owner: 10Paladox) [21:10:20] (03PS18) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [21:10:46] (03CR) 10Dzahn: "now after https://gerrit.wikimedia.org/r/c/operations/puppet/+/551285 i assume" [puppet] - 10https://gerrit.wikimedia.org/r/549906 (https://phabricator.wikimedia.org/T232883) (owner: 10Dzahn) [21:14:48] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - 
https://phabricator.wikimedia.org/T224528 (10Andrew) [21:14:50] 10Operations, 10ops-codfw, 10Cloud-Services: Build, package bdsync for Buster - https://phabricator.wikimedia.org/T234683 (10Andrew) 05Open→03Resolved [21:15:10] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) 05Open→03Resolved [21:15:41] (03PS19) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [21:27:46] (03PS20) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [21:31:29] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:38:44] (03PS1) 10Andrew Bogott: Attempt at normalizing clouddb2001-dev a bit [puppet] - 10https://gerrit.wikimedia.org/r/551652 [21:39:21] !log arlolra@deploy1001 Started deploy [parsoid/deploy@c6a457f]: Updating Parsoid to 2245b8f [21:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:02] (03PS1) 10CDanis: grafana: cert: add SANs grafana-next and grafana1002 [puppet] - 10https://gerrit.wikimedia.org/r/551653 (https://phabricator.wikimedia.org/T220838) [21:44:44] (03CR) 10CDanis: [C: 03+2] "X509v3 Subject Alternative Name:" [puppet] - 10https://gerrit.wikimedia.org/r/551653 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [21:44:59] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "DNS:grafana.discovery.wmnet, DNS:grafana.svc.eqiad.wmnet, DNS:grafana.svc.codfw.wmnet, DNS:grafana1001.eqiad.wmnet, DNS:grafana1002.eqiad." 
[puppet] - 10https://gerrit.wikimedia.org/r/551653 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [21:46:25] (03PS21) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [21:47:44] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@c6a457f]: Updating Parsoid to 2245b8f (duration: 08m 22s) [21:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:43] (03CR) 10Mholloway: [C: 03+2] Update wikifeeds to 2019-11-12-001916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/551615 (https://phabricator.wikimedia.org/T237790) (owner: 10Mholloway) [21:52:57] (03Merged) 10jenkins-bot: Update wikifeeds to 2019-11-12-001916-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/551615 (https://phabricator.wikimedia.org/T237790) (owner: 10Mholloway) [21:55:13] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10Jclark-ctr) a:05Jclark-ctr→03RobH @Cmjohnson entered mgmt password for ms-be1057 again [21:55:48] 10Operations: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (10CDanis) [21:56:01] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . 
[21:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:16] 10Operations: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (10crusnov) p:05Triage→03Normal [21:56:52] !log Upgraded Parsoid to 2245b8f (T237886, T237103, T236864, T237569, T236930, T237463, T236867, T234266) [21:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:09] T237886: Argument 1 passed to Parsoid\Utils\PHPUtils::encodeURIComponent() must be of the type string, integer given - https://phabricator.wikimedia.org/T237886 [21:57:10] T237463: Resolve all PORT-FIXMEs - https://phabricator.wikimedia.org/T237463 [21:57:10] T236867: PHP Notice: Trying to get property 'parsoid' of non-object - https://phabricator.wikimedia.org/T236867 [21:57:10] T234266: Cannot read property 'stored' of undefined - https://phabricator.wikimedia.org/T234266 [21:57:11] T236864: Call to a member function getArticleID() on null - https://phabricator.wikimedia.org/T236864 [21:57:11] T237103: PHP Notice: Undefined offset: 2 - https://phabricator.wikimedia.org/T237103 [21:57:11] T236930: API Developer supports different request media types - https://phabricator.wikimedia.org/T236930 [21:57:12] T237569: Linter extension is currently incompatible with Parsoid/PHP - https://phabricator.wikimedia.org/T237569 [21:57:24] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [21:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:07] (03PS2) 10Andrew Bogott: Add some standard maintenance access to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/551652 (https://phabricator.wikimedia.org/T238514) [21:59:11] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . 
[21:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:59] (03PS22) 10ArielGlenn: store generated misc cron dump output on second nfs server [puppet] - 10https://gerrit.wikimedia.org/r/447402 (https://phabricator.wikimedia.org/T200180) [22:01:04] (03PS3) 10Andrew Bogott: Add some standard maintenance access to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/551652 (https://phabricator.wikimedia.org/T238514) [22:03:14] (03PS4) 10Andrew Bogott: Add some standard maintenance access to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/551652 (https://phabricator.wikimedia.org/T238514) [22:05:45] (03PS17) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [22:07:54] (03PS5) 10Andrew Bogott: Add some standard maintenance access to clouddb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/551652 (https://phabricator.wikimedia.org/T238514) [22:17:21] (03PS18) 10Herron: logstash: introduce logstash 7 and openjdk-11 support [puppet] - 10https://gerrit.wikimedia.org/r/548880 (https://phabricator.wikimedia.org/T217340) [22:22:00] (03PS1) 10CDanis: grafana: quickdatacopy 1001->1002 for plugins and for pngs [puppet] - 10https://gerrit.wikimedia.org/r/551660 (https://phabricator.wikimedia.org/T220838) [22:25:23] (03PS2) 10CDanis: grafana: quickdatacopy 1001->1002 for plugins and for pngs [puppet] - 10https://gerrit.wikimedia.org/r/551660 (https://phabricator.wikimedia.org/T220838) [22:26:37] (03PS4) 10Ayounsi: Automatically cast network strings to ipaddress objects [software/homer] - 10https://gerrit.wikimedia.org/r/551273 [22:28:18] (03CR) 10Ayounsi: Automatically cast network strings to ipaddress objects (032 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/551273 (owner: 10Ayounsi) [22:29:59] (03PS3) 10CDanis: grafana: quickdatacopy 1001->1002 for plugins and for pngs [puppet] - 
10https://gerrit.wikimedia.org/r/551660 (https://phabricator.wikimedia.org/T220838) [22:30:01] (03PS1) 10CDanis: rsync: fix multiple usages of quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/551667 (https://phabricator.wikimedia.org/T237424) [22:31:18] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` phab2001.co... [22:31:37] !log phab2001 - reinstalling with buster (T190568) [22:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:41] T190568: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 [22:33:14] (03CR) 10jerkins-bot: [V: 04-1] rsync: fix multiple usages of quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/551667 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [22:35:15] (03CR) 10Dzahn: [C: 03+1] "code change looks good. 
jenkins just doesn't like the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/551667 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [22:36:31] (03CR) 10Dzahn: [C: 03+1] ""Fixes:" is like a magic word around here :)" [puppet] - 10https://gerrit.wikimedia.org/r/551667 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [22:37:02] (03PS2) 10CDanis: rsync: fix multiple usages of quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/551667 (https://phabricator.wikimedia.org/T237424) [22:37:04] (03PS4) 10CDanis: grafana: quickdatacopy 1001->1002 for plugins and for pngs [puppet] - 10https://gerrit.wikimedia.org/r/551660 (https://phabricator.wikimedia.org/T220838) [22:37:29] PROBLEM - PyBal backends health check on lvs2005 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:37:49] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:38:01] lol I'm pretty sure that isn't the first time I've ran into the 'Fixes:' thing [22:38:02] 10Operations, 10serviceops: envoyproxy does not automatically reload certificates - https://phabricator.wikimedia.org/T238597 (10Dzahn) [22:38:48] argg.. and the Pybal alerts are from me reinstalling that server [22:38:59] should have removed that first [22:39:22] cdanis: hehe, yes [22:39:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001.codfw.wmnet [22:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:31] depooled. rescheduled next service check... 
[22:40:48] (03CR) 10CDanis: [C: 03+2] rsync: fix multiple usages of quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/551667 (https://phabricator.wikimedia.org/T237424) (owner: 10CDanis) [22:41:01] PROBLEM - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab2001-vcs.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:41:11] (03CR) 10CDanis: [C: 03+2] grafana: quickdatacopy 1001->1002 for plugins and for pngs [puppet] - 10https://gerrit.wikimedia.org/r/551660 (https://phabricator.wikimedia.org/T220838) (owner: 10CDanis) [22:42:45] PROBLEM - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab2001-vcs.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [22:43:11] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) [22:43:37] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2002 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab2001-vcs.codfw.wmnet]) daniel_zahn phab2001 reimaging https://wikitech.wikimedia.org/wiki/PyBal [22:43:37] ACKNOWLEDGEMENT - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn phab2001 reimaging https://wikitech.wikimedia.org/wiki/PyBal [22:43:37] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2005 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([phab2001-vcs.codfw.wmnet]) daniel_zahn phab2001 reimaging https://wikitech.wikimedia.org/wiki/PyBal [22:43:37] ACKNOWLEDGEMENT - PyBal backends health check on lvs2005 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are 
marked down but pooled daniel_zahn phab2001 reimaging https://wikitech.wikimedia.org/wiki/PyBal [22:43:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) @MoritzMuehlenhoff Hi! I updated the SSH key. [22:44:09] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [22:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Dzahn) 05Stalled→03Open [22:51:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10Nuria) @CGlenn Please take the time to read: https://wikitech.wikimedia.org/wiki/Analytics/Data_access [22:55:39] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:56:05] (03CR) 10Bstorm: toolforge: proxy: adjust setup for the new k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543135 (https://phabricator.wikimedia.org/T234037) (owner: 10Arturo Borrero Gonzalez) [22:56:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:26] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT(Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191118T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:05:47] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:06:19] (03PS1) 10Jhedden: ceph: preserve docker iptables chains [puppet] - 10https://gerrit.wikimedia.org/r/551677 [23:07:40] (03PS8) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [23:08:04] (03CR) 10Jhedden: "While this does solve the ferm restart issue, I'm a bit leery of the `preserve` process." [puppet] - 10https://gerrit.wikimedia.org/r/551677 (owner: 10Jhedden) [23:09:38] (03PS2) 10Jhedden: ceph: preserve docker iptables chains [puppet] - 10https://gerrit.wikimedia.org/r/551677 [23:10:09] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) upgraded the pie chart plugin to a recent version that actually works with 6.x: `❌cdanis@grafana1002.eqiad.wmnet ~ 🕕🍺 sudo http_proxy=http://webprox... [23:10:33] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:10] 10Operations, 10DBA, 10SRE-Access-Requests, 10Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (10Dzahn) @Aklapper Given that Phabricator uses multiple databases, do you just need the "phabri... 
[23:18:35] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade grafana to 6.4.4 - https://phabricator.wikimedia.org/T220838 (10CDanis) [23:20:38] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['phab2001.codfw.wmnet'] ` and were **ALL** successful. [23:32:03] !log catrope@deploy1001 Started scap: Update GrowthExperiments to master in wmf.5 (includes i18n) [23:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:34] !log phab2001 - restart ssh-phab service after reimaging (some race condition binding to the IP before getting it on the interface after fresh install .. reschedule pybal checks (T190568) [23:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:40] T190568: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 [23:39:21] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:47:25] (03PS2) 10EBernhardson: secret: dummy credentials for airflow [labs/private] - 10https://gerrit.wikimedia.org/r/544993 [23:52:00] !log catrope@deploy1001 Finished scap: Update GrowthExperiments to master in wmf.5 (includes i18n) (duration: 19m 57s) [23:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:36] (03CR) 10Ayounsi: Initial templating for CR routing-options (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [23:56:55] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10Moushira) [23:57:07] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Create mailing list for project GLOW - https://phabricator.wikimedia.org/T238607 (10Moushira)