[00:01:36] (03CR) 10BryanDavis: args: A few fixups (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:08:16] (03CR) 10Bstorm: "On the python3 changes, I think the burden should be on installing packages to continue to keep py2 on life support rather than changing t" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:10:42] (03CR) 10Bstorm: args: A few fixups (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:10:58] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (10Papaul) [00:16:01] (03PS2) 10Bstorm: args: A few fixups [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) [00:16:30] (03CR) 10Bstorm: args: A few fixups (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:22:20] (03CR) 10Bstorm: "Actually, the "right way" to do this is add python-configparser to the dependencies for the deb package. Oops." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:27:48] (03PS3) 10Bstorm: args: A few fixups [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) [00:28:24] (03PS2) 10BryanDavis: Replace pykube with a custom API client [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) [00:30:53] (03PS3) 10BryanDavis: Replace pykube with a custom API client [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) [00:32:37] (03CR) 10BryanDavis: Replace pykube with a custom API client (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) (owner: 10BryanDavis) [00:33:55] (03CR) 10Bstorm: args: A few fixups (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:35:31] (03CR) 10BryanDavis: [C: 03+2] "Nice work Brooke." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:38:53] (03Merged) 10jenkins-bot: args: A few fixups [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587369 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:42:24] (03PS2) 10BryanDavis: Yet another package rename mega patch [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/585963 (https://phabricator.wikimedia.org/T249079) [00:42:26] (03PS4) 10BryanDavis: Replace pykube with a custom API client [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/586162 (https://phabricator.wikimedia.org/T197930) [00:44:24] (03Abandoned) 10Bstorm: toolforge: ensure the python2 backport of configparser is installed [puppet] - 10https://gerrit.wikimedia.org/r/587372 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [00:53:27] (03PS1) 10Bstorm: d/changelog: prepare for 0.66 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587383 (https://phabricator.wikimedia.org/T249390) [01:03:45] (03CR) 10Bstorm: [C: 03+2] d/changelog: prepare for 0.66 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587383 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [01:06:54] (03Merged) 10jenkins-bot: d/changelog: prepare for 0.66 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587383 (https://phabricator.wikimedia.org/T249390) (owner: 10Bstorm) [02:27:28] just got [Xo02bgpAAD8AAH5uFswAAABX] 2020-04-08 02:27:03: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" while locking an account [02:28:09] I think it was due to conflict while doing it [02:38:28] (03PS3) 10Ppchelko: Use Request-Timeout header to set jobrunner PHP timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577642 (https://phabricator.wikimedia.org/T247114) [02:44:00] (03PS2) 10Ppchelko: Eventgate-main: add mediawiki/page-suppress stream 
config [deployment-charts] - 10https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) [02:44:02] (03PS2) 10Ppchelko: Changeprop: Listen to mediawiki.page-suppress topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/584672 (https://phabricator.wikimedia.org/T242025) [02:44:37] (03CR) 10Ppchelko: "PS2 is a manual rebase" [deployment-charts] - 10https://gerrit.wikimedia.org/r/584672 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [03:16:31] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Nuria) Like I mentioned to @Mooeypoo in irc my concern here is that we are trying to bridge the... [04:36:31] !log rolling restart of ats-tls - T249335 [04:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:38] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [04:57:50] PROBLEM - PHP opcache health on mw2358 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:05:38] (03PS3) 10Jcrespo: backups: Assume backups have its ssds on sda and sdb for partman [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) [05:05:40] (03CR) 10Jcrespo: "Fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) (owner: 10Jcrespo) [05:07:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::services_proxy: allow adding XFP header, enable on parsoid/restbase (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/587227 (https://phabricator.wikimedia.org/T249535) (owner: 10Giuseppe Lavagetto) [05:13:04] (03CR) 10Papaul: [C: 03+2] backups: Assume backups have its ssds on sda and sdb for partman [puppet] - 
10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) (owner: 10Jcrespo) [05:13:23] (03PS2) 10Giuseppe Lavagetto: profile::services_proxy: allow adding XFP header, enable on parsoid/restbase [puppet] - 10https://gerrit.wikimedia.org/r/587227 (https://phabricator.wikimedia.org/T249535) [05:14:39] (03PS4) 10Jcrespo: backups: Assume backups have their ssds on sda and sdb for partman [puppet] - 10https://gerrit.wikimedia.org/r/587214 (https://phabricator.wikimedia.org/T248934) [05:16:08] RECOVERY - PHP opcache health on mw2358 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:20:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::services_proxy: allow adding XFP header, enable on parsoid/restbase [puppet] - 10https://gerrit.wikimedia.org/r/587227 (https://phabricator.wikimedia.org/T249535) (owner: 10Giuseppe Lavagetto) [05:28:34] RECOVERY - Ensure local MW versions match expected deployment on wtp1025 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [05:33:11] <_joe_> good [05:33:48] <_joe_> !log repooling wtp1025, with envoy and logging any error above 404 T249535 [05:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:54] T249535: VE and Flow fail with "Error contacting the Parsoid/RESTBase server (HTTP 404)" / "…(HTTP 411)" on officewiki - https://phabricator.wikimedia.org/T249535 [05:34:10] !log Deploy schema change on dbstore1004:3313 [05:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:34] RECOVERY - mediawiki-installation DSH group on wtp1025 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [05:38:52] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:38:58] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: 
Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:42:46] 10Operations, 10Wikimedia-Mailing-lists: add oauth login to mailing lists - https://phabricator.wikimedia.org/T249678 (10Gryllida) [05:44:26] !log rolling upgrade ATS to 8.0.6-1wm6 in cp[5006,5012,3065,3064,2042,2041,1090,1089] [05:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:18] PROBLEM - Check no envoy runtime configuration is left persistent on wtp1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string entries: {} not found on http://localhost:9631/runtime - 392 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [05:46:33] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (10Majavah) [05:46:51] <_joe_> the envoy alert is me ofc [05:51:19] bad _joe_ [05:51:23] * elukey runs away [05:52:33] (03PS5) 10Elukey: kibana: add kibana to relforge [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [05:54:32] (03PS6) 10Elukey: role::elasticsearch::relforce: add kibana [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [05:54:57] (03PS7) 10Elukey: role::elasticsearch::relforge: add kibana [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [05:56:24] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:00:38] RECOVERY - Check no envoy runtime configuration is left persistent on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [06:07:56] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:08:34] (03CR) 10Elukey: [C: 03+2] role::elasticsearch::relforge: add kibana [puppet] - 10https://gerrit.wikimedia.org/r/586460 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [06:11:10] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) Looks like there are no more connections going through dbproxy1011: ` root@cumin1001:/home/marostegui# host dbproxy1011 dbproxy1011.eqiad.wmnet... [06:11:24] !log Stop haproxy on dbproxy1011 - T231520 [06:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:30] T231520: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 [06:11:57] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) I have stopped haproxy on dbproxy1011 [06:12:48] (03CR) 10Muehlenhoff: "Buster + random old docker-ce needs to be tested the same way that Buster + docker.io needs to be tested. 
contint2001 will be the inactive" [puppet] - 10https://gerrit.wikimedia.org/r/586203 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [06:13:38] RECOVERY - Check systemd state on wtp1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:14:38] (03PS1) 10Marostegui: dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/587403 (https://phabricator.wikimedia.org/T249188) [06:15:36] PROBLEM - PHP opcache health on mw2350 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:22:26] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) [06:22:46] 10Operations, 10Repository-Admins, 10Traffic: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (10ema) >>! In T249606#6036981, @Dzahn wrote: > Please see https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests Done, thanks @Dzahn. 
[06:24:49] (03PS2) 10Muehlenhoff: Initial version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 [06:25:17] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) p:05Triage→03High [06:26:04] (03PS1) 10Ema: Revert "ATS: Disable wmf-analytics log" [puppet] - 10https://gerrit.wikimedia.org/r/587422 (https://phabricator.wikimedia.org/T249335) [06:26:09] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) [06:26:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "ATS: Disable wmf-analytics log" [puppet] - 10https://gerrit.wikimedia.org/r/587422 (https://phabricator.wikimedia.org/T249335) (owner: 10Ema) [06:26:31] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) [06:27:24] (03PS1) 10Vgutierrez: ATS: Enable inbound TLSv1.3 in text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/587423 (https://phabricator.wikimedia.org/T170567) [06:29:06] (03Abandoned) 10Muehlenhoff: Remove puppet-common [puppet] - 10https://gerrit.wikimedia.org/r/583335 (owner: 10Muehlenhoff) [06:29:22] (03PS6) 10Muehlenhoff: profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 [06:29:40] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/21762/" [puppet] - 10https://gerrit.wikimedia.org/r/587423 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [06:31:00] !log Deploy schema change on db1095:3313 [06:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:25] (03PS1) 10Elukey: role::elasticsearch::relforge: use kibana-oss instead of kibana [puppet] - 10https://gerrit.wikimedia.org/r/587424 (https://phabricator.wikimedia.org/T246961) [06:35:34] RECOVERY - PHP opcache health on mw2350 is OK: OK: opcache 
is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [06:36:54] (03CR) 10Elukey: [C: 03+2] role::elasticsearch::relforge: use kibana-oss instead of kibana [puppet] - 10https://gerrit.wikimedia.org/r/587424 (https://phabricator.wikimedia.org/T246961) (owner: 10Elukey) [06:42:18] RECOVERY - DPKG on wtp1025 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:34] (03PS1) 10Elukey: profile::kibana: add the package_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/587427 (https://phabricator.wikimedia.org/T246961) [06:52:19] 10Operations, 10Performance-Team: Occasional NIC Tx bandwidth saturation for mc1027 - https://phabricator.wikimedia.org/T248962 (10elukey) @aaron one thing that it would be useful is, in my opinion, having instrumentation in MediaWiki about key size volume/bytes. Even per "key family" would be enough, just to... [06:59:25] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10ayounsi) FYI, `kafka-jumbo1008` switch port has been flapping and flooding logs. Please disable the switch port if the host is neither in production nor be... 
[07:03:44] 10Operations, 10ops-eqiad: msw-a2-eqiad missing from Netbox - https://phabricator.wikimedia.org/T249685 (10ayounsi) p:05Triage→03Low [07:03:59] (03PS1) 10Muehlenhoff: Setup idp-test2001 as IDP staging host [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) [07:16:42] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10Jpita) sure [07:23:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P10937 and previous config saved to /var/cache/conftool/dbconfig/20200408-072331-marostegui.json [07:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:39] !log Deploy schema change on db1075 [07:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:12] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Ladsgroup) CREATE is already there but I want to emphasize that we need it for two reasons: 1- Sometimes devs, with coordination with DBAs, create tables in production. I have done it... 
[07:33:36] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:33:36] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:33:50] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:34:50] <_joe_> this is me actually running puppet on those servers [07:38:25] (03CR) 10JMeybohm: Initial version (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [07:39:24] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:39:24] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:39:38] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [07:43:52] (03CR) 10Dzahn: [C: 04-1] "this will be obsolete if we do TLS termination in envoy. but i'll wait before abandoning it." [puppet] - 10https://gerrit.wikimedia.org/r/587225 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [07:43:59] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) >>! In T244506#6038826, @ayounsi wrote: > FYI, `kafka-jumbo1008` switch port has been flapping and flooding logs. > > Please disable the switch por... 
[07:45:23] (03PS1) 10Giuseppe Lavagetto: parsoid: switch to envoy, take 2 [puppet] - 10https://gerrit.wikimedia.org/r/587490 (https://phabricator.wikimedia.org/T247389) [07:45:38] 10Operations, 10docker-pkg, 10serviceops: Investigate why the apt configuration of the wikimedia-buster docker image doesn't seem to prefer wikimedia packages - https://phabricator.wikimedia.org/T249218 (10JMeybohm) a:03JMeybohm [07:45:47] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Marostegui) >>! In T249059#6037278, @Mooeypoo wrote: > > > However, we are still in need of a... [07:49:20] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10jcrespo) > creating indexes as well, right No, that would require ALTER rights. Creates are a non issue, that is why we let them be created by owners (e.g new extension or new wiki).... 
[07:50:44] (03PS2) 10Giuseppe Lavagetto: parsoid: switch to envoy, take 2 [puppet] - 10https://gerrit.wikimedia.org/r/587490 (https://phabricator.wikimedia.org/T247389) [07:52:11] (03CR) 10Muehlenhoff: Initial version (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [07:52:15] (03CR) 10DannyS712: Initial version (036 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [07:57:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/587370 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [07:57:10] (03PS3) 10Muehlenhoff: Initial version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 [07:57:12] (03CR) 10Muehlenhoff: Initial version (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [07:57:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21764/ SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/587490 (https://phabricator.wikimedia.org/T247389) (owner: 10Giuseppe Lavagetto) [08:04:09] (03PS2) 10Muehlenhoff: Setup idp-test2001 as IDP staging host [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) [08:04:40] PROBLEM - Check systemd state on wtp1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:41] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) Nope serial settings are good, but I have powered it down to avoid spamming logs while we work on partman. [08:10:30] PROBLEM - Check systemd state on wtp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:11:33] <_joe_> the systemd thing is my fault [08:11:50] PROBLEM - Check systemd state on wtp2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:04] PROBLEM - Check systemd state on wtp2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:24] <_joe_> and not worrisome [08:12:48] PROBLEM - Check systemd state on wtp2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:51] 10Operations, 10Wikimedia-Mailing-lists: add oauth login to mailing lists - https://phabricator.wikimedia.org/T249678 (10Aklapper) 05Open→03Stalled What is "login to mailing lists"? Is this about the website https://lists.wikimedia.org/ ? Or emails? Or something else? Please read https://www.mediawiki.org/... [08:13:42] RECOVERY - Check systemd state on wtp1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:21] (03PS1) 10Dzahn: ci: allow rsyncing of data dirs for server migrations [puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) [08:14:34] PROBLEM - Check systemd state on wtp2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:26] PROBLEM - Check systemd state on wtp2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:27] <_joe_> !log switching parsoid to envoy (take 2) in eqiad [08:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:48] PROBLEM - Check systemd state on wtp2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:30] (03CR) 10Filippo Giunchedi: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [08:18:48] PROBLEM - Check systemd state on wtp2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:08] PROBLEM - Check systemd state on wtp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:46] PROBLEM - Check systemd state on wtp2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:09] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) this is what is displayed before the error msg that Chris pointed out: ` ┌─────────────────────────┤ [!] Partition disks ├───────────────────────... [08:22:16] PROBLEM - Check systemd state on wtp2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:23:58] RECOVERY - Check systemd state on wtp2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:00] RECOVERY - Check systemd state on wtp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:00] RECOVERY - Check systemd state on wtp2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:06] RECOVERY - Check systemd state on wtp2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:06] RECOVERY - Check systemd state on wtp2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:06] RECOVERY - Check systemd state on wtp2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:06] RECOVERY - Check systemd state on wtp2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:06] RECOVERY - Check systemd state on wtp2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:26] RECOVERY - Check systemd state on wtp2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:24:28] RECOVERY - Check systemd state on wtp2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:22] 10Operations, 10observability, 10serviceops: write some recording rules for queries used in the appserver RED dashboard - https://phabricator.wikimedia.org/T249663 (10MoritzMuehlenhoff) 
p:05Triage→03Medium [08:33:42] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:34:11] 10Operations, 10Repository-Admins, 10Traffic: Requesting new gerrit project repository "operations/software/purged" - https://phabricator.wikimedia.org/T249606 (10MoritzMuehlenhoff) p:05Triage→03Medium [08:35:12] 10Operations, 10SRE-swift-storage: Sanity check global-multiwrite logs for ConfirmEdit usage - https://phabricator.wikimedia.org/T159830 (10MoritzMuehlenhoff) @Reedy, @fgiunchedi : Is there anything actionable left for this task? [08:39:12] !log upgrade grafana on grafana1002 - T244208 [08:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:18] T244208: Upgrade Grafana to 6.7 - https://phabricator.wikimedia.org/T244208 [08:41:50] RECOVERY - Check systemd state on wtp2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:46:21] !log Rename wb_terms and recreate views on labsdb1009-labsdb1011 - T248592 T248086 [08:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:31] T248592: Move wb_terms data in cloud replicas to wb_terms_no_longer_updated - https://phabricator.wikimedia.org/T248592 [08:46:34] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [08:48:57] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (10MoritzMuehlenhoff) a:05elukey→03MoritzMuehlenhoff @dpifke @Peter @aaron I've created Kerberos accounts for you, you'll have received an email wi... 
[08:49:16] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [08:53:30] (03PS1) 10Muehlenhoff: Annotate Kerberos accounts for dpifke, phedenskog and aaron [puppet] - 10https://gerrit.wikimedia.org/r/587492 (https://phabricator.wikimedia.org/T248797) [08:56:21] (03PS2) 10Dzahn: ci: allow rsyncing of data dirs for server migrations [puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) [09:02:27] (03PS3) 10Ayounsi: Logstash: parse Juniper PFE firewall syslog [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) [09:02:34] (03CR) 10Muehlenhoff: [C: 03+2] Annotate Kerberos accounts for dpifke, phedenskog and aaron [puppet] - 10https://gerrit.wikimedia.org/r/587492 (https://phabricator.wikimedia.org/T248797) (owner: 10Muehlenhoff) [09:03:23] (03CR) 10Ayounsi: Logstash: parse Juniper PFE firewall syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [09:03:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (10MoritzMuehlenhoff) 05Open→03Resolved Closing the task, please reopen if there are any issues. [09:03:42] (03CR) 10Dzahn: [C: 03+1] "This compiles and will install rsyncd on contint2001 and allow pushing to it from contint1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [09:04:54] (03CR) 10JMeybohm: [C: 03+1] Initial version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [09:05:52] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:06:04] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Initial version [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/587219 (owner: 10Muehlenhoff) [09:07:55] !log pooling wdqs200[78] - new servers ready to go! - T246343 [09:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:01] T246343: Service implementation on wdqs200[7-8].codfw.wmnet - https://phabricator.wikimedia.org/T246343 [09:08:28] (03CR) 10Jcrespo: [C: 03+1] dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/587403 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [09:08:50] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:09:28] (03CR) 10Marostegui: [C: 03+2] dbproxy: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/587403 (https://phabricator.wikimedia.org/T249188) (owner: 10Marostegui) [09:10:30] (03PS3) 10Dzahn: ci: allow rsyncing of data dirs for server migrations [puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) [09:10:45] !log Reload proxies on dbproxy1018 and dbproxy1019 to depool labsdb1011 - T249188 T248592 [09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:51] T248592: Move wb_terms data in cloud replicas to wb_terms_no_longer_updated - https://phabricator.wikimedia.org/T248592 [09:10:52] T249188: Reimage labsdb1011 to Buster and 10.4 - 
https://phabricator.wikimedia.org/T249188 [09:11:22] !log setting weight=10 for all pooled wdqs servers in codfw - T246343 [09:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:59] (03PS4) 10Dzahn: ci: allow rsyncing of data dirs for server migrations [puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) [09:14:18] 10Operations, 10observability, 10Patch-For-Review, 10User-CDanis: Upgrade Grafana to 6.7 - https://phabricator.wikimedia.org/T244208 (10fgiunchedi) Production upgrade is successful, next up is WMCS. cc #cloud-services-team [09:14:45] 10Operations, 10observability, 10Patch-For-Review, 10User-CDanis, 10cloud-services-team (Kanban): Upgrade Grafana to 6.7 - https://phabricator.wikimedia.org/T244208 (10fgiunchedi) [09:16:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [09:17:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1075 after schema change', diff saved to https://phabricator.wikimedia.org/P10939 and previous config saved to /var/cache/conftool/dbconfig/20200408-091728-marostegui.json [09:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:24] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Ladsgroup) hmm, my problem is that the command to create the table is like this: `lang=sql CREATE TABLE IF NOT EXISTS /*_*/wb_items_per_site ( ips_row_id BIGINT unsi... 
[09:19:29] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21769/contint2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [09:20:07] (03PS5) 10Dzahn: ci: allow rsyncing of data dirs for server migrations [puppet] - 10https://gerrit.wikimedia.org/r/587491 (https://phabricator.wikimedia.org/T224591) [09:20:14] !log upgrade grafana on cloudmetrics hosts - T244208 [09:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:20] T244208: Upgrade Grafana to 6.7 - https://phabricator.wikimedia.org/T244208 [09:20:59] 10Operations, 10observability, 10Patch-For-Review, 10User-CDanis, 10cloud-services-team (Kanban): Upgrade Grafana to 6.7 - https://phabricator.wikimedia.org/T244208 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi All done! I have documented the procedure here: https://wikitech.wikimedia.org/wiki/Gr... [09:23:42] (03CR) 10Jbond: "looks good but i have generated real secrets" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [09:29:26] (03PS1) 10Jcrespo: Revert "restore: Add s8 instance to db1095" [puppet] - 10https://gerrit.wikimedia.org/r/587497 [09:30:12] (03CR) 10jerkins-bot: [V: 04-1] Revert "restore: Add s8 instance to db1095" [puppet] - 10https://gerrit.wikimedia.org/r/587497 (owner: 10Jcrespo) [09:30:25] !log stopping and removing db1095:s8 instance [09:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:04] (03PS2) 10Jcrespo: Revert "restore: Add s8 instance to db1095" [puppet] - 10https://gerrit.wikimedia.org/r/587497 [09:31:37] (03CR) 10Muehlenhoff: Setup idp-test2001 as IDP staging host (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [09:33:00] (03CR) 10Jcrespo: [C: 03+2] Revert "restore: Add s8 instance to 
db1095" [puppet] - 10https://gerrit.wikimedia.org/r/587497 (owner: 10Jcrespo) [09:34:54] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) >>! In T224591#6024260, @hashar wrote:... [09:37:03] (03CR) 10Jbond: Setup idp-test2001 as IDP staging host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [09:43:03] (03CR) 10Ayounsi: [C: 03+2] Logstash: parse Juniper PFE firewall syslog [puppet] - 10https://gerrit.wikimedia.org/r/587265 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [09:43:26] (03PS1) 10Jbond: idp: only manage the keystore if ssl is enabled [puppet] - 10https://gerrit.wikimedia.org/r/587498 (https://phabricator.wikimedia.org/T233930) [09:43:49] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/587498 (https://phabricator.wikimedia.org/T233930) (owner: 10Jbond) [09:48:13] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:56] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Tgr) This is the wrong direction to attack the problem from, IMO. ALTER is useful but also dangerous; there is no way to separate functionality into non-overlapping "safe" and "not ne... 
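(Annotation: the "PHP opcache health" alerts in this log fire when the cache-hit ratio drops below 99.85%. A minimal sketch of that comparison follows, assuming the ratio is hits / (hits + misses) expressed as a percentage. The function names and the treatment of an idle cache are hypothetical; this is not the actual Icinga check.)

```python
# Threshold taken from the alert text in the log ("below 99.85%").
HIT_RATIO_THRESHOLD = 99.85


def hit_ratio(hits: int, misses: int) -> float:
    """Return the cache-hit ratio as a percentage.

    Assumption: the ratio is hits / (hits + misses). An idle cache
    (no lookups at all) is reported as 100.0 here, which is a choice
    made for this sketch only.
    """
    total = hits + misses
    if total == 0:
        return 100.0
    return 100.0 * hits / total


def opcache_is_healthy(hits: int, misses: int,
                       threshold: float = HIT_RATIO_THRESHOLD) -> bool:
    """True when the hit ratio is at or above the alert threshold."""
    return hit_ratio(hits, misses) >= threshold
```

Under these assumptions, 9990 hits and 10 misses would count as healthy, while 9900 hits and 100 misses would fall below the threshold.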
[10:03:12] (03CR) 10Hnowlan: [C: 03+1] Changeprop: Listen to mediawiki.page-suppress topic [deployment-charts] - 10https://gerrit.wikimedia.org/r/584672 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [10:05:00] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) I can see that ALTER might be useful sometimes and might require a longer discussion and some other deep changes (more different roles etc), I don't think we should be kee... [10:10:00] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10jcrespo) > I can't add the indexes as part of creating the table Why not? This is literally the definition live on the DB: ` CREATE TABLE `wb_items_per_site` ( `ips_row_id` bigint(... [10:14:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P10940 and previous config saved to /var/cache/conftool/dbconfig/20200408-101431-marostegui.json [10:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:48] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10jcrespo) > This is the wrong direction to attack the problem from Removing wikiadmin grants, or stopping using that account and using 3 or 4 others without those grants are just a se... [10:14:51] !log Deploy schema change on db1078 [10:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:38] 10Operations, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10ayounsi) [10:29:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/587498 (https://phabricator.wikimedia.org/T233930) (owner: 10Jbond) [10:30:55] 10Operations: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10ayounsi) [10:33:13] (03CR) 10Jbond: [C: 03+2] idp: only manage the keystore if ssl is enabled [puppet] - 10https://gerrit.wikimedia.org/r/587498 (https://phabricator.wikimedia.org/T233930) (owner: 10Jbond) [10:34:11] (03PS1) 10Muehlenhoff: Create a repository component component/wmf-sre-laptop [puppet] - 10https://gerrit.wikimedia.org/r/587500 [10:37:48] (03PS1) 10Ema: purged: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/587501 (https://phabricator.wikimedia.org/T249583) [10:37:50] (03PS1) 10Ema: cache: add support for purged to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/587502 (https://phabricator.wikimedia.org/T249583) [10:37:52] (03PS1) 10Ema: cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) [10:39:47] !log restarting idp.wikimedia.org [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:07] (03PS2) 10Ema: cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) [10:42:35] PROBLEM - PHP opcache health on mw2352 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:43:59] (03PS2) 10Cparle: Enable WikibaseQualityConstraints on test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585766 (https://phabricator.wikimedia.org/T248117) [10:45:33] (03CR) 10Ema: "pcc output: https://puppet-compiler.wmflabs.org/compiler1002/21771/" [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [10:47:46] (03PS2) 10Ema: Revert "ATS: Disable wmf-analytics log" [puppet] - 
10https://gerrit.wikimedia.org/r/587422 (https://phabricator.wikimedia.org/T249335) [10:49:36] 10Operations, 10observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10ayounsi) 05Resolved→03Open Re-opening this task as iptables on kafkamon2001 is again discarded packets from prometheus2003 and prometheus2004 to port 9700/tcp since January 9th [10:50:22] @seen hashar [10:50:22] mutante: Last time I saw hashar they were quitting the network with reason: Quit: I am a virus. Please copy paste me in your /quit message to help me propagate N/A at 4/7/2020 10:56:50 PM (11h53m32s ago) [10:57:19] (03PS3) 10Muehlenhoff: Setup idp-test2001 as IDP staging host [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) [10:59:27] cormacparle__, tgr: I have a ca. 20-30min meeting starting just now – would it be okay if the SWAT starts with the GrowthExperiments changes and enabling WikibaseQualityConstraints comes afterwards? then I’ll have more capacity to look after the constraints deployment [10:59:39] sure [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200408T1100). [11:00:04] cormacparle, tgr, and Nikerabbit: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:46] o/ [11:01:18] tgr: Nikerabbit: Guess you'll self-service? 
:-) [11:01:47] Urbanecm: let me find the checklist in that case :D [11:02:02] and need to know what is the order [11:02:16] yeah [11:02:30] a config change for cormacparle__ is first on the calendar but I’d like to do that later (currently in a meeting but would like to supervise that change) [11:02:38] I can deploy the other patches, too [11:02:39] RECOVERY - PHP opcache health on mw2352 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:02:43] Nikerabbit: it's beta cluster only change, right? In that case, you need to just +2 and git pull at deploy1001 [11:02:52] the other patch, then [11:03:19] either way sounds good [11:03:54] tgr: you mean you can deploy my patch too? that'd be fine with me, but Lucas wants to wait until after his meetings so he can monitor things [11:04:11] Nikerabbit's patch just needs to be merged, anyway [11:04:42] I wasn't sure how it is these days, before patches got reverted if they were not synced to prod [11:04:45] cormacparle__: I can if Lucas gets back before I finish with the rest [11:04:56] grand [11:05:00] (03PS2) 10Gergő Tisza: Enable MassMessage logging on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587251 (https://phabricator.wikimedia.org/T165128) (owner: 10Nikerabbit) [11:05:14] Nikerabbit: merge, rebase, doesn't need to be synced [11:05:24] good to know [11:05:26] rebase the directory on tin, I mean [11:05:29] !log push urpf log only to codfw - T244147 [11:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:44] or whatever the deploy server is called these days [11:06:28] (03CR) 10Gergő Tisza: [C: 03+2] Enable MassMessage logging on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587251 (https://phabricator.wikimedia.org/T165128) (owner: 10Nikerabbit) [11:07:23] (03Merged) 10jenkins-bot: Enable MassMessage logging on Beta Cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/587251 (https://phabricator.wikimedia.org/T165128) (owner: 10Nikerabbit) [11:08:02] (03CR) 10Vgutierrez: [C: 03+1] cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [11:10:57] Nikerabbit: wrt the beta logging problem, why not just create a $wmgBetaMonologChannels and merge it manually in CommonSettings-labs.php? [11:11:08] anyway should be live about nowish [11:11:33] tgr: that's being explored in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/586353 [11:12:11] oh cool [11:12:51] (03PS9) 10Gergő Tisza: Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) [11:16:26] (03CR) 10Gergő Tisza: [C: 03+2] Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) (owner: 10Gergő Tisza) [11:17:24] (03Merged) 10jenkins-bot: Deploy GrowthExperiments on Serbian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) (owner: 10Gergő Tisza) [11:17:41] o/ [11:17:43] meeting ove [11:17:45] *over [11:18:01] great [11:23:01] (03CR) 10Dzahn: ""sudo tcpdump port 80" on phab1001 shows there is no traffic, at least not for regular operation from the caches. the deployment_servers w" [puppet] - 10https://gerrit.wikimedia.org/r/569100 (owner: 10Dzahn) [11:26:26] tgr: you’re still deploying right? [11:27:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline, but feel free to ignore." 
(031 comment) [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/587252 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [11:27:26] Lucas_WMDE: yeah, I'll take a while [11:27:40] I can do your patch next [11:27:47] (03CR) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints on test commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585766 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:28:01] ok no problem [11:28:06] just making sure :) [11:28:24] !log tgr@deploy1001 Synchronized dblists/: SWAT: [[gerrit:584133|Deploy GrowthExperiments on Serbian Wikipedia (T241181)]] (duration: 01m 17s) [11:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:30] T241181: Deploy GrowthExperiments on Serbian Wikipedia - https://phabricator.wikimedia.org/T241181 [11:29:53] !log tgr@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:584133|Deploy GrowthExperiments on Serbian Wikipedia (T241181)]] (duration: 01m 06s) [11:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:06] Lucas_WMDE: is that patch good to go? [11:32:03] not sure, it looks like cormacparle__ snuck in a second fix in PS2 ^^ [11:32:15] that needs to be there so we can test this [11:32:28] just realised earlier when I was fiddling about [11:32:41] without it you can't choose a property with constraints on test-commons [11:32:45] oh, right [11:32:47] did we have a group0 train yesterday? 
[11:32:57] I was only thinking about items, where it’s not so much of a problem [11:33:10] (you can enter the item ID and since it’s likely to exist on Wikidata as well, the search still returns something) [11:33:19] (but testwikidata *property* IDs are higher than on real Wikidata) [11:33:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable WikibaseQualityConstraints on test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585766 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:33:46] (03PS3) 10Lucas Werkmeister (WMDE): Enable WikibaseQualityConstraints on test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585766 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:33:54] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585766 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:33:58] then I think we’re good to go [11:34:21] tgr: I can also do the actual deployment [11:34:51] I can do it, I already have the consoles set up [11:34:53] (03Merged) 10jenkins-bot: Enable WikibaseQualityConstraints on test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585766 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:35:19] I have them as well ^^ [11:35:21] but ok [11:35:51] it's on mwdebug1002 [11:35:58] cool, checking [11:36:12] (03PS7) 10Dzahn: phabricator: remove firewall holes for port 80 from caches [puppet] - 10https://gerrit.wikimedia.org/r/569100 [11:36:14] btw the CORS issue is fixed in master [11:36:24] great [11:36:35] not sure if that got deployed yesterday, though, what with the outage [11:36:40] Hello, are GrowthExperiments enabled on srwiki? 
[11:37:00] Zoranzoki21: yeah, as of a couple minutes ago [11:37:17] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:37:35] PROBLEM - logstash syslog TCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:37:42] is mwdebug1002 okay to use again? [11:37:42] (03CR) 10Dzahn: [C: 03+1] phabricator: remove firewall holes for port 80 from caches [puppet] - 10https://gerrit.wikimedia.org/r/569100 (owner: 10Dzahn) [11:37:47] I think I’ve been using -1001 recently [11:37:48] tgr: Cool, how I can access to it? [11:38:56] Lucas_WMDE: it had a motd saying it should not be used, and it's gone, so I assume so [11:39:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078 after schema change', diff saved to https://phabricator.wikimedia.org/P10941 and previous config saved to /var/cache/conftool/dbconfig/20200408-113901-marostegui.json [11:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:10] Lucas_WMDE: I'm not seeing that error anymore on a `wbcheckconstraints`call [11:39:49] I’m seeing a correct violation on https://test-commons.wikimedia.org/wiki/File:Sable_antelope_skeleton_at_MAV-USP_edited.jpg [11:40:06] so it seems to be working :) [11:40:15] hooray! 
[11:40:55] tgr: okay to deploy, I think [11:41:15] tgr: When I go on some special page I get error that special page isn't found [11:41:57] PROBLEM - logstash syslog TCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:42:25] PROBLEM - logstash JSON linesTCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:42:45] PROBLEM - logstash JSON linesTCP port on logstash1008 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:42:50] BTW... On mwdebug1002 everything is ok with GrowthExperiments on srwiki.. When I disable turn off extension I get error that special page isn't found [11:43:00] *When I turn off extension [11:43:06] ah [11:43:09] that might be because… one second [11:43:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 for schema change', diff saved to https://phabricator.wikimedia.org/P10942 and previous config saved to /var/cache/conftool/dbconfig/20200408-114315-marostegui.json [11:43:17] PROBLEM - logstash syslog TCP port on logstash2005 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:28] T236104 [11:43:30] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 [11:43:31] !log Deploy schema change on db1112, this will generate lag on labs s3 [11:43:32] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:585766|Enable WikibaseQualityConstraints on test commons (T248117)]] (duration: 01m 05s) [11:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:40] T248117: Deploy WikibaseQualityConstraints to commons - https://phabricator.wikimedia.org/T248117 [11:44:07] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:44:36] I guess that scap just now should fix the GrowthExperiments issue, acting as the second sync for that deploy [11:44:47] but we probably still need a second sync for the constraints change itself? [11:45:13] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1009.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1009.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:45:19] I'm going to do a bunch of other deploys so it will catch up eventually [11:45:23] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:45:33] PROBLEM - logstash JSON linesTCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:45:39] Lucas_WMDE: All is ok now with GrowthExperiments, thanks! [11:45:44] ok [11:46:07] (03PS9) 10Gergő Tisza: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) [11:46:57] godog: do you know about the logstash issues? 
[11:48:59] !log logstash1009 - restarted logstash [11:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:42] (03CR) 10Gergő Tisza: [C: 03+2] Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [11:51:20] mutante: er, I broke it [11:51:38] (03Merged) 10jenkins-bot: Enable GrowthExperiments on French Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [11:51:54] with https://gerrit.wikimedia.org/r/c/operations/puppet/+/587265 [11:52:09] the restart went fine on the canary host I used [11:52:21] but looks like it broke under the hood [11:52:39] "[2020-04-08T11:50:58,897][ERROR][logstash.agent ] Cannot create pipeline {:reason=>"Expected one of #, \", ', -, [, { at line 798, column 13 (byte 26892) after filter" [11:52:55] PROBLEM - logstash syslog TCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:54:25] PROBLEM - logstash JSON linesTCP port on logstash2006 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:54:45] (03PS1) 10Ayounsi: Revert "Logstash: parse Juniper PFE firewall syslog" [puppet] - 10https://gerrit.wikimedia.org/r/587509 [11:54:57] (reverting) [11:55:47] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={1,2,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster= [11:55:47] -topic=All&var-consumer_group=All [11:55:52] 
PROBLEM - LVS HTTP IPv4 #page on logstash.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.36 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:56:14] XioNoX: related right? [11:56:19] yeah... [11:56:20] !log tgr@deploy1001 Synchronized dblists/: SWAT: [[gerrit:584183|Enable GrowthExperiments on French Wiktionary (T235964)]] (duration: 01m 03s) [11:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] T235964: Get the Growth experiment for the French Wiktionary - https://phabricator.wikimedia.org/T235964 [11:56:26] waiting for CI [11:56:31] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:56:39] in https://gerrit.wikimedia.org/r/c/operations/puppet/+/587509 [11:56:48] <_joe_> tgr: please stop deployments if possible [11:57:01] <_joe_> XioNoX: just self-merge it [11:57:07] I see we got it already under control [11:57:10] _joe_: stopped [11:57:11] <_joe_> do not wait for CI for an alert [11:57:11] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Revert "Logstash: parse Juniper PFE firewall syslog" [puppet] - 10https://gerrit.wikimedia.org/r/587509 (owner: 10Ayounsi) [11:57:48] <_joe_> tgr: thanks :) [11:58:08] doing [11:58:21] XioNoX: LMK if I can help [11:58:23] <_joe_> tgr: you're without logstash right now so it's better to wait for this to be fixed [11:58:28] running puppet on logstash hosts [11:58:34] XioNoX: oh, gotcha. 
ok [11:58:37] should be back in a couple min [11:59:01] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [11:59:30] RECOVERY - LVS HTTP IPv4 #page on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on 10.2.2.36 port 10514 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:59:31] RECOVERY - logstash syslog TCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [11:59:35] alright [11:59:43] RECOVERY - logstash syslog TCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [11:59:55] RECOVERY - logstash JSON linesTCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:11] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:11] RECOVERY - logstash syslog TCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:17] RECOVERY - logstash syslog TCP port on logstash2006 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:22] TIL, even with a broken config, logstash restart doesn't fail, and doesn't show any error in `service logstash status` [12:00:27] <_joe_> tgr: I guess you can resume now :) [12:00:35] RECOVERY - logstash JSON linesTCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:35] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 
https://wikitech.wikimedia.org/wiki/Logstash [12:00:41] RECOVERY - logstash JSON linesTCP port on logstash2005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:51] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [12:00:57] thanks! [12:00:59] RECOVERY - logstash JSON linesTCP port on logstash1008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [12:01:03] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [12:01:20] <_joe_> tgr: maybe wait a few mins so that the queue of messages to logstash is reduced [12:01:35] (03PS10) 10Gergő Tisza: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) [12:03:29] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:04:13] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:04:46] 10Operations, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) Port 3050 is the restbase ratelimiter and has been opened in ferm for TCP only. I was concerned this might be an issue related to restbase2022 being a newly added host but this is... 
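(Annotation: the logstash alerts above report whether a TCP connection to 127.0.0.1 succeeds on port 10514 (syslog) and port 11514 (JSON lines). Since the restart itself gave no sign of the broken config, a port probe of this kind is one way to notice that the pipeline never came up. A minimal sketch follows; the function name and the timeout value are hypothetical, and this is not the actual monitoring check.)

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened.

    A refused connection, an unreachable host, or a timeout all count
    as "not open"; they surface as OSError subclasses.
    """
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the two inputs named in the alerts, one could call `tcp_port_open("127.0.0.1", 10514)` and `tcp_port_open("127.0.0.1", 11514)` on a logstash host shortly after a restart and treat a `False` result as a failed rollout.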
[12:05:03] 10Operations, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) a:03hnowlan [12:07:38] (03CR) 10Gergő Tisza: [C: 03+2] Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) (owner: 10Gergő Tisza) [12:08:33] (03Merged) 10jenkins-bot: Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584135 (https://phabricator.wikimedia.org/T238295) (owner: 10Gergő Tisza) [12:08:35] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [12:09:07] yeah queue is drained [12:09:24] !log tgr@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:584183|Enable GrowthExperiments on French Wiktionary (T235964)]] (duration: 01m 06s) [12:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:30] T235964: Get the Growth experiment for the French Wiktionary - https://phabricator.wikimedia.org/T235964 [12:13:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I think the format is a bit different. 
Reading on https://nodejs.org/api/cli.html#cli_node_extra_ca_certs_file I understand we need some w" [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [12:13:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] "To be clear, the current approach will create the ENV var, but with the content of the var being the cert, not a path to a file containing" [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [12:14:00] (03PS1) 10Ayounsi: Revert "Revert "Logstash: parse Juniper PFE firewall syslog"" [puppet] - 10https://gerrit.wikimedia.org/r/587513 [12:17:20] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:584135|Enable GrowthExperiments welcome survey on Ukrainian, Hungarian, Armenian Wikipedias (T238295) (duration: 01m 08s) [12:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:25] T238295: Deploy Welcome Survey to Ukrainian, Hungarian, Armenian Wikipedias - https://phabricator.wikimedia.org/T238295 [12:18:39] (03PS2) 10Ayounsi: Logstash: parse Juniper PFE firewall syslog. 
Take 2 [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) [12:19:25] (03PS3) 10Gergő Tisza: Enable GrowthExperiments suggested edits on uk, hu, hy, eu wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585219 (https://phabricator.wikimedia.org/T247308) [12:21:35] (03CR) 10Gergő Tisza: [C: 03+2] Enable GrowthExperiments suggested edits on uk, hu, hy, eu wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585219 (https://phabricator.wikimedia.org/T247308) (owner: 10Gergő Tisza) [12:22:30] (03Merged) 10jenkins-bot: Enable GrowthExperiments suggested edits on uk, hu, hy, eu wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/585219 (https://phabricator.wikimedia.org/T247308) (owner: 10Gergő Tisza) [12:24:08] (03CR) 10jerkins-bot: [V: 04-1] Logstash: parse Juniper PFE firewall syslog. Take 2 [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [12:29:43] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:585219|Enable GrowthExperiments suggested edits on uk, hu, hy, eu wikipedias (T247308)]] (duration: 01m 08s) [12:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:49] T247308: Deploy suggested edits on uk, hu, hy, eu wikis - https://phabricator.wikimedia.org/T247308 [12:31:14] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: re-sync (duration: 01m 07s) [12:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:29] (03PS1) 10Jbond: apereo_cas: add more timeout values [puppet] - 10https://gerrit.wikimedia.org/r/587515 [12:33:47] (03CR) 10Ayounsi: "> Patch Set 2: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [12:34:12] done. 
[12:34:17] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [12:34:50] (03PS1) 10Vgutierrez: ATS: Increase to 2 the number of accept_threads in cp3052 and cp3056 [puppet] - 10https://gerrit.wikimedia.org/r/587516 (https://phabricator.wikimedia.org/T249335) [12:38:05] (03PS1) 10Dzahn: ci: add parameter and stop zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) [12:39:21] (03CR) 10Ayounsi: "The comma at the end of line 47 was the cause of this transient chaos." [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [12:39:57] (03CR) 10jerkins-bot: [V: 04-1] ATS: Increase to 2 the number of accept_threads in cp3052 and cp3056 [puppet] - 10https://gerrit.wikimedia.org/r/587516 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [12:40:40] 10Operations, 10Wikimedia-Logstash, 10observability: config file change canarying for logstash - https://phabricator.wikimedia.org/T221052 (10ayounsi) +1, this happened today. [12:42:29] (03PS2) 10Vgutierrez: ATS: Increase to 2 the number of accept_threads in cp3052 and cp3056 [puppet] - 10https://gerrit.wikimedia.org/r/587516 (https://phabricator.wikimedia.org/T249335) [12:44:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thanks for debugging the configuration! Take my +1 with a grain of salt since the previous config was broken :)" [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [12:44:59] * godog being bold [12:45:55] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories update lag alert - https://phabricator.wikimedia.org/T246497 (10Gehel) [12:46:23] (03CR) 10Dzahn: "Per Hashar, before we can reimage contint2001 we must first ensure zuul-merger service is stopped. 
Making that possible in puppet and stop" [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [12:46:48] (03CR) 10jerkins-bot: [V: 04-1] ATS: Increase to 2 the number of accept_threads in cp3052 and cp3056 [puppet] - 10https://gerrit.wikimedia.org/r/587516 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [12:46:53] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS Categories update lag alert - https://phabricator.wikimedia.org/T246497 (10Gehel) We are fixing an issue in blazegraph that failed categories update in the next deployment. This should be fixed after the deployment... [12:51:08] (03PS2) 10Dzahn: ci: add parameter and stop zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) [12:51:35] (03CR) 10Vgutierrez: [C: 03+1] Revert "ATS: Disable wmf-analytics log" [puppet] - 10https://gerrit.wikimedia.org/r/587422 (https://phabricator.wikimedia.org/T249335) (owner: 10Ema) [12:53:15] (03CR) 10Ema: [C: 03+2] Revert "ATS: Disable wmf-analytics log" [puppet] - 10https://gerrit.wikimedia.org/r/587422 (https://phabricator.wikimedia.org/T249335) (owner: 10Ema) [12:57:41] (03PS1) 10JMeybohm: docker::baseimages: Fix pinning of wikimedia deb repository [puppet] - 10https://gerrit.wikimedia.org/r/587518 (https://phabricator.wikimedia.org/T249218) [13:02:02] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks Moritz!" 
[debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/587252 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:02:08] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] debian: first commit [debs/thanos] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/587252 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:03:10] (03PS3) 10Dzahn: ci: add parameter and stop zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) [13:03:12] (03CR) 10Ema: [C: 03+1] ATS: Enable inbound TLSv1.3 in text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/587423 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [13:03:55] PROBLEM - PHP opcache health on mw2362 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:05:42] !log purged 0.1 uploaded to buster-wikimedia T249583 [13:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:48] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [13:07:07] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [13:08:37] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/21776/contint1001.wikimedia.org/change.contint1001.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:09:17] (03PS2) 10Ema: purged: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/587501 (https://phabricator.wikimedia.org/T249583) [13:09:30] (03PS2) 10Ema: cache: add support for purged to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/587502 (https://phabricator.wikimedia.org/T249583) [13:10:58] (03PS3) 10Ema: purged: add puppet module [puppet] - 
10https://gerrit.wikimedia.org/r/587501 (https://phabricator.wikimedia.org/T249583) [13:11:00] (03PS3) 10Ema: cache: add support for purged to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/587502 (https://phabricator.wikimedia.org/T249583) [13:11:02] (03PS3) 10Ema: cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) [13:14:55] (03PS4) 10Ema: purged: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/587501 (https://phabricator.wikimedia.org/T249583) [13:14:57] (03PS4) 10Ema: cache: add support for purged to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/587502 (https://phabricator.wikimedia.org/T249583) [13:14:59] (03PS4) 10Ema: cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) [13:15:35] (03CR) 10Ottomata: [C: 03+1] "Lemme know when you want to deploy this and we can do so!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [13:15:50] (03PS4) 10Dzahn: ci: add parameter and stop zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) [13:16:37] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable inbound TLSv1.3 in text@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/587423 (https://phabricator.wikimedia.org/T170567) (owner: 10Vgutierrez) [13:17:53] (03CR) 10Dzahn: [C: 03+1] "see it getting stopped in codfw and keep running in eqiad: https://puppet-compiler.wmflabs.org/compiler1003/21779/" [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:19:03] RECOVERY - PHP opcache health on mw2362 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:19:28] (03PS5) 10Ema: cache: add support for purged to profile::cache::base 
[puppet] - 10https://gerrit.wikimedia.org/r/587502 (https://phabricator.wikimedia.org/T249583) [13:19:30] (03PS5) 10Ema: cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) [13:22:13] !log enable inbound TLSv1.3 in text@ulsfo - T170567 [13:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:20] T170567: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 [13:25:54] (03CR) 10Ema: [C: 03+2] purged: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/587501 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:26:06] (03CR) 10Ema: [C: 03+2] cache: add support for purged to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/587502 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:26:18] (03CR) 10Ema: [C: 03+2] cache: test purged on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/587503 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:26:29] (03PS5) 10Dzahn: ci: add parameter to stop or mask zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) [13:26:38] (03CR) 10Dzahn: [C: 03+1] "adding another option to also mask it in addition to just stopping it. that's what you wanted, hashar?" 
[puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:27:59] (03PS6) 10Dzahn: ci: add parameter to stop or mask zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) [13:29:33] (03CR) 10Dzahn: "stopped and masked: https://puppet-compiler.wmflabs.org/compiler1002/21783/contint2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:30:04] !log cp3050: stop vhtcpd, start purged T249583 [13:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:10] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [13:30:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM +1. PCC says ok as well at https://puppet-compiler.wmflabs.org/compiler1002/21782/" [puppet] - 10https://gerrit.wikimedia.org/r/587518 (https://phabricator.wikimedia.org/T249218) (owner: 10JMeybohm) [13:31:16] 10Operations, 10OpenRefine, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10Anomie) There was no change to MediaWiki with respect to output of Set-Cookie headers. For... [13:32:00] (03CR) 10Filippo Giunchedi: "The hardware meant to be running Thanos query is being ordered, but we can start with a pair of VMs too in the meantime." [puppet] - 10https://gerrit.wikimedia.org/r/586314 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:32:19] (03CR) 10Hashar: [C: 03+1] "Ah excellent thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:32:35] PROBLEM - Varnish HTCP daemon on cp3050 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish [13:33:43] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp3050 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd Ema known, testing purged https://wikitech.wikimedia.org/wiki/Varnish [13:35:43] (03CR) 10Dzahn: [C: 03+2] ci: add parameter to stop or mask zuul-merger service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/587517 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [13:39:03] C [13:40:26] !log stopped and masked zuul-merger service on contint2001 via puppet (T224591) [13:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:32] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [13:43:47] (03PS1) 10Hashar: WIP WIP WIP Switch CI to contint2001 WIP WIP WIP [puppet] - 10https://gerrit.wikimedia.org/r/587521 [13:44:00] (03CR) 10Hashar: [C: 04-1] WIP WIP WIP Switch CI to contint2001 WIP WIP WIP [puppet] - 10https://gerrit.wikimedia.org/r/587521 (owner: 10Hashar) [13:45:16] 10Operations, 10OpenRefine, 10Traffic, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10Pintoch) Thank you very much for the investigation. We have patched Wikidata-Toolkit, rel... [13:45:47] 10Operations, 10observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10elukey) There is nothing listening on port 9700 on kafkamon2001: ` prometh+ 445 0.8 1.1 378020 47544 ? Ssl 2019 3393:13 /usr/bin/prometheus-burrow-exporter --burrow-addr http://local... 
[13:46:39] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [13:49:19] 10Operations, 10OpenRefine, 10Traffic, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10CDanis) This seems likely related to us switching from using nginx as our... [13:50:09] (03PS10) 10Hashar: zuul: provision the scap repository [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) [13:50:35] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [13:50:57] 10Operations, 10observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10elukey) @fgiunchedi the following bit is probably the culprit: ` prometheus::class_config{ "burrow_jumbo_${::site}": dest => "${targets_path}/burrow_jumbo_${::site}.yaml",... [13:51:35] (03PS1) 10Andrew Bogott: OpenStack designate: upgrade eqiad1 to version Rocky [puppet] - 10https://gerrit.wikimedia.org/r/587522 (https://phabricator.wikimedia.org/T248635) [13:52:40] (03CR) 10JMeybohm: [C: 03+2] docker::baseimages: Fix pinning of wikimedia deb repository [puppet] - 10https://gerrit.wikimedia.org/r/587518 (https://phabricator.wikimedia.org/T249218) (owner: 10JMeybohm) [13:53:35] 10Operations, 10Traffic, 10Patch-For-Review: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 (10ema) Deployed on cp3050. Compared to vhtcpd, CPU usage is ~ 3x - 4x which I'd say is good enough for now. {F31743497} Here's the 10s CPU profile with production traffic which might suggest... 
[13:57:33] (03PS11) 10Hashar: zuul: provision the scap repository [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) [13:59:44] (03CR) 10Hashar: "I made a quick fix in the zuul erb template to prevent the introduction of an extra newline." [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [14:07:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [14:10:50] !log jeh@deploy1001 Started deploy [horizon/deploy@0d18f67]: update horizon submodule to enable server groups [14:10:53] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/compiler1001/382/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [14:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:21] (03PS1) 10Ema: prometheus: job definition for purged [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) [14:14:19] !log jeh@deploy1001 Finished deploy [horizon/deploy@0d18f67]: update horizon submodule to enable server groups (duration: 03m 30s) [14:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:01] (03PS1) 10Ayounsi: uRPF enable globally as log only [homer/public] - 10https://gerrit.wikimedia.org/r/587526 (https://phabricator.wikimedia.org/T244147) [14:23:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1112 after schema change', diff saved to https://phabricator.wikimedia.org/P10947 and previous config saved to /var/cache/conftool/dbconfig/20200408-142341-marostegui.json [14:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:30] (03PS2) 10Ema: prometheus: job definition for purged [puppet] - 10https://gerrit.wikimedia.org/r/587525 
(https://phabricator.wikimedia.org/T249583) [14:27:11] 10Operations, 10serviceops, 10Epic: Trask and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10akosiaris) [14:27:22] 10Operations, 10serviceops, 10Epic: Trask and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10akosiaris) p:05Triage→03Medium [14:30:38] (03PS1) 10Alexandros Kosiaris: Remove jessie base images building process [puppet] - 10https://gerrit.wikimedia.org/r/587529 (https://phabricator.wikimedia.org/T249724) [14:32:12] (03CR) 10jerkins-bot: [V: 04-1] prometheus: job definition for purged [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:32:24] (03CR) 10CDanis: [C: 03+1] modules: add thanos-sidecar define and profile [puppet] - 10https://gerrit.wikimedia.org/r/586312 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:32:38] (03PS3) 10Ema: prometheus: job definition for purged [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) [14:32:51] (03CR) 10CDanis: [C: 03+1] prometheus: add thanos-sidecar to prometheus@ops [puppet] - 10https://gerrit.wikimedia.org/r/586313 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:34:07] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: job definition for purged [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:34:39] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:36:32] (03CR) 10CDanis: [C: 03+1] Add Thanos query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/586314 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:37:24] (03CR) 10CDanis: [C: 03+1] prometheus: scrape thanos sidecar/query metrics [puppet] - 
10https://gerrit.wikimedia.org/r/586315 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:38:11] (03PS1) 10Alexandros Kosiaris: profile::docker::builder: Add buster, drop jessie [puppet] - 10https://gerrit.wikimedia.org/r/587530 (https://phabricator.wikimedia.org/T249724) [14:38:13] (03CR) 10jerkins-bot: [V: 04-1] prometheus: job definition for purged [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:38:36] (03PS1) 10Ssingh: cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) [14:39:56] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/587530 (https://phabricator.wikimedia.org/T249724) (owner: 10Alexandros Kosiaris) [14:40:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::docker::builder: Add buster, drop jessie [puppet] - 10https://gerrit.wikimedia.org/r/587530 (https://phabricator.wikimedia.org/T249724) (owner: 10Alexandros Kosiaris) [14:40:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21786/ says ok" [puppet] - 10https://gerrit.wikimedia.org/r/587530 (https://phabricator.wikimedia.org/T249724) (owner: 10Alexandros Kosiaris) [14:41:04] (03CR) 10Hashar: [C: 03+1] "Cool yes that makes sense. +1" [puppet] - 10https://gerrit.wikimedia.org/r/586203 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [14:43:24] (03CR) 10jerkins-bot: [V: 04-1] cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:43:27] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Anomie) > Anything else? Does CREATE cover CREATE TEMPORARY TABLES? If not, we should most likely include that one too. There's also INDEX, see below. 
While we don't currently use... [14:45:10] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21788/" [puppet] - 10https://gerrit.wikimedia.org/r/586203 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [14:45:34] (03PS2) 10Ssingh: cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) [14:46:59] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:49:09] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for tchanders, dmaza, dbarratt, wikigit - https://phabricator.wikimedia.org/T249059 (10Nuria) Per @marostegui 's last comment then no usage of analytics replicas should be needed as... [14:49:20] (03Abandoned) 10Dzahn: contint: use package_from_component, stop using docker class [puppet] - 10https://gerrit.wikimedia.org/r/566383 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [14:51:06] (03CR) 10jerkins-bot: [V: 04-1] cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:51:17] (03PS1) 10Ema: purged: set concurrency to 'processorcount' [puppet] - 10https://gerrit.wikimedia.org/r/587533 (https://phabricator.wikimedia.org/T249583) [14:51:53] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10ops-monitoring-bot) Script wmf-auto-reimage wa... [14:52:08] 10:50:57 Build timed out (after 5 minutes). Marking the build as aborted. 
hmm [14:55:16] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [14:56:35] (03CR) 10jerkins-bot: [V: 04-1] purged: set concurrency to 'processorcount' [puppet] - 10https://gerrit.wikimedia.org/r/587533 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [14:57:31] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) >>! In T249683#6040077, @Anomie wrote: >> Anything else? > > Does CREATE cover CREATE TEMPORARY TABLES? If not, we should most likely include that one too. No, `CREATE`... [14:57:53] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Marostegui) [15:02:01] (03CR) 10Ema: [V: 03+2 C: 03+2] purged: set concurrency to 'processorcount' [puppet] - 10https://gerrit.wikimedia.org/r/587533 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [15:02:58] sukhe: why did you break CI?!? :P [15:03:21] ema: yeah! [15:03:28] I will let you know once I find out how I did it :D [15:04:21] hashar: hi! In case you aren't aware, ops/puppet CI jobs are failing with "Build timed out (after 5 minutes). Marking the build as aborted." [15:05:14] 10Operations, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) screen shot of redesign {F31743655} [15:07:03] (03CR) 10Hashar: "operations-puppet-tests-buster-docker ABORTED in 5m 07s" [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [15:07:41] 10Operations, 10docker-pkg, 10serviceops: Investigate why the apt configuration of the wikimedia-buster docker image doesn't seem to prefer wikimedia packages - https://phabricator.wikimedia.org/T249218 (10JMeybohm) 05Open→03Resolved That affected all debian base-images. All of them should be fixed now,... 
[15:08:07] hashar: the second build succeeded [15:09:51] (03PS1) 10Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) [15:11:42] !log cp3051: param.set shortlived=0 to try ease pressure on transient memory [15:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:56] (03PS2) 10Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) [15:13:58] (03PS1) 10Jbond: apereo_cas: import initial css to make reviewing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587541 [15:15:03] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/586315 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [15:16:12] (03PS2) 10Jbond: apereo_cas: import initial css to make reviewing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587541 [15:17:08] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: import initial css to make reviewing easier [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587541 (owner: 10Jbond) [15:18:10] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/586314 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [15:19:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [15:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:36] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/586313 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [15:20:29] (03PS3) 10Mvolz: Update spec.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/585726 [15:21:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:21:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:36] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/586312 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [15:24:04] (03PS3) 10Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) [15:24:24] 10Operations, 10OpenRefine, 10Traffic, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team): Clients failing API login due to dependence on "Set-Cookie" header name casing - https://phabricator.wikimedia.org/T249680 (10Pintoch) Sure, it's your call. I am sure you have more important things to... [15:26:03] PROBLEM - PHP opcache health on mw2357 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:37:23] RECOVERY - PHP opcache health on mw2357 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:43:01] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/587526 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [15:44:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/587429 (https://phabricator.wikimedia.org/T233930) (owner: 10Muehlenhoff) [15:49:37] (03CR) 10Ammarpad: [C: 03+1] "There's a typo in the commit summary. Otherwise looks OK." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/587301 (https://phabricator.wikimedia.org/T249643) (owner: 10Huji) [15:50:49] PROBLEM - PHP opcache health on mw2353 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:56:55] (03CR) 10BryanDavis: [C: 03+1] raid: add lsscsi to required packages for hpsa raid [puppet] - 10https://gerrit.wikimedia.org/r/587370 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:57:48] (03CR) 10CDanis: [C: 03+1] "I forget, was this change blocked on anything else?" [homer/public] - 10https://gerrit.wikimedia.org/r/577316 (https://phabricator.wikimedia.org/T246618) (owner: 10Ayounsi) [16:00:43] (03PS3) 10Ssingh: cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) [16:06:03] (03PS1) 10Ema: cache: double upload transient storage limit [puppet] - 10https://gerrit.wikimedia.org/r/587551 (https://phabricator.wikimedia.org/T185968) [16:06:33] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10elukey) There seems to be a problem with PXE installs, today I tried to live-hack a partman recipe from install1003/2003 but it didn't work, and after a while I found the following in the squid lo... 
[16:07:18] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/21793/cescout1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [16:08:14] (03CR) 10Dzahn: [C: 03+1] cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [16:09:59] (03CR) 10BBlack: [C: 03+1] cache: double upload transient storage limit [puppet] - 10https://gerrit.wikimedia.org/r/587551 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema) [16:10:06] (03CR) 10Ema: [C: 03+2] cache: double upload transient storage limit [puppet] - 10https://gerrit.wikimedia.org/r/587551 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema) [16:13:34] 10Operations: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10akosiaris) p:05Triage→03Medium I 've did a quick investigation and I have a pcap capture capture attached to this task (no PII in there, no worries, everything is encrypted. I also attach an image fo... [16:13:35] (03CR) 10Ssingh: [C: 03+2] cescout: add OONI's metadb sync scripts [puppet] - 10https://gerrit.wikimedia.org/r/587531 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [16:16:08] !log cache_upload: rolling varnish-fe restarts to bump transient storage limit T185968 [16:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:16] T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 [16:18:08] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/587525 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [16:20:42] 10Operations, 10DBA, 10Wikimedia-Incident: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 (10Reedy) >>! In T249683#6039249, @jcrespo wrote: >> I can't add the indexes as part of creating the table > Why not? 
This is literally the definition live on the DB: > > ` > CREATE TAB... [16:21:26] (03PS2) 10Andrew Bogott: OpenStack designate: upgrade eqiad1 to version Rocky [puppet] - 10https://gerrit.wikimedia.org/r/587522 (https://phabricator.wikimedia.org/T248635) [16:21:28] (03PS1) 10Andrew Bogott: Designate: fully puppetize the designate pool config yaml [puppet] - 10https://gerrit.wikimedia.org/r/587554 [16:23:32] (03PS1) 10Arturo Borrero Gonzalez: cloud: review direct references to cloudcontrol1003.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/587556 [16:24:27] (03PS1) 10Vgutierrez: Release 8.0.6-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/587557 (https://phabricator.wikimedia.org/T249335) [16:25:24] (03PS4) 10Jforrester: Update spec.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/585726 (owner: 10Mvolz) [16:25:30] (03CR) 10Jforrester: [C: 03+1] Update spec.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/585726 (owner: 10Mvolz) [16:26:11] (03CR) 10jerkins-bot: [V: 04-1] Designate: fully puppetize the designate pool config yaml [puppet] - 10https://gerrit.wikimedia.org/r/587554 (owner: 10Andrew Bogott) [16:28:04] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): Assess whether we should still disable seccomp in Docker for CI - https://phabricator.wikimedia.org/T249729 (10MoritzMuehlenhoff) [16:29:17] (03CR) 10Cwhite: [C: 03+2] raid: add lsscsi to required packages for hpsa raid [puppet] - 10https://gerrit.wikimedia.org/r/587370 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [16:29:31] RECOVERY - PHP opcache health on mw2353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:31:28] (03CR) 10BryanDavis: apereo_cas: update templates login page (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 
(https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [16:32:10] deployers: there seems to be a problem w/ flow on a group0 wiki: T247774 [16:32:11] T247774: 1.35.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T247774 [16:32:49] i'm investigating, could use help from anyone who knows the particular quirks of testwiki [16:33:03] (03PS3) 10Andrew Bogott: OpenStack designate: upgrade eqiad1 to version Rocky [puppet] - 10https://gerrit.wikimedia.org/r/587522 (https://phabricator.wikimedia.org/T248635) [16:34:47] (03CR) 10Mvolz: [C: 03+2] Update spec.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/585726 (owner: 10Mvolz) [16:35:13] (03Merged) 10jenkins-bot: Update spec.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/585726 (owner: 10Mvolz) [16:35:33] cscott: I'm looking too. [16:35:43] cscott: Most odd. [16:37:54] VE seems to work fine on test.wikipedia.org fwiw [16:37:58] Yeah. [16:41:19] Flow/includes/Conversion/Utils.php::parsoid (where the exception occurs) does a VRS query to /restbase [16:41:42] I did change the VRS defaults recently, I wonder if the setup on testwiki was relying on a default? [16:41:57] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for andrew-wmde - https://phabricator.wikimedia.org/T249733 (10Andrew-WMDE) [16:44:08] James_F: i'm thinking of https://gerrit.wikimedia.org/r/583430 [16:46:25] James_F: and the message does get 'parsoid' from $vrsInfo->getName() [16:46:57] cscott: That shipped last week, though. [16:47:13] Maybe RESTbase config changed? [16:47:17] but they said the pywikibot was broken last week [16:47:25] so it's possible we broke it last week and no one noticed [16:47:45] (which would mean it's not a train blocker for this week, i guess?) [16:47:59] Oh, hmm. [16:48:05] Definitely not a train blocker in that case. 
[16:48:19] _joe_ did insert envoy into the parsoid/restbase request path this week [16:48:25] But even "bad" bot requests shouldn't trigger an error like this, so we definitely need to fix it at our end. [16:48:56] James_F: yeah, i'd rather have a fix in hand before i definitely say the train is fine, and i don't quite have a smoking gun yet [16:49:00] just a theory [16:50:16] (03PS2) 10Hnowlan: Changeprop: add puppet CA cert to environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) [16:51:12] hm, flow also gets the VRS object via $serviceClient->getMountAndService( '/restbase/' ) while VE directly instantiates RestbaseVirtualRESTService::class [16:53:18] (03CR) 10Ema: [C: 03+1] Release 8.0.6-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/587557 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [16:53:49] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.6-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/587557 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [16:56:32] (03PS2) 10Andrew Bogott: Designate: fully puppetize the designate pool config yaml [puppet] - 10https://gerrit.wikimedia.org/r/587554 [16:59:15] (03PS1) 10Ssingh: cescout: update metadb's data directory [puppet] - 10https://gerrit.wikimedia.org/r/587559 [16:59:34] (03PS1) 10Elukey: autoinstall: fix kafka-jumbo.cfg for Buster [puppet] - 10https://gerrit.wikimedia.org/r/587560 (https://phabricator.wikimedia.org/T244506) [16:59:42] (03CR) 10Ppchelko: [C: 04-1] Changeprop: add puppet CA cert to environment variables (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [17:00:49] (03PS2) 10Elukey: autoinstall: fix kafka-jumbo.cfg for Buster [puppet] - 10https://gerrit.wikimedia.org/r/587560 (https://phabricator.wikimedia.org/T244506) [17:01:08] James_F: i wonder if it's because flow is still 
setting a 'prefix' instead of a 'domain' [17:02:15] oh also Flow is deliberately not using restbase. [17:03:05] (03PS3) 10Hnowlan: Changeprop: add puppet CA cert to environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) [17:04:13] <_joe_> cscott: what is the problem? [17:04:43] (03PS3) 10Andrew Bogott: Designate: fully puppetize the designate pool config yaml [puppet] - 10https://gerrit.wikimedia.org/r/587554 [17:05:03] (03PS3) 10Elukey: autoinstall: fix kafka-jumbo.cfg for Buster [puppet] - 10https://gerrit.wikimedia.org/r/587560 (https://phabricator.wikimedia.org/T244506) [17:05:23] _joe_: trying to diagnose https://phabricator.wikimedia.org/T249705 [17:05:55] <_joe_> cscott: testwiki goes through restbase, and nothing really changed there I would think [17:06:03] probably not envoy related, maybe somehow related to my https://phabricator.wikimedia.org/T249705 [17:06:15] _joe_: except flow doesn't go through restbase (sigh) [17:06:29] hashar: contint2001 is on buster and i am on it :) [17:06:46] <_joe_> cscott: oh. [17:06:50] the puppet error though ..which makes it fail on first run is deployment related [17:07:50] <_joe_> cscott: I see subbu says it's a pywikibot bug though [17:08:01] _joe_: https://github.com/wikimedia/mediawiki-extensions-Flow/blob/master/includes/Conversion/Utils.php#L307 [17:08:29] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/21800/cescout1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/587559 (owner: 10Ssingh) [17:08:32] <_joe_> it seems definitely related to the envoy switch [17:08:41] _joe_, i just clarified there .. this may not necessarily be related to pywikibot. [17:08:44] see my latest comment. 
[17:08:51] <_joe_> ok [17:09:10] (03CR) 10Andrew Bogott: [C: 03+2] Designate: fully puppetize the designate pool config yaml [puppet] - 10https://gerrit.wikimedia.org/r/587554 (owner: 10Andrew Bogott) [17:09:10] <_joe_> subbu: is flow broken on testwiki? [17:09:16] no. [17:09:20] <_joe_> it definitely isn't on officewiki [17:09:22] <_joe_> I tested it [17:09:28] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) contint2001 has been reimaged with buster using the wmf-auto-... [17:09:36] yes, we all tested on mediawiki, officewiki, testwiki ... all tests worked. [17:09:54] but, see my last comment on that phab task .. https://phabricator.wikimedia.org/T249705#6040712 [17:09:58] <_joe_> I can also see https://test.wikipedia.org/wiki/Topic:U5y53rn0dp6h70nw [17:10:40] subbu: is Parsoid giving the 503? or is it the mediawiki VRS which is failing to contact parsoid? [17:10:45] yes, i tested on the topics that had the error messages. [17:10:48] cscott, i dont know. [17:10:49] (the latter would be a 400-class error, wouldn't it?) [17:11:01] (03PS1) 10Andrew Bogott: Designate: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/587561 [17:11:55] parsoid/js would give a 503 for "MaxConcurrentCallsError", is there something similar in the PHP REST dispatch code? [17:13:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:13:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:13:10] 193 errors in a week across all wikis .. so, given the frequency, that feels like some "edge case" somewhere. ... the errors are in 2 clusters, Monday, and today starting about 7 hours back. Neither of them correspond to a train rollout. 
[17:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:15] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10ops-monitoring-bot) Icinga downtime for 1 day, 0:00:00 set by dzahn@... [17:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:25] <_joe_> subbu: yes it's the move to envoy clearly [17:13:25] (03CR) 10Andrew Bogott: [C: 03+2] Designate: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/587561 (owner: 10Andrew Bogott) [17:14:01] <_joe_> subbu: you have any idea what is the url that gets requested to parsoid there? [17:14:17] <_joe_> so that I can take a look at envoy's logs [17:14:21] _joe_: I can probably reconstruct it, hang on [17:14:22] one sec. [17:14:32] ok, cscott will be faster there. :) [17:14:50] <_joe_> you can also comment on the task, this doesn't seem like a high-priority bug given the frequency [17:15:42] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) `jenkins` ====== is managed via `reprepro` https://wikitech... [17:15:52] ya. [17:16:07] Flow issues an internal VRS request to /restbase/local/v1/transform/html/to/wikitext/Topic%3AU5y53rn0dp6h70nw [17:16:34] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) @Cmjohnson I powered up again 1008 and I don't see any DHCP ACK in syslog when PXE installing: ` Apr 8 17:14:09 install1003... 
[17:16:35] now i just have to work out what the external URL that corresponds to is [17:18:25] cscott, should be w/rest.php/test.wikimedia.org/v3/transform/html/to/wikitext/Topic..... ? [17:19:14] url would be $url/$domain/v3/transform/html/to/wikitext/Topic%3AU5y53rn0dp6h70nw [17:19:45] if only we had the user-agent string .. :-) we could have known if they are all pywikibot originated .. in which case if it is envoy related, it might be some header / setting that the bot uses? [17:19:59] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10MoritzMuehlenhoff) >>! In T224591#6040713, @Dzahn wrote: > contint20... [17:20:15] subbu: yeah, i hadn't confirmed the $url and $domain yet, but presumably $host/w/rest.php/test.wikimedia.org/v3/transform/html/to/wikitext/Topic%3AU5y53rn0dp6h70nw [17:21:02] (03CR) 10Ppchelko: [C: 03+2] Changeprop: add puppet CA cert to environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [17:21:10] <_joe_> cscott: interestingly I don't find errors for that in any envoy logs [17:21:24] (03Merged) 10jenkins-bot: Changeprop: add puppet CA cert to environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/587298 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [17:22:05] <_joe_> I find no error for testwiki in the envoy logs [17:22:55] _joe_: http://localhost:6002/w/rest.php/test.wikimedia.org/v3/transform/html/to/wikitext/Topic%3AU5y53rn0dp6h70nw [17:23:19] <_joe_> yeah I'm looking for that right now [17:23:20] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[17:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:25] i assume the localhost:6002 goes to envoy or some other router which actually brings it to the parsoid cluster [17:23:30] ? [17:24:42] <_joe_> yes [17:25:26] <_joe_> so interestingly, I find no trace of 503 errors in the parsoid logs, but lemme check the mediawiki appservers [17:25:49] <_joe_> actually, it's almost 8 pm [17:25:54] <_joe_> I'll do it tomorrow [17:26:03] <_joe_> this is not a complete breakage AFAICT [17:26:41] (03CR) 1020after4: [C: 03+1] zuul: provision the scap repository [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [17:27:28] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) I am also not able to ssh to `kafka-jumbo1007.mgmt.eqiad.wmnet` :( [17:27:59] (03PS1) 10Hnowlan: changeprop: Correct puppetca path. [deployment-charts] - 10https://gerrit.wikimedia.org/r/587562 (https://phabricator.wikimedia.org/T248677) [17:30:12] (03CR) 10Ppchelko: [C: 03+2] changeprop: Correct puppetca path. [deployment-charts] - 10https://gerrit.wikimedia.org/r/587562 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:30:31] (03Merged) 10jenkins-bot: changeprop: Correct puppetca path. [deployment-charts] - 10https://gerrit.wikimedia.org/r/587562 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:31:28] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[17:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:55] (03PS1) 10Andrew Bogott: Designate pools.yaml: A couple of updates [puppet] - 10https://gerrit.wikimedia.org/r/587564 [17:34:14] (03CR) 10Andrew Bogott: [C: 03+2] Designate pools.yaml: A couple of updates [puppet] - 10https://gerrit.wikimedia.org/r/587564 (owner: 10Andrew Bogott) [17:34:46] jouncebot: now [17:34:46] No deployments scheduled for the next 0 hour(s) and 25 minute(s) [17:34:49] jouncebot: next [17:34:49] In 0 hour(s) and 25 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200408T1800) [17:37:13] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['contint2001... [17:41:06] (03PS1) 10Andrew Bogott: Designate pools.yaml.erb: rearrange slightly [puppet] - 10https://gerrit.wikimedia.org/r/587567 [17:42:39] (03CR) 10Andrew Bogott: [C: 03+2] Designate pools.yaml.erb: rearrange slightly [puppet] - 10https://gerrit.wikimedia.org/r/587567 (owner: 10Andrew Bogott) [17:42:50] (03CR) 10Herron: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/587513 (https://phabricator.wikimedia.org/T244147) (owner: 10Ayounsi) [17:48:44] (03CR) 10Herron: [C: 03+1] "LGTM! https://puppet-compiler.wmflabs.org/compiler1001/21801/" [puppet] - 10https://gerrit.wikimedia.org/r/587427 (https://phabricator.wikimedia.org/T246961) (owner: 10Elukey) [17:50:26] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services): Assess whether we should still disable seccomp in Docker for CI - https://phabricator.wikimedia.org/T249729 (10hashar) p:05Triage→03Medium `seccomp` has too much o... 
[18:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200408T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:14:12] i've got some patches I could throw into the SWAT queue if you're looking for something to do [18:14:27] PROBLEM - PHP opcache health on mw2351 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:15:08] RoanKattouw, Niharika, Urbanecm: https://gerrit.wikimedia.org/r/#/q/hashtag:%22swat-2020-03-17-morning%22 got skipped, and i just haven't gotten around to putting them back in a swat queue yet [18:15:20] cscott: I've got a UBN to deploy, and am waiting on Reedy to finish his. [18:15:27] (03PS1) 10Ppchelko: Enable TLS for kafka connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/587573 (https://phabricator.wikimedia.org/T249644) [18:15:43] James_F: ok, definitely low priority! [18:16:21] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.26/extensions/TemplateData/includes/TemplateDataHooks.php: T236809 (duration: 01m 10s) [18:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:29] T236809: Refactor Parser.php to allow alternate parser (Parsoid) - https://phabricator.wikimedia.org/T236809 [18:17:40] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.27/extensions/TemplateData/includes/TemplateDataHooks.php: T236809 (duration: 01m 06s) [18:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:51] Reedy: All done? [18:17:55] Aye [18:17:57] OK. [18:22:45] (03CR) 10Ppchelko: "Kafka port 9093 is already open in the firewall rules for change-prop." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/587573 (https://phabricator.wikimedia.org/T249644) (owner: 10Ppchelko) [18:23:21] (03CR) 10Ppchelko: "The puppet_ca was already added in a prior patch for configuring talking to eventgate over https" [deployment-charts] - 10https://gerrit.wikimedia.org/r/587573 (https://phabricator.wikimedia.org/T249644) (owner: 10Ppchelko) [18:26:45] 10Operations, 10ops-eqsin: apply asset tags to s[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T244900 (10RobH) Please note that due to covid19 concerns, this task won't be accomplished until we have Jin onsite in late April to receive and install our router ordered for that location. [18:26:55] 10Operations, 10ops-eqsin: snag asset tags from ulsfo, ship some to eqsin - https://phabricator.wikimedia.org/T245056 (10RobH) [18:31:35] * James_F is soooooo bored waiting for CI. [18:33:22] [18:33:27] im really only going through the motion... [18:33:33] motions [18:36:17] (03CR) 1020after4: [C: 03+1] "so is the envoy_port (444) the one that external devices talk to? If so then we need to alter the proxy configuration, no?" [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [18:36:55] robh: … playing out my part. 
[18:37:41] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.27/skins/Vector: T248761: Revert moving indicators in DOM (duration: 01m 07s) [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:47] T248761: Move indicators underneath firstHeading - https://phabricator.wikimedia.org/T248761 [18:42:40] (03CR) 1020after4: [C: 03+1] "> Patch Set 14:" [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [18:43:27] (03PS10) 1020after4: ATS/phabricator: directly talk wss:// to aphlict [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [18:48:42] (03CR) 10Ppchelko: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [18:48:53] (03CR) 1020after4: ATS/phabricator: directly talk wss:// to aphlict (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [18:49:03] RECOVERY - PHP opcache health on mw2351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:54:29] (03CR) 10Vgutierrez: ATS/phabricator: directly talk wss:// to aphlict (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/569104 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:00:04] longma and James_F: #bothumor I � Unicode. All rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200408T1900). [19:02:02] * James_F waves. [19:02:42] !log deploying 1.35.0-wmf.27 to group1 [19:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:49] Whee. 
[19:04:17] (03PS1) 10Jeena Huneidi: group1 wikis to 1.35.0-wmf.27 refs T247774 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587583 [19:04:19] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.35.0-wmf.27 refs T247774 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587583 (owner: 10Jeena Huneidi) [19:05:34] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.27 refs T247774 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587583 (owner: 10Jeena Huneidi) [19:07:12] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.27 refs T247774 [19:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:18] T247774: 1.35.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T247774 [19:08:19] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.27 refs T247774 (duration: 01m 06s) [19:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:03] longma: Things LGTM. [19:11:35] 👍 [19:17:15] (03PS4) 10Jforrester: parsoidphp is dead, long live parsoid (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. Scott Ananian) [19:17:47] (03PS5) 10Jforrester: ProductionServices: Add 'parsoid' service to replace 'parsoidphp' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579021 (owner: 10C. Scott Ananian) [19:18:02] (03PS2) 10Jforrester: parsoidphp is dead, long live parsoid (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579042 (owner: 10C. Scott Ananian) [19:18:24] (03PS3) 10Jforrester: CommonSettings: Use 'parsoid' service in lieu of 'parsoidphp' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579042 (owner: 10C. Scott Ananian) [19:18:36] (03PS2) 10Jforrester: parsoidphp is dead, long live parsoid (part 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579043 (owner: 10C. 
Scott Ananian) [19:18:59] (03PS3) 10Jforrester: ProductionServices: Drop 'parsoidphp' service, we use 'parsoid' now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579043 (owner: 10C. Scott Ananian) [19:19:06] (03CR) 10Faidon Liambotis: "Comments over the screenshot in the task, because I don't have a local copy of CAS handy :) All nitpicks, bear with me!" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [19:29:45] (03PS1) 10DCausse: [mwgrep] only query live indices [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) [19:31:31] (03CR) 10jerkins-bot: [V: 04-1] [mwgrep] only query live indices [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse) [19:34:53] (03PS3) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [19:35:12] !log mstyles@deploy1001 Started deploy [wdqs/wdqs@c2995eb]: WDQS version 0.3.21 [19:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:22] (03CR) 10jerkins-bot: [V: 04-1] varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [19:38:41] (03PS2) 10DCausse: [mwgrep] only query live indices [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) [19:44:22] !log dpifke@deploy1001 Started deploy [performance/navtiming@4acb04d]: Deploy new navtiming with First Input Delay metric https://phabricator.wikimedia.org/T238091 [19:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:28] !log dpifke@deploy1001 Finished deploy [performance/navtiming@4acb04d]: Deploy new navtiming with First Input Delay metric 
https://phabricator.wikimedia.org/T238091 (duration: 00m 05s) [19:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:49] !log mstyles@deploy1001 Finished deploy [wdqs/wdqs@c2995eb]: WDQS version 0.3.21 (duration: 14m 37s) [19:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:19] !log restart wdqs-updater after deployment [19:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:46] (03CR) 10Krinkle: [mwgrep] only query live indices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse) [20:00:04] halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200408T2000). [20:00:30] (03CR) 10Krinkle: "Ah, that explains the results. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse) [20:00:36] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) {F31744254} [20:00:41] (03PS3) 10Ottomata: Eventgate-main: add mediawiki/page-suppress stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [20:00:56] RhinosF1: we finally did the deploy! we should be good! [20:03:09] (03CR) 10Ottomata: [C: 03+2] Eventgate-main: add mediawiki/page-suppress stream config [deployment-charts] - 10https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [20:04:17] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [20:04:17] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . 
[20:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:24] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10Krinkle) For colors and such, you may want to use this: [20:06:29] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [20:06:29] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [20:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:35] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10jbond) >>! In T233939#6041508, @Krinkle wrote: > For colors and such, you may want to use this: > Thanks @Krinkle il... [20:09:26] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'production' . [20:09:26] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-main' for release 'canary' . [20:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:36] (03CR) 10Ottomata: "Merged and applied in production." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/584667 (https://phabricator.wikimedia.org/T242025) (owner: 10Ppchelko) [20:20:35] PROBLEM - PHP opcache health on mw2365 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:26:17] 10Operations, 10Parsing-Team, 10Performance-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10Krinkle) Notes from TechCom meeting: * @tstarling mentioned that we can also look into using a bett... [20:31:35] RECOVERY - PHP opcache health on mw2365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:48:59] PROBLEM - PHP opcache health on mw2376 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:59:55] RECOVERY - PHP opcache health on mw2376 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:11:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:13:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:17:55] PROBLEM - PHP opcache health on mw2371 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:19:15] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.26/extensions/TemplateData/includes/TemplateDataHooks.php: Restore call to OutputPage::setupOOUI() (duration: 01m 09s) [21:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:30] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.27/extensions/TemplateData/includes/TemplateDataHooks.php: Restore call to OutputPage::setupOOUI() (duration: 01m 07s) [21:20:33] 10Operations, 10Patch-For-Review, 10User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (10Volker_E) We don't have a specific dev environment Wikimedia style, nonetheless some of the solutions shared in the Design Style Guide should and could be seen as universal for ou... [21:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:25] (03CR) 10VolkerE: apereo_cas: update templates login page (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: 10Jbond) [21:44:56] 10Operations, 10Parsing-Team, 10Performance-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) Another thought from the TechCom meeting: we could just have the CDN cache output for old rev...
[21:56:11] RECOVERY - PHP opcache health on mw2371 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:12:03] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:13:51] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[22:33:19] Operations, Patch-For-Review, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (jbond) >>! In T233939#6041699, @Volker_E wrote: first off thanks for the review defiantly appreciated and please keep with me, CSS is not my area :) > We don't have a specific de...
[22:38:18] (CR) Jdlrobson: [C: +1] "Was hoping to get time to land this today but I'm feeling a bit off. Will try again tomorrow." [mediawiki-config] - https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) (owner: Jforrester)
[22:50:27] Operations, Patch-For-Review, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (jbond) this is the new screen shot with the logo on the left {F31744707}
[22:51:17] (PS4) Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939)
[22:58:28] (CR) Jbond: "Thanks everyone for the reviews.
just wanted to say if you wanted to test this out and hack on the code you should be able to get somethin" (2 comments) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200408T2300). Please do the needful.
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:35:28] (CR) Krinkle: Do not update the globals cache file while opcache needs regeneration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/575469 (https://phabricator.wikimedia.org/T236104) (owner: Giuseppe Lavagetto)
[23:35:31] (CR) Krinkle: [C: +1] Do not update the globals cache file while opcache needs regeneration [mediawiki-config] - https://gerrit.wikimedia.org/r/575469 (https://phabricator.wikimedia.org/T236104) (owner: Giuseppe Lavagetto)
[23:36:09] (CR) Krinkle: [C: +1] "Be careful on the rebase for this one. A lot has changed since :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/575469 (https://phabricator.wikimedia.org/T236104) (owner: Giuseppe Lavagetto)
[23:43:38] Operations, Patch-For-Review, User-jbond: Wikimedia theme for SSO login page - https://phabricator.wikimedia.org/T233939 (Volker_E) We're following WCAG 2.0 level AA color contrast ratios, so something like a placeholder text color needs to provide 4.5:1 contrast. That minimum requirement here in our...
[23:44:23] (CR) Krinkle: apereo_cas: update templates login page (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[23:57:14] (CR) BryanDavis: apereo_cas: update templates login page (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)