[00:02:10] (03CR) 10Mstyles: "thanks for the review feedback, there's no rush on this, I don't mind getting things correct. I moved kibana.pp into the kibana directory," [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [00:20:39] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10colewhite) The problem the service label solves is that the service label is for aggregation on "instance-level" metric... [00:26:05] (03CR) 10Jdlrobson: Drop fallback support for wgMobileFrontendLogo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) (owner: 10Jforrester) [01:08:21] (03CR) 10Jforrester: Drop fallback support for wgMobileFrontendLogo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) (owner: 10Jforrester) [01:09:18] (03PS2) 10Jforrester: Drop fallback support for wgMobileFrontendLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) [01:32:40] (03CR) 10Jdlrobson: [C: 03+1] Drop fallback support for wgMobileFrontendLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) (owner: 10Jforrester) [01:33:14] (03CR) 10Jdlrobson: [C: 03+1] Drop fallback support for wgMobileFrontendLogo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584734 (https://phabricator.wikimedia.org/T248500) (owner: 10Jforrester) [01:54:25] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (10Papaul) [01:54:34] 10Operations, 10ops-codfw, 10DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - 
https://phabricator.wikimedia.org/T248934 (10Papaul) p:05Triage→03Medium [04:14:13] So a friend and I may have found a weird caching bug that affects IPv6 clients. He pointed out an issue with the "Freddo" article on enwiki, and I corrected it. He then said that he isn't seeing the change. We did some curl -svo and found that when hitting wmf over IPv6 he's getting a Last Modified date of 5 days ago, but when hitting over IPv4, it's the correct date. [04:14:37] Here's some curl output: https://gist.github.com/relrod/9c99d3d7658a086d0ce8faf4be70b5bf [04:15:11] relrod is the friend in question [04:19:56] In particular and worth noting, curling from another v6 host local to me with a different IP also gets the correct response. Curling on my laptop with -4 gets the correct response, and curling with -6 gets the wrong one. So some cache layer appears to be keying on my laptop's v6 address and serving the old version. [04:22:31] And tossing a ?1 at the end of the url forces the correct/new version. [04:39:21] phuzion: relrod: thanks for the report. it's not actually an issue with v6 clients; our first layer of loadbalancing does consistent hashing by client IP address, and one of our reverse proxies in codfw never got the purge event for the edit. [04:39:44] just issued a manual purge and it is now fixed [04:40:13] cdanis: is there anything that a user can do from a client side to issue a similar purge, or is it something that you guys have to do on the backend? [04:40:44] I tried action=purge earlier today and it obviously didn't do anything [04:40:52] Also, thanks for looking into it! 
[04:41:02] ?action=purge should have done it [04:41:06] that's all I did right now :) [04:41:29] I tried that earlier too and it didn't have any effect but seems to have worked now [04:41:33] weird [04:42:07] making a null edit ought to work as well, I think [04:42:57] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10bearND) This is strange. When I run this [[ http://localhost:... [04:43:55] cdanis: But yeah I figured there had to be some caching layer there keying on IP address, but I hadn't heard of that before. What does keying on IP solve? [04:44:59] relrod: TCP Fast Open, SSL session persistence [04:46:44] oh, makes sense [04:47:06] not necessary for correctness, but a nice performance optimization [04:47:31] yep, makes total sense [04:47:34] cdanis: thanks :) [04:49:31] 10Operations, 10Traffic: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10Vgutierrez) [04:49:54] np :) [04:53:18] 10Operations, 10Traffic: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10Vgutierrez) p:05Triage→03Medium `name=TSRemapDeleteInstance stacktrace Mar 30 12:07:56 cp2013 traffic_manager[32876]: traffic_server: received signal 11 (Segmentation fault) Mar 30 12:07:56 cp2013... 
[04:58:13] 10Operations, 10Traffic: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10Vgutierrez) This issue seems to be identified by upstream at https://github.com/apache/trafficserver/pull/6403 but the fix hasn't been backported to ATS 8.x [05:00:21] (03PS1) 10Vgutierrez: Release 8.0.6-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) [05:07:33] (03PS2) 10Vgutierrez: Release 8.0.6-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) [05:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3314 for schema change', diff saved to https://phabricator.wikimedia.org/P10822 and previous config saved to /var/cache/conftool/dbconfig/20200331-051354-marostegui.json [05:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:58] (03PS1) 10Vgutierrez: site: Reimage cp2034 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584813 (https://phabricator.wikimedia.org/T248816) [05:23:58] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2034 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584813 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [05:25:41] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2034.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[05:26:33] !log Deploy schema change on db1097:3314 [05:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:03] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Vgutierrez) [05:39:23] (03PS1) 10Marostegui: wmf-pt-kill: Update package to PT 3.1.0 [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/584814 (https://phabricator.wikimedia.org/T248843) [05:41:16] Updating cxserver in few minutes.. [05:42:51] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-03-30-145349-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/584618 (https://phabricator.wikimedia.org/T248578) (owner: 10KartikMistry) [05:43:09] (03Merged) 10jenkins-bot: Update cxserver to 2020-03-30-145349-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/584618 (https://phabricator.wikimedia.org/T248578) (owner: 10KartikMistry) [05:45:45] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2007 [puppet] - 10https://gerrit.wikimedia.org/r/584815 (https://phabricator.wikimedia.org/T248941) [05:46:21] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [05:46:22] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [05:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:25] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:37] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [05:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:11] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[05:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:28] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10SNowick_WMF) [05:49:58] (03PS1) 10Marostegui: dbproxy1010: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/584816 (https://phabricator.wikimedia.org/T248944) [05:50:01] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) [05:50:31] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:50:32] (03PS2) 10Vgutierrez: site,install_server: Decommission cp2007 [puppet] - 10https://gerrit.wikimedia.org/r/584815 (https://phabricator.wikimedia.org/T248941) [05:50:49] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/584816 (https://phabricator.wikimedia.org/T248944) (owner: 10Marostegui) [05:51:02] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [05:51:09] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2034.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2034.codfw.wmnet'] ` [05:51:33] kartik@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[05:52:41] (03CR) 10Muehlenhoff: "All great minds think alike :-) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/583030/" [puppet] - 10https://gerrit.wikimedia.org/r/584696 (owner: 10Jbond) [05:52:45] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2007 [puppet] - 10https://gerrit.wikimedia.org/r/584815 (https://phabricator.wikimedia.org/T248941) (owner: 10Vgutierrez) [05:53:26] !log depool && decommission cp2007 - T248941 [05:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:32] T248941: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 [05:54:03] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:54:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [05:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:20] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2007.codfw.wmnet` - cp2007.codfw.wmnet (**PASS**) - Downtimed h... [05:55:22] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Marostegui) >>! In T231520#6011841, @bd808 wrote: >>>! In T231520#6009325, @Marostegui wrote: >> I have also run some queries via Quarry and I have seen th... 
[05:55:42] !log Updated cxserver to 2020-03-30-145349-production (T248578) [05:55:45] !log Drop nova and nova_api from m5 master (db1133) - T248313 [05:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:47] T248578: cxserver alerting for "Suggest source sections to translate" since 2020-03-25 18:40:17 - https://phabricator.wikimedia.org/T248578 [05:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:51] T248313: Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313 [06:20:07] (03CR) 10Muehlenhoff: "This is meant to be opt-in for some roles (or some canary hosts), how about we simply add a "requires_os(debian >= buster)" to profile::ba" [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [06:29:50] (03PS4) 10Giuseppe Lavagetto: scap: define MW canaries dynamically [puppet] - 10https://gerrit.wikimedia.org/r/465411 [06:29:52] (03PS1) 10Giuseppe Lavagetto: conftool-data: add "canary" faux service to appservers [puppet] - 10https://gerrit.wikimedia.org/r/584861 [06:46:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1097:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P10824 and previous config saved to /var/cache/conftool/dbconfig/20200331-064627-marostegui.json [06:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3314 for schema change', diff saved to https://phabricator.wikimedia.org/P10825 and previous config saved to /var/cache/conftool/dbconfig/20200331-064707-marostegui.json [06:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:14] !log Deploy schema change on db1103:3314 [06:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:53] (03PS1) 10KartikMistry: apertium-en-gl: Fix FTBFS 
with apertium 3.6 [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/584870 (https://phabricator.wikimedia.org/T247585) [06:57:07] (03PS1) 10Vgutierrez: Remove cp2007 entries [dns] - 10https://gerrit.wikimedia.org/r/584871 (https://phabricator.wikimedia.org/T248941) [06:57:44] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2007 entries [dns] - 10https://gerrit.wikimedia.org/r/584871 (https://phabricator.wikimedia.org/T248941) (owner: 10Vgutierrez) [06:59:07] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2007.codfw.wmnet - https://phabricator.wikimedia.org/T248941 (10Vgutierrez) a:05Vgutierrez→03Papaul [07:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P10826 and previous config saved to /var/cache/conftool/dbconfig/20200331-071401-marostegui.json [07:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:18] 10Operations, 10LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10darthmon_wmde) [07:15:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1081 for schema change', diff saved to https://phabricator.wikimedia.org/P10827 and previous config saved to /var/cache/conftool/dbconfig/20200331-071547-marostegui.json [07:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:07] !log Deploy schema change on db1081 [07:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:14] !log pool cp2034 - T248816 [07:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:18] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [07:17:55] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [07:19:02] 
10Operations, 10LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10darthmon_wmde) [07:19:30] 10Operations, 10LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10darthmon_wmde) [07:25:04] (03PS1) 10Vgutierrez: site: Reimage cp2035 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584873 (https://phabricator.wikimedia.org/T248816) [07:28:04] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10Aklapper) (#Operations: On a related note, I now see that https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ is... [07:31:00] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2035 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584873 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [07:33:01] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2035.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[07:35:15] (03CR) 10Elukey: "Mstyles: did another pass, left some more comments (some to check, more specific to the code, others are nit that we can do or skip depend" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [07:35:24] (03PS1) 10Awight: Disable TwoColConflict talk page workflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584874 (https://phabricator.wikimedia.org/T230231) [07:38:27] (03PS1) 10Vgutierrez: ATS: Enable TLS Session tickets in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/584877 (https://phabricator.wikimedia.org/T245616) [07:39:35] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10jcrespo) a:03jcrespo This request has to be approved by @kzimmerman here on phabricator. On the service owner side, I think @thcipriani would be the right person to al... [07:41:25] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10jcrespo) [07:43:03] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/21628/" [puppet] - 10https://gerrit.wikimedia.org/r/584877 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [07:43:04] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10jcrespo) [07:48:45] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (10Vgutierrez) [07:49:35] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [07:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:47] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [07:49:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:45] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/584878 (https://phabricator.wikimedia.org/T248950) [07:52:13] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10jcrespo) [07:53:25] 10Operations, 10LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10Aklapper) (Thanks for filing this! In the future feel free to also file a task to disable the Phab account - done that now :) [07:54:34] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2035.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2035.codfw.wmnet'] ` [07:55:48] (03CR) 10Dzahn: [C: 03+2] add planet1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/584612 (https://phabricator.wikimedia.org/T247651) (owner: 10Dzahn) [07:55:51] (03PS2) 10Dzahn: add planet1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/584612 (https://phabricator.wikimedia.org/T247651) [08:01:14] !log delete unused ROA for ARIN v4 prefixes - T235886 [08:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:20] T235886: IRR updates needed - https://phabricator.wikimedia.org/T235886 [08:01:44] 10Operations, 10Traffic, 10Patch-For-Review: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938 (10ema) Another ongoing issue which causes traffic_server to crash upon configuration reloads and related to tslua is T242952. [08:02:42] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (10jcrespo) This request needs explicit approval of @WDoranWMF here. 
Either @Thcipriani or SRE team would be the right people to own mwmaint servers, requesting his ok. Do y... [08:04:15] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (10jcrespo) [08:05:30] (03CR) 10Ema: [C: 03+1] "Nice!" (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [08:08:10] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [08:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:29] (03PS3) 10Vgutierrez: Release 8.0.6-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) [08:10:58] (03CR) 10Vgutierrez: Release 8.0.6-1wm5 (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [08:11:09] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10ema) The issue occurred yesterday on cp2023 and cp1081: ` Mar 30 13:55:02 cp2023 traffic_manager[17786]: PANIC: unprotected error in call to Lua API (attempt to co... 
[08:12:10] 10Operations: Upgrade ferm to 2.5.1 - https://phabricator.wikimedia.org/T248954 (10MoritzMuehlenhoff) [08:12:20] (03CR) 10Ema: [C: 03+1] Release 8.0.6-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [08:12:48] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.6-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/584812 (https://phabricator.wikimedia.org/T248938) (owner: 10Vgutierrez) [08:17:28] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10jcrespo) [08:17:35] (03CR) 10Ema: "> This is meant to be opt-in for some roles (or some canary hosts)," [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [08:17:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [08:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:15] (03PS1) 10Giuseppe Lavagetto: eventstreams: revert removal of monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/584880 [08:19:28] <_joe_> vgutierrez: I'd use a quick review of ^^ [08:19:36] checking [08:19:53] (03PS1) 10Dzahn: DHCP: add planet1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584881 (https://phabricator.wikimedia.org/T248863) [08:20:26] (03CR) 10Dzahn: [C: 03+2] DHCP: add planet1002.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/584881 (https://phabricator.wikimedia.org/T248863) (owner: 10Dzahn) [08:20:51] (03CR) 10Vgutierrez: [C: 03+1] eventstreams: revert removal of monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/584880 (owner: 10Giuseppe Lavagetto) [08:21:06] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 
(10jcrespo) So I am guessing it may not be there so people are forced to read the documentation, which is linked and has the... [08:21:07] 10Operations, 10vm-requests, 10Patch-For-Review: Site: EQIAD/CODFW 2 VM request for planet - https://phabricator.wikimedia.org/T248863 (10Dzahn) a:03Dzahn [08:22:23] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10jcrespo) @Nuria as project lead for analytics, I am requesting your ok for the above access. Thank you! [08:22:53] (03CR) 10jerkins-bot: [V: 04-1] eventstreams: revert removal of monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/584880 (owner: 10Giuseppe Lavagetto) [08:22:56] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10jcrespo) a:03Nuria [08:24:11] (03PS2) 10Jcrespo: aklapper: access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/584676 (https://phabricator.wikimedia.org/T248905) (owner: 10Aklapper) [08:25:25] (03CR) 10Jcrespo: [C: 03+1] "+1 But waiting for analytics lead ok as per manual "Get approvals from the following people [...] 
The project lead where your access will " [puppet] - 10https://gerrit.wikimedia.org/r/584676 (https://phabricator.wikimedia.org/T248905) (owner: 10Aklapper) [08:25:31] (03CR) 10Dzahn: [C: 03+1] aklapper: access to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/584676 (https://phabricator.wikimedia.org/T248905) (owner: 10Aklapper) [08:25:34] (03PS1) 10Marostegui: install_server: Reimage db2093 as buster [puppet] - 10https://gerrit.wikimedia.org/r/584882 [08:27:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1081 after schema change', diff saved to https://phabricator.wikimedia.org/P10828 and previous config saved to /var/cache/conftool/dbconfig/20200331-082711-marostegui.json [08:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:31] (03PS2) 10Ema: systemd: add support for network accounting [puppet] - 10https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) [08:28:06] (03PS1) 10Dzahn: add IPv6 records for planet1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/584883 (https://phabricator.wikimedia.org/T248863) [08:28:51] (03PS2) 10Marostegui: install_server: Reimage db2093 as buster [puppet] - 10https://gerrit.wikimedia.org/r/584882 (https://phabricator.wikimedia.org/T248957) [08:29:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 for schema change', diff saved to https://phabricator.wikimedia.org/P10829 and previous config saved to /var/cache/conftool/dbconfig/20200331-082904-marostegui.json [08:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:17] !log Depool db1084 for schema change [08:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:23] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (10jcrespo) p:05Triage→03Medium [08:29:36] (03PS1) 10Dzahn: site: add planet[12]002 
[puppet] - 10https://gerrit.wikimedia.org/r/584884 (https://phabricator.wikimedia.org/T248863) [08:30:25] 10Operations, 10SRE-Access-Requests, 10Developer-Advocacy (Apr-Jun 2020), 10Patch-For-Review: Add aklapper to analytics-privatedata-users - https://phabricator.wikimedia.org/T248905 (10jcrespo) p:05Triage→03Medium [08:31:16] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2093 as buster [puppet] - 10https://gerrit.wikimedia.org/r/584882 (https://phabricator.wikimedia.org/T248957) (owner: 10Marostegui) [08:31:17] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (10jcrespo) a:03WDoranWMF [08:31:22] !log signed puppet cert for planet1002.eqiad.wmnet [08:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:06] (03CR) 10Dzahn: [C: 03+2] site: add planet[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/584884 (https://phabricator.wikimedia.org/T248863) (owner: 10Dzahn) [08:34:47] 10Operations, 10Analytics, 10Product-Analytics, 10SRE-Access-Requests: Hive access for Sam Patton - https://phabricator.wikimedia.org/T248097 (10jcrespo) Just a reminder that once a conversation with @Nuria or someone else from analytics is done to understand your needs, to follow the procedure at https://... 
[08:35:28] (03PS1) 10Marostegui: install_server: Allow db2093 reimage without formating /srv [puppet] - 10https://gerrit.wikimedia.org/r/584886 (https://phabricator.wikimedia.org/T248957) [08:35:41] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10jcrespo) p:05Triage→03Medium [08:36:16] 10Operations, 10SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for snowick - https://phabricator.wikimedia.org/T248943 (10jcrespo) a:05jcrespo→03kzimmerman [08:42:52] 10Operations, 10SRE-Access-Requests: Add aaron, dpifke and phedenskog to analytics-privatedata-users - https://phabricator.wikimedia.org/T248797 (10jcrespo) p:05Triage→03Medium a:03Gilles Hi, Gilles, Could you use the template at https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ I know this... [08:44:44] !log pool cp2035 - T248816 [08:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:50] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [08:46:29] (03CR) 10Dzahn: [C: 03+2] add IPv6 records for planet1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/584883 (https://phabricator.wikimedia.org/T248863) (owner: 10Dzahn) [08:46:33] (03PS2) 10Dzahn: add IPv6 records for planet1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/584883 (https://phabricator.wikimedia.org/T248863) [08:46:41] (03PS2) 10Giuseppe Lavagetto: eventstreams: revert removal of monitoring class [puppet] - 10https://gerrit.wikimedia.org/r/584880 [08:47:25] (03CR) 10Jcrespo: [C: 03+1] install_server: Allow db2093 reimage without formating /srv [puppet] - 10https://gerrit.wikimedia.org/r/584886 (https://phabricator.wikimedia.org/T248957) (owner: 10Marostegui) [08:48:31] (03CR) 10Marostegui: [C: 03+2] install_server: Allow db2093 reimage without formating /srv [puppet] - 10https://gerrit.wikimedia.org/r/584886 
(https://phabricator.wikimedia.org/T248957) (owner: 10Marostegui) [08:51:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21630/ fixes compilation and should be a noop overall" [puppet] - 10https://gerrit.wikimedia.org/r/584880 (owner: 10Giuseppe Lavagetto) [08:51:55] (03PS1) 10Dzahn: switch backend for planet to planet1002 [dns] - 10https://gerrit.wikimedia.org/r/584887 (https://phabricator.wikimedia.org/T247651) [08:52:04] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Disable TwoColConflict talk page workflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584874 (https://phabricator.wikimedia.org/T230231) (owner: 10Awight) [08:52:31] !log Stop MySQL on db2093 for reimage to buster [08:53:01] marostegui: Failed to log message to wiki. Somebody should check the error logs. [08:56:36] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [08:58:38] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10jcrespo) [09:00:48] !log depool & decommission cp2011 - T248950 [09:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:53] T248950: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 [09:01:00] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/584878 (https://phabricator.wikimedia.org/T248950) (owner: 10Vgutierrez) [09:01:26] 10Operations, 10SRE-Access-Requests: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10jcrespo) [09:02:46] (03PS2) 10Vgutierrez: site,install_server: Decommission cp2011 [puppet] - 10https://gerrit.wikimedia.org/r/584878 (https://phabricator.wikimedia.org/T248950) [09:03:07] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: add notes for dumps/backup recipes [puppet] - 
https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:04:56] (PS4) Filippo Giunchedi: install_server: add notes for dumps/backup recipes [puppet] - https://gerrit.wikimedia.org/r/584559 (https://phabricator.wikimedia.org/T156955)
[09:05:25] !log upload trafficserver 8.0.5-1wm6 to apt.wm.o (buster) - T248938
[09:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:30] T248938: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938
[09:07:07] (CR) Vgutierrez: [C: +2] site,install_server: Decommission cp2011 [puppet] - https://gerrit.wikimedia.org/r/584878 (https://phabricator.wikimedia.org/T248950) (owner: Vgutierrez)
[09:07:22] (PS3) Vgutierrez: site,install_server: Decommission cp2011 [puppet] - https://gerrit.wikimedia.org/r/584878 (https://phabricator.wikimedia.org/T248950)
[09:08:26] (CR) Giuseppe Lavagetto: check_opcache: Use the number of scripts to determine threshold (2 comments) [puppet] - https://gerrit.wikimedia.org/r/583906 (owner: Giuseppe Lavagetto)
[09:09:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission
[09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:57] Operations, ops-codfw, DBA: (Need by: TBD) codfw: rack/setup/install backup2002/array backup2002-array1 - https://phabricator.wikimedia.org/T248934 (jcrespo)
[09:10:19] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[09:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:25] Operations, ops-codfw, DC-Ops, Traffic, and 2 others: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2011.codfw.wmnet` - cp2011.codfw.wmnet (**PASS**) - Downtimed h...
[09:12:32] (CR) Dzahn: "> Patch Set 4: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[09:12:40] (PS1) Vgutierrez: Remove cp2011 entries [dns] - https://gerrit.wikimedia.org/r/584890 (https://phabricator.wikimedia.org/T248816)
[09:14:04] (CR) Vgutierrez: [C: +2] Remove cp2011 entries [dns] - https://gerrit.wikimedia.org/r/584890 (https://phabricator.wikimedia.org/T248816) (owner: Vgutierrez)
[09:14:06] (CR) Ayounsi: "> Patch Set 1:" [debs/pynetbox] (debian) - https://gerrit.wikimedia.org/r/553741 (owner: Hashar)
[09:14:29] Operations, ops-codfw, DC-Ops, Traffic, decommission: decommission cp2011.codfw.wmnet - https://phabricator.wikimedia.org/T248950 (Vgutierrez) a:Vgutierrez→Papaul
[09:14:36] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:14:53] Operations, Traffic, Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (Vgutierrez)
[09:15:04] (CR) Ayounsi: [C: +1] "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[09:15:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[09:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:25] (PS2) Giuseppe Lavagetto: check_opcache: Use the number of scripts to determine threshold [puppet] - https://gerrit.wikimedia.org/r/583906
[09:16:47] Operations, LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (jcrespo) a:jcrespo
[09:18:07] (PS1) Volans: admin: grant user tarrow analytics access [puppet] - https://gerrit.wikimedia.org/r/584892
(https://phabricator.wikimedia.org/T248498)
[09:19:20] (PS1) Jcrespo: admin: Remove "Matthias Geisler" from the wmde LDAP group [puppet] - https://gerrit.wikimedia.org/r/584893 (https://phabricator.wikimedia.org/T248949)
[09:19:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:45] (CR) Jcrespo: "Will also remove from ldap groups through commands." [puppet] - https://gerrit.wikimedia.org/r/584893 (https://phabricator.wikimedia.org/T248949) (owner: Jcrespo)
[09:21:12] (CR) Giuseppe Lavagetto: [C: +2] check_opcache: Use the number of scripts to determine threshold [puppet] - https://gerrit.wikimedia.org/r/583906 (owner: Giuseppe Lavagetto)
[09:24:23] (PS1) Volans: admin: convert user itamar from ldap to shell [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482)
[09:24:25] (PS1) Volans: admin: grant user itamar analytics access [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248498)
[09:26:10] (CR) Jbond: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/584696 (owner: Jbond)
[09:26:59] Operations, LDAP-Access-Requests, Patch-For-Review: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (jcrespo) Removed bitpogo from NDA, WMDE ldap groups.
[09:27:03] (CR) Addshore: Enable WikibaseQualityConstraints on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) (owner: Cparle)
[09:27:09] Operations, LDAP-Access-Requests, Patch-For-Review: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (jcrespo) p:Triage→Medium
[09:27:21] (Abandoned) Jbond: cloud - puppet: use puppet5 and facter 3 by default [puppet] - https://gerrit.wikimedia.org/r/584696 (owner: Jbond)
[09:27:29] (PS2) Jcrespo: admin: Remove "Matthias Geisler" from the nda, wmde LDAP groups [puppet] - https://gerrit.wikimedia.org/r/584893 (https://phabricator.wikimedia.org/T248949)
[09:28:40] (PS4) Cparle: Enable WikibaseQualityConstraints on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117)
[09:31:21] (PS2) Volans: admin: grant user itamar analytics access [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482)
[09:32:21] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (Volans) a:Volans→None @jcrespo https://gerrit.wikimedia.org/r/c/operations/puppet/+/584892/ is ready for review and merge.
[09:32:31] (CR) Cparle: Enable WikibaseQualityConstraints on commons (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117) (owner: Cparle)
[09:32:38] (PS5) Cparle: Enable WikibaseQualityConstraints on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/581678 (https://phabricator.wikimedia.org/T248117)
[09:33:11] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (Gilles)
[09:33:40] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1059 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:34:19] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (Gilles) I approve this request as manager of aaron, dpifke and phedenskog
[09:34:32] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (Gilles) a:Gilles→Nuria
[09:34:46] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (Gilles) Assigning to @nuria for approval
[09:36:33] !log push homer diffs to mr1-eqiad
[09:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:11] looking at ms-be1059
[09:37:36] Operations, Performance-Team: Occasional NIC Tx bandwidth saturation for mc1027 - https://phabricator.wikimedia.org/T248962 (elukey) p:Triage→High
[09:38:04] RECOVERY - DPKG on snapshot1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:38:28] Operations, Performance-Team: Occasional
NIC Tx bandwidth saturation for mc1027 - https://phabricator.wikimedia.org/T248962 (elukey)
[09:41:40] RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:21] (CR) Ayounsi: "Tested on mr1-eqiad and works as expected." [software/homer] - https://gerrit.wikimedia.org/r/584689 (https://phabricator.wikimedia.org/T244363) (owner: Volans)
[09:49:03] !log push homer diffs to mr1-eqsin
[09:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:50] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/584553 (https://phabricator.wikimedia.org/T183146) (owner: Ema)
[09:55:14] Operations, Release Pipeline, Release-Engineering-Team-TODO, Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (akosiaris)
[09:58:34] (CR) Ayounsi: [C: +1] "LGTM." [software/homer] - https://gerrit.wikimedia.org/r/584689 (https://phabricator.wikimedia.org/T244363) (owner: Volans)
[09:59:45] (CR) Volans: [C: +2] junos: retry when a timeout occurs during commits [software/homer] - https://gerrit.wikimedia.org/r/584689 (https://phabricator.wikimedia.org/T244363) (owner: Volans)
[10:01:04] Operations, Analytics, netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (elukey) Got it, so basically do all the process of creating the schemas etc.. but push data from pmacct directly to the kafka topic bypassing Eventgate. The fact that we'll have to hav...
[10:02:37] (Merged) jenkins-bot: junos: retry when a timeout occurs during commits [software/homer] - https://gerrit.wikimedia.org/r/584689 (https://phabricator.wikimedia.org/T244363) (owner: Volans)
[10:02:44] (PS3) Ema: conftool::scripts: add ispooled [puppet] - https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067)
[10:03:27] !log add BGP to AS41327 in AMS-IX
[10:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:04] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1059 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:09:34] (PS1) Ema: cache: update service restart scripts comments [puppet] - https://gerrit.wikimedia.org/r/584901 (https://phabricator.wikimedia.org/T238625)
[10:19:14] (CR) Jcrespo: [C: +2] admin: Remove "Matthias Geisler" from the nda, wmde LDAP groups [puppet] - https://gerrit.wikimedia.org/r/584893 (https://phabricator.wikimedia.org/T248949) (owner: Jcrespo)
[10:19:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1084 after schema change', diff saved to https://phabricator.wikimedia.org/P10831 and previous config saved to /var/cache/conftool/dbconfig/20200331-101953-marostegui.json
[10:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:31] (PS1) Ema: cache: use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/584902
[10:22:48] (CR) Ema: "pcc here: https://puppet-compiler.wmflabs.org/compiler1002/21631/" [puppet] - https://gerrit.wikimedia.org/r/584902 (owner: Ema)
[10:24:47] (CR) Jbond: "lgtm minor nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067) (owner: Ema)
[10:26:34] Operations, LDAP-Access-Requests, Patch-For-Review: Remove the account "Matthias Geisler" from the wmde LDAP group -
https://phabricator.wikimedia.org/T248949 (jcrespo) @KFrancis I am not sure if this is the correct communication method: But would you need to update "Matthias Geisler" (matthias.ge...
[10:27:54] (CR) Jcrespo: [C: +1] admin: grant user tarrow analytics access [puppet] - https://gerrit.wikimedia.org/r/584892 (https://phabricator.wikimedia.org/T248498) (owner: Volans)
[10:28:08] (PS2) Jcrespo: admin: grant user tarrow analytics access [puppet] - https://gerrit.wikimedia.org/r/584892 (https://phabricator.wikimedia.org/T248498) (owner: Volans)
[10:31:46] (CR) Jbond: "LGTM fyi this no longer needs approval in the Monday meeting as indicated[1]" [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[10:31:52] (CR) Jbond: [C: +1] admin: convert user itamar from ldap to shell [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[10:32:01] Operations, SRE-tools: Homer: commit timeout on MX104 and SRXs - https://phabricator.wikimedia.org/T244363 (Volans) Open→Resolved With the automatic retry added to homer this problem has been work-arounded. It's not possible to commit to those devices via homer. The first attempt will commit but...
[10:32:46] (CR) Jbond: [C: +1] "lgtm and ditto monday meeting" [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[10:32:55] (CR) Volans: [C: -1] "I'm waiting confirmation of the key on a side channel too, to be on the safe side. I'll remove the -1 and update once done."
[puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[10:33:50] (CR) Jbond: [C: +1] "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[10:34:18] (PS1) Giuseppe Lavagetto: mediawiki: switch TLS termination to envoy on appserver canaries [puppet] - https://gerrit.wikimedia.org/r/584904 (https://phabricator.wikimedia.org/T247389)
[10:36:33] (CR) Jbond: [C: +1] admin: grant user tarrow analytics access [puppet] - https://gerrit.wikimedia.org/r/584892 (https://phabricator.wikimedia.org/T248498) (owner: Volans)
[10:38:33] Operations, LDAP-Access-Requests, Patch-For-Review: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (jcrespo) On the SRE-production side this is done, only waiting legal for the above consultation. Wikitech account is kept as normal, just wit...
[10:44:58] (CR) Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/21633/ the change does the right thing." [puppet] - https://gerrit.wikimedia.org/r/584904 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[10:46:30] <_joe_> !log disabled puppet on canary appservers, potentially dangerous change ahead
[10:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:41] (CR) Ema: [C: +1] mediawiki: switch TLS termination to envoy on appserver canaries [puppet] - https://gerrit.wikimedia.org/r/584904 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[10:49:07] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (jcrespo) Open→Resolved I am going to close this as resolved. If user cannot access previous account, we cannot clean it up as we cannot really verify it is hi...
[10:54:55] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (jcrespo) I believe the reason for the confusion is: T226091#5288600. I will remove the extra (old) account from LDAP.
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1100).
[11:00:04] kart_ and awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:06] o/
[11:00:43] * kart_ is here
[11:01:22] ok, I can SWAT
[11:01:56] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/584574 (https://phabricator.wikimedia.org/T248179) (owner: KartikMistry)
[11:02:17] (PS2) Lucas Werkmeister (WMDE): Enable ContentTranslation in Lithuanian Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/584574 (https://phabricator.wikimedia.org/T248179) (owner: KartikMistry)
[11:02:20] Lucas_WMDE: thanks!
:)
[11:03:08] (CR) Lucas Werkmeister (WMDE): [C: +2] "poke zuul" [mediawiki-config] - https://gerrit.wikimedia.org/r/584574 (https://phabricator.wikimedia.org/T248179) (owner: KartikMistry)
[11:04:45] (CR) Lucas Werkmeister (WMDE): [C: +1] Disable TwoColConflict talk page workflow [mediawiki-config] - https://gerrit.wikimedia.org/r/584874 (https://phabricator.wikimedia.org/T230231) (owner: Awight)
[11:06:39] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (jcrespo) Resolved→Open This is the status right now: ` uid: josepita cn: Jose pita email: jpita-ctr@wikimedia.org ldap groups: wmf uid:jpita cn: Jpita email:...
[11:06:59] (Merged) jenkins-bot: Enable ContentTranslation in Lithuanian Wikipedia as a default tool [mediawiki-config] - https://gerrit.wikimedia.org/r/584574 (https://phabricator.wikimedia.org/T248179) (owner: KartikMistry)
[11:07:30] kart_: it’s on mwdebug1001, please test
[11:08:17] seems to be working as far as I can test myself (https://lt.wikipedia.org/wiki/Specialus:ContentTranslation)
[11:08:37] Lucas_WMDE: yep. Tested :)
[11:08:38] (CR) Jcrespo: [C: +2] admin: grant user tarrow analytics access [puppet] - https://gerrit.wikimedia.org/r/584892 (https://phabricator.wikimedia.org/T248498) (owner: Volans)
[11:08:41] Lucas_WMDE: go ahead.
[11:08:44] cool, thanks!
[11:09:41] awight: you’re up next once this is done
[11:10:18] Lucas_WMDE: ty!
[11:10:21] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:584574|Enable ContentTranslation in Lithuanian Wikipedia as a default tool (T248179)]] (duration: 01m 00s)
[11:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:27] T248179: Enable Content Translation in Lithuanian Wikipedia as a default tool - https://phabricator.wikimedia.org/T248179
[11:11:26] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:584574|Enable ContentTranslation in Lithuanian Wikipedia as a default tool (T248179)]], take II (duration: 00m 59s)
[11:11:28] (CR) Jbond: [C: +2] network: add new function to return ip lists used in ACLs [puppet] - https://gerrit.wikimedia.org/r/583340 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[11:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:04] (CR) Lucas Werkmeister (WMDE): [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/584874 (https://phabricator.wikimedia.org/T230231) (owner: Awight)
[11:12:55] (Merged) jenkins-bot: Disable TwoColConflict talk page workflow [mediawiki-config] - https://gerrit.wikimedia.org/r/584874 (https://phabricator.wikimedia.org/T230231) (owner: Awight)
[11:13:46] oh, I didn’t know Jenkins can rebase changes itself
[11:13:47] that’s cool
[11:14:07] awight: the change is on mwdebug1001, but IIUC there’s not much to test?
[11:14:13] check that nothing is broken, I guess?
[11:14:25] It doesn't, depends on the merge strategy.
[11:17:09] awight: ping?
[11:25:03] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (jcrespo) @Tarrow, after a few minutes pass (~30) you should be able to log in following https://wikitech.wikimedia.org/wiki/Production_access#Setting_up_your_acc...
[11:27:16] (PS3) Jbond: profile::base::firewall: add support for abuse_networks [puppet] - https://gerrit.wikimedia.org/r/583341 (https://phabricator.wikimedia.org/T233945)
[11:27:17] no awight to be seen…
[11:27:25] the change is harmless enough so I guess I’ll sync it anyways
[11:28:24] Lucas_WMDE: +1
[11:28:28] (CR) Hnowlan: "> Did you test that if you enable all of these change-prop actually starts?" [deployment-charts] - https://gerrit.wikimedia.org/r/584637 (https://phabricator.wikimedia.org/T248677) (owner: Hnowlan)
[11:29:04] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:584874|Disable TwoColConflict talk page workflow (T230231)]] (duration: 00m 58s)
[11:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:10] T230231: Allow resolution suggestions of edit conflicts on talk pages - https://phabricator.wikimedia.org/T230231
[11:30:13] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:584874|Disable TwoColConflict talk page workflow (T230231)]], take II (duration: 00m 57s)
[11:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:33] !log EU SWAT done
[11:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:03] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (jcrespo)
[11:42:11] (CR) Jbond: [C: +2] profile::base::firewall: add support for abuse_networks [puppet] - https://gerrit.wikimedia.org/r/583341 (https://phabricator.wikimedia.org/T233945) (owner: Jbond)
[11:46:06] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 165.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[11:51:16] (PS1) Jbond: firewall: ensure abuse network blocks are placed first [puppet] - https://gerrit.wikimedia.org/r/584912 (https://phabricator.wikimedia.org/T233945)
[11:54:16] Lucas_WMDE: Hi sorry to be distracted there. Yes, the config patch should have been a no-op, I'll check now for breakage.
[11:55:42] Looks good.
[11:59:15] Operations, Security-Team, User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (jbond) @HMarcus I'm going to pick this up and take a look are you able to provide me with credentials. thanks
[12:01:39] (CR) Awight: "thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/584874 (https://phabricator.wikimedia.org/T230231) (owner: Awight)
[12:08:28] (PS1) 4nn1l2: Enable Local upload on azbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/584913
[12:12:22] (CR) Vgutierrez: [C: +1] mediawiki: switch TLS termination to envoy on appserver canaries [puppet] - https://gerrit.wikimedia.org/r/584904 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[12:19:08] (PS2) 4nn1l2: Enable Local upload on azbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/584913 (https://phabricator.wikimedia.org/T248971)
[12:19:59] (PS2) Jcrespo: admin: convert user itamar from ldap to shell [puppet] - https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[12:20:01] (PS3) Jcrespo: admin: grant user itamar analytics access [puppet] - https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: Volans)
[12:20:03] (PS1) Jcrespo: admin: Add aaron, dpifke, phedenskog to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/584915
(https://phabricator.wikimedia.org/T248797)
[12:22:36] (CR) Jcrespo: [C: -1] "Waiting for analytics tech lead approval." [puppet] - https://gerrit.wikimedia.org/r/584915 (https://phabricator.wikimedia.org/T248797) (owner: Jcrespo)
[12:23:00] !log rolling upgrade of ATS to version 8.0.6-1wm5 - T248938
[12:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:06] T248938: ATS ts_lua coredumps on config reload - https://phabricator.wikimedia.org/T248938
[12:24:22] (PS3) 4nn1l2: Enable Local upload on azbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/584913
[12:25:22] (CR) Hnowlan: [C: +2] Remove outdated PCS endpoint references [deployment-charts] - https://gerrit.wikimedia.org/r/584660 (owner: Ppchelko)
[12:28:30] (CR) RhinosF1: [C: +1] Enable Local upload on azbwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/584913 (owner: 4nn1l2)
[12:29:38] RECOVERY - traffic_server backend process restarted on cp2010 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2010&var-layer=backend
[12:31:05] (PS2) Ema: cache: use profile::lvs::realserver [puppet] - https://gerrit.wikimedia.org/r/584902
[12:32:29] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: switch TLS termination to envoy on appserver canaries [puppet] - https://gerrit.wikimedia.org/r/584904 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[12:34:02] Operations, Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (bmansurov) @elukey how can I access `http://scb1001.eqiad.wmnet:9632`? Should I be on some host to ping that URL? Also, where can I see the logs? Thanks!
[12:34:56] <_joe_> !log transitioning mw1261 to envoy
[12:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:26] PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:44:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1091 for schema change', diff saved to https://phabricator.wikimedia.org/P10833 and previous config saved to /var/cache/conftool/dbconfig/20200331-124452-marostegui.json
[12:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:10] !log Deploy schema change on db1091
[12:45:41] !log Deploy schema change on db1091
[12:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[12:46:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:09] RECOVERY - traffic_server backend process restarted on cp1081 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqiad+prometheus/ops&var-instance=cp1081&var-layer=backend
[12:49:24] <_joe_> !log switching all appserver canaries to envoy
[12:49:25] RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:03] Operations, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (WDoranWMF) @jcrespo Approved!
Thanks
[12:53:23] RECOVERY - traffic_server backend process restarted on cp2023 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2023&var-layer=backend
[12:55:55] RECOVERY - traffic_server backend process restarted on cp2013 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2013&var-layer=backend
[12:57:01] Operations, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (jcrespo) a:WDoranWMF→jcrespo
[12:58:00] Operations, Traffic: ats-tls ran out of FDs on cp1089 - https://phabricator.wikimedia.org/T248736 (Ottomata) Would love to get this fixed ASAP! Let me know if I can help!
[13:03:33] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[13:03:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:30] !log update nat on pfw3-codfw - T248906
[13:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:19] (PS1) Giuseppe Lavagetto: mediawiki: convert API canaries to use envoy [puppet] - https://gerrit.wikimedia.org/r/584921 (https://phabricator.wikimedia.org/T247389)
[13:09:23] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21634/mw1276.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/584921 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto)
[13:22:04] Operations, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger -
https://phabricator.wikimedia.org/T248922 (thcipriani) >>! In T248922#6013632, @jcrespo wrote: > Either @Thcipriani or SRE team would be the right people to own mwmaint servers, requesting his ok. Do you know if th...
[13:22:24] (CR) Ema: [C: +1] ATS: Enable TLS Session tickets in eqsin [puppet] - https://gerrit.wikimedia.org/r/584877 (https://phabricator.wikimedia.org/T245616) (owner: Vgutierrez)
[13:24:27] (CR) Vgutierrez: [C: +2] ATS: Enable TLS Session tickets in eqsin [puppet] - https://gerrit.wikimedia.org/r/584877 (https://phabricator.wikimedia.org/T245616) (owner: Vgutierrez)
[13:27:46] Operations, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (jcrespo) @holger.knust based on your written needs "To run maintenance scripts" and title "Requesting access to mwmaint1002", plus @thcipriani comment, I would suggest res...
[13:28:56] Operations, observability, Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (fgiunchedi) >>! In T247820#5995564, @Ottomata wrote: > Another example: > > `lang=yaml > chart: eventgate > app: event...
[13:31:02] !log Enable TLS Session tickets in eqsin - T245616
[13:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:08] T245616: Provide a simple and automated SSL Ticket key generation system for ATS - https://phabricator.wikimedia.org/T245616
[13:31:35] Operations, SRE-Access-Requests: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (jcrespo)
[13:33:06] (PS4) Ema: conftool::scripts: add ispooled [puppet] - https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067)
[13:33:19] (CR) Ema: conftool::scripts: add ispooled (1 comment) [puppet] - https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067) (owner: Ema)
[13:37:01] Operations, Analytics, netops: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (elukey)
[13:40:35] (PS2) Giuseppe Lavagetto: conftool-data: add "canary" faux service to appservers [puppet] - https://gerrit.wikimedia.org/r/584861
[13:40:37] (PS5) Giuseppe Lavagetto: scap: define MW canaries dynamically [puppet] - https://gerrit.wikimedia.org/r/465411
[13:43:11] (PS6) Giuseppe Lavagetto: scap: define MW canaries dynamically [puppet] - https://gerrit.wikimedia.org/r/465411
[13:46:04] (PS7) Giuseppe Lavagetto: scap: define MW canaries dynamically [puppet] - https://gerrit.wikimedia.org/r/465411
[13:46:50] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (Joe) Both appserver and api canaries now use TLS termination.
[13:49:04] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for tarrow - https://phabricator.wikimedia.org/T248498 (10jcrespo) [13:50:13] (03PS1) 10Vgutierrez: site: Reimage cp2036 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584927 (https://phabricator.wikimedia.org/T248816) [13:52:38] 10Operations, 10Traffic: traffic_server crash upon Lua reload: attempt to concatenate a table value - https://phabricator.wikimedia.org/T242952 (10Vgutierrez) https://github.com/apache/trafficserver/pull/6571 could be handy to tune some TS lua aspects [13:53:21] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2036 as cache::upload [puppet] - 10https://gerrit.wikimedia.org/r/584927 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [13:54:28] may anyone please merge a puppet patch for CI please? That is for the WMCS instances only (no prod). I need the "acl" debian package on the hosts [13:54:32] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/583392/ ;) [13:54:39] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2036.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... 
[13:55:27] (03PS1) 10Elukey: Enable HDFS ACLs on Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/584929 (https://phabricator.wikimedia.org/T246755) [13:56:16] (03CR) 10Elukey: [C: 03+2] Enable HDFS ACLs on Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/584929 (https://phabricator.wikimedia.org/T246755) (owner: 10Elukey) [13:58:11] (03PS3) 10Jcrespo: admin: convert user itamar from ldap to shell [puppet] - 10https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: 10Volans) [13:58:13] (03PS4) 10Jcrespo: admin: grant user itamar analytics access [puppet] - 10https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: 10Volans) [13:58:15] (03PS2) 10Jcrespo: admin: Add aaron, dpifke, phedenskog to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/584915 (https://phabricator.wikimedia.org/T248797) [13:58:17] (03PS1) 10Jcrespo: admin: Add holger to restricted group to run maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/584932 (https://phabricator.wikimedia.org/T248922) [14:00:11] (03PS1) 10Jbond: override manifests dir: allow passing manifest_dir to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/584934 (https://phabricator.wikimedia.org/T248689) [14:00:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to mwmaint1002.eqiad.wmnet for holger - https://phabricator.wikimedia.org/T248922 (10jcrespo) [14:01:33] (03CR) 10jerkins-bot: [V: 04-1] override manifests dir: allow passing manifest_dir to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/584934 (https://phabricator.wikimedia.org/T248689) (owner: 10Jbond) [14:02:09] (03CR) 10Jbond: [C: 03+1] "lgtm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/584613 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [14:03:52] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad 
on cloudelastic1001 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [14:04:20] (03PS2) 10Jbond: override manifests dir: allow passing manifest_dir to compile function [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/584934 (https://phabricator.wikimedia.org/T248689) [14:05:29] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw125[4-8].eqiad.wmnet [14:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:00] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/584932 (https://phabricator.wikimedia.org/T248922) (owner: 10Jcrespo) [14:07:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:10:04] PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:15] damn.. the downtime expired... [14:10:24] ^^ that's me upgrading kernels :) [14:10:44] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:10:44] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:16] RECOVERY - Host cp4024 is UP: PING OK - Packet loss = 0%, RTA = 74.70 ms [14:12:16] alright, thanks [14:14:46] mutante: BTW, one host more and you can continue with the ATS changes [14:14:57] vgutierrez: ok, great. 
thx [14:15:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1091 after schema change', diff saved to https://phabricator.wikimedia.org/P10834 and previous config saved to /var/cache/conftool/dbconfig/20200331-141459-marostegui.json [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:08] i have like 3. but maybe tomorrow earlier [14:15:13] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (10jcrespo) [14:15:51] mutante: ack [14:17:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [14:20:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:23] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) Icinga downtime for 2:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: decom ` mw[1254-1258].eqiad.wmnet ` [14:21:42] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw125[4-8].eqiad.wmnet [14:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:07] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2036.codfw.wmnet'] ` Of which those **FAILED**: ` 
['cp2036.codfw.wmnet'] ` [14:26:21] 10Operations, 10Analytics, 10netops: Move netflow data to Eventgate Analytics - https://phabricator.wikimedia.org/T248865 (10Ottomata) > We'll need to move current hive data to the new location in the event database You might end up just producing to a new topic name anyway. The topic name will eventually m... [14:33:41] (03CR) 10Dzahn: [C: 03+2] "package is already installed on contint1001/2001 and per "cherry-picked"" [puppet] - 10https://gerrit.wikimedia.org/r/583392 (https://phabricator.wikimedia.org/T210271) (owner: 10Hashar) [14:43:15] !log pool cp2036 - T248816 [14:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:21] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [14:44:49] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10Ottomata) > e.g. golang services), and we seem to be fine without it? If they were using statsd, they'd wouldn't be fin... [14:49:14] PROBLEM - Host cp4025 is DOWN: PING CRITICAL - Packet loss = 100% [14:50:45] RECOVERY - Host cp4025 is UP: PING OK - Packet loss = 0%, RTA = 74.50 ms [14:51:17] jouncebot: next [14:51:18] In 0 hour(s) and 8 minute(s): Enable new DiscussionTools beta feature (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1500) [14:51:24] Good-o. [14:57:32] mutante: all done, feel free to merge them [14:57:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics for ItamarWMDE - https://phabricator.wikimedia.org/T248482 (10Volans) SSH key confirmed with Itamar on a side channel. This can proceed for me. [14:58:05] (03CR) 10Volans: "SSH key confirmed with Itamar on a side channel. This can proceed for me." 
[puppet] - 10https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: 10Volans) [14:59:39] (03PS1) 10Andrew Bogott: cloud-vps: set a default for profile::base::firewall::block_abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/584951 [15:00:04] James_F: I, the Bot under the Fountain, allow thee, The Deployer, to do Enable new DiscussionTools beta feature deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1500). [15:01:09] !log cr2-eqiad: commit flex-flow-sizing T248394 [15:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:15] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [15:01:37] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: set a default for profile::base::firewall::block_abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/584951 (owner: 10Andrew Bogott) [15:02:11] (03PS8) 10Jforrester: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [15:02:16] (03CR) 10Jforrester: [C: 03+2] "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [15:05:23] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [15:05:28] 10Operations, 10Puppet, 10User-jbond: Add CI check to ensure defaults exist in cloud.yaml - https://phabricator.wikimedia.org/T248994 (10jbond) [15:05:37] 10Operations, 10Puppet, 10User-jbond: Add CI check to ensure defaults exist in cloud.yaml - https://phabricator.wikimedia.org/T248994 (10jbond) p:05Triage→03Medium [15:05:49] !log cr1-eqiad: commit flex-flow-sizing T248394 [15:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:14] (03Merged) 10jenkins-bot: Enable DiscussionTools as a beta feature on four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579337 (https://phabricator.wikimedia.org/T245794) (owner: 10Bartosz Dziewoński) [15:14:19] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T245794 Enable DiscussionTools as a beta feature on four wikis (duration: 01m 00s) [15:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:26] T245794: Enable config flag to make Replying v1.0 available as Beta Feature (on target wikis) - https://phabricator.wikimedia.org/T245794 [15:15:25] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 00m 58s) [15:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] (03PS1) 10Vgutierrez: site: Reimage cp2037 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584953 (https://phabricator.wikimedia.org/T248816) [15:16:50] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2037 as cache::text [puppet] - 10https://gerrit.wikimedia.org/r/584953 (https://phabricator.wikimedia.org/T248816) (owner: 10Vgutierrez) [15:18:22] 10Operations, 
10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` cp2037.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage... [15:21:59] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Vgutierrez) [15:23:40] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2010 [puppet] - 10https://gerrit.wikimedia.org/r/584954 (https://phabricator.wikimedia.org/T249002) [15:25:48] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2010 [puppet] - 10https://gerrit.wikimedia.org/r/584954 (https://phabricator.wikimedia.org/T249002) (owner: 10Vgutierrez) [15:26:47] !log depool & decommission cp2010 - T249002 [15:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:53] T249002: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 [15:27:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:27:06] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [15:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:36] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for 
hosts: `cp2010.codfw.wmnet` - cp2010.codfw.wmnet (**PASS**) - Downtimed h... [15:29:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [15:29:37] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:43] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) Icinga downtime for 1 day, 0:00:00 set by dzahn@cumin1001 on 5 host(s) and their services with reason: decom ` mw[1254-1258].eqiad.wmnet ` [15:31:13] (03PS1) 10Vgutierrez: Remove cp2010 entries [dns] - 10https://gerrit.wikimedia.org/r/584956 (https://phabricator.wikimedia.org/T249002) [15:32:09] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2010 entries [dns] - 10https://gerrit.wikimedia.org/r/584956 (https://phabricator.wikimedia.org/T249002) (owner: 10Vgutierrez) [15:33:11] 10Operations, 10netops, 10Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) [15:33:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [15:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:35:08] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2010.codfw.wmnet - https://phabricator.wikimedia.org/T249002 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:35:31] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [15:35:34] 
!log decom mw1254 through mw1258 (last remaining old servers in rack D5, depooled a while ago and average response time is again under 200ms) T247780 [15:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:40] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [15:35:40] T247780: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 [15:36:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:36:22] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) 05Stalled→03Open [15:36:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:40] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1254-1258].eqiad.wmnet` - mw1254.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... 
[15:37:01] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10Vgutierrez) [15:38:38] (03PS1) 10Marostegui: control-mariadb-10.4: Increase package version [software] - 10https://gerrit.wikimedia.org/r/584958 (https://phabricator.wikimedia.org/T248957) [15:39:01] (03PS1) 10Vgutierrez: site,install_server: Decommission cp2014 [puppet] - 10https://gerrit.wikimedia.org/r/584959 (https://phabricator.wikimedia.org/T249009) [15:39:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:26] (03PS3) 10Dzahn: decom mw1254 through mw1258, remaining rack D5 appservers [puppet] - 10https://gerrit.wikimedia.org/r/583575 (https://phabricator.wikimedia.org/T247780) [15:40:40] (03CR) 10Dzahn: [C: 03+2] decom mw1254 through mw1258, remaining rack D5 appservers [puppet] - 10https://gerrit.wikimedia.org/r/583575 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [15:40:48] (03CR) 10Urbanecm: [C: 03+1] "LGTM, thanks for the fix!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584579 (https://phabricator.wikimedia.org/T248844) (owner: 10Gergő Tisza) [15:41:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:41:39] (03CR) 10Urbanecm: [C: 03+1] "Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/584183 (https://phabricator.wikimedia.org/T235964) (owner: 10Gergő Tisza) [15:42:32] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584133 (https://phabricator.wikimedia.org/T241181) (owner: 10Gergő Tisza) [15:43:06] !log depool & decommission cp2014 - T249009 [15:43:16] (03CR) 10Vgutierrez: [C: 03+2] site,install_server: Decommission cp2014 [puppet] - 10https://gerrit.wikimedia.org/r/584959 (https://phabricator.wikimedia.org/T249009) (owner: 10Vgutierrez) [15:43:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:43:54] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [15:43:55] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:44:07] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2037.codfw.wmnet'] ` Of which those **FAILED**: ` ['cp2037.codfw.wmnet'] ` [15:44:10] (03CR) 10Urbanecm: [C: 04-1] "robots.txt is global. You should use `wgNamespaceRobotPolicies` for that in IS.php." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [15:44:27] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:45:06] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.decommission [15:45:14] (03PS6) 10Addshore: Beta commons: Remove custom wmgWikibaseRepoForeignRepositories setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569207 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [15:45:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:45:43] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, and 2 others: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by vgutierrez@cumin1001 for hosts: `cp2014.codfw.wmnet` - cp2014.codfw.wmnet (**PASS**) - Downtimed h... [15:47:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:48:03] (03PS3) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) [15:48:09] (03PS4) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) [15:48:24] (03PS1) 10Vgutierrez: Remove cp2014 entries [dns] - 10https://gerrit.wikimedia.org/r/584962 (https://phabricator.wikimedia.org/T249009) [15:48:33] (03CR) 10Zoranzoki21: "> robots.txt is global. 
You should use `wgNamespaceRobotPolicies` for" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [15:49:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:49:42] 10Operations, 10ops-codfw, 10DC-Ops, 10Traffic, 10decommission: decommission cp2014.codfw.wmnet - https://phabricator.wikimedia.org/T249009 (10Vgutierrez) a:05Vgutierrez→03Papaul [15:50:20] 10Operations, 10Traffic, 10Patch-For-Review: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [15:50:42] (03PS1) 10Dzahn: replace DCHP relays with new installservers [homer/public] - 10https://gerrit.wikimedia.org/r/584963 (https://phabricator.wikimedia.org/T224576) [15:50:52] (03PS4) 10Dzahn: decom mw1254 through mw1258, remaining rack D5 appservers [puppet] - 10https://gerrit.wikimedia.org/r/583575 (https://phabricator.wikimedia.org/T247780) [15:50:53] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:35] (03PS1) 10KartikMistry: apertium-pt-gl: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/584965 (https://phabricator.wikimedia.org/T247585) [15:53:06] (03CR) 10Vgutierrez: [C: 03+2] Remove cp2014 entries [dns] - 10https://gerrit.wikimedia.org/r/584962 (https://phabricator.wikimedia.org/T249009) (owner: 10Vgutierrez) [15:53:56] 10Operations, 10SRE-tools: Evaluate options for non-root operations with cumin and spicerack cookbooks - https://phabricator.wikimedia.org/T244840 (10MoritzMuehlenhoff) [15:54:10] (03PS2) 10KartikMistry: apertium-pt-gl: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/584965 (https://phabricator.wikimedia.org/T247585) [15:55:36] jouncebot: now [15:55:36] For the next 0 hour(s) and 4 minute(s): Enable new DiscussionTools beta feature (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1500) [15:55:45] jouncebot: next [15:55:45] In 0 hour(s) and 4 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1600) [15:56:09] (03CR) 10Dzahn: "should be coordinated with https://gerrit.wikimedia.org/r/c/operations/puppet/+/569684" [homer/public] - 10https://gerrit.wikimedia.org/r/584963 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [15:56:10] looks like no one is in mediawiki-config so I'm going to do a beta config patch [15:56:18] (03CR) 10Addshore: [C: 03+2] Beta commons: Remove custom wmgWikibaseRepoForeignRepositories setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569207 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [15:56:21] (03CR) 10Dzahn: "should be coordinated with https://gerrit.wikimedia.org/r/c/operations/homer/public/+/584963" [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: 
10Dzahn) [15:57:32] (03CR) 10Thiemo Kreuz (WMDE): robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [15:57:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:15] (03PS6) 10Addshore: Beta cluster: remove custom wmgWikibaseClientRepositories settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569208 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [15:59:12] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) All mw servers in the [https://netbox.wikimedia.org/dcim/racks/39/ rack D5] are now decom'ed. There are a few non-mw servers in that rack that were unaffected but besides those... [15:59:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:00:01] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2056 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1600). [16:00:05] James_F and Krinkle: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:59] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (10Nuria) Approving on my end as these are members of performance team who require navigationtiming data for their daily work.... [16:01:19] (03Merged) 10jenkins-bot: Beta commons: Remove custom wmgWikibaseRepoForeignRepositories setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569207 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [16:01:44] Hey. [16:02:30] (03CR) 10Addshore: [C: 03+1] admin: convert user itamar from ldap to shell [puppet] - 10https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: 10Volans) [16:02:39] (03CR) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [16:02:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:02:45] (03CR) 10Addshore: [C: 03+1] admin: grant user itamar analytics access [puppet] - 10https://gerrit.wikimedia.org/r/584895 (https://phabricator.wikimedia.org/T248482) (owner: 10Volans) [16:03:26] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [16:04:30] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10colewhite) @akosiaris in [[ https://github.com/wikimedia/service-runner/pull/227 | this PR ]], we 
selectively disable t... [16:09:55] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:09] <_joe_> James_F: I'll be here in a minute [16:10:15] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:11:04] (03CR) 10Thiemo Kreuz (WMDE): robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [16:13:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:13:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: Make canary logstash dashboard link more like reality [puppet] - 10https://gerrit.wikimedia.org/r/582113 (https://phabricator.wikimedia.org/T247005) (owner: 10Jforrester) [16:14:22] No rush. [16:15:03] !log pool cp2037 - T248816 [16:15:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: Sync logstash_checker.py canary query with current dashboard [puppet] - 10https://gerrit.wikimedia.org/r/582153 (https://phabricator.wikimedia.org/T247113) (owner: 10Krinkle) [16:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:09] T248816: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 [16:15:45] 10Operations, 10Traffic: Replace cp20[01-26] with cp20[27-42] - https://phabricator.wikimedia.org/T248816 (10Vgutierrez) [16:16:05] <_joe_> James_F: do you want me to run puppet on deploy1001? [16:16:25] <_joe_> so that you can verify everything's ok with scap? 
[16:16:34] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:17:01] <_joe_> akosiaris: around? I think citoid needs to be looked at [16:18:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:18:21] <_joe_> James_F: {{done}}, you can test scap whenever you want [16:18:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:19:09] (03PS3) 10Giuseppe Lavagetto: Install docker on releases-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/474825 (https://phabricator.wikimedia.org/T208529) (owner: 10Thcipriani) [16:20:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1001/21639/releases1001.eqiad.wmnet/ The change fails because it's missing hiera settings." [puppet] - 10https://gerrit.wikimedia.org/r/474825 (https://phabricator.wikimedia.org/T208529) (owner: 10Thcipriani) [16:22:41] you're testing scap? :o [16:22:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:22:54] (03CR) 10Zoranzoki21: robots.txt: Disable indexing user (sub)pages and draft-related pages on srwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [16:22:54] <_joe_> me? nope [16:23:07] _joe_: Thanks. [16:23:18] James_F: I would like to sync IS-labs if I can? :D [16:23:41] addshore: Why? :-) [16:24:02] (03PS1) 10Volans: examples: add comments to example config [software/homer] - 10https://gerrit.wikimedia.org/r/584971 [16:24:04] (03PS1) 10Volans: config: complete test coverage [software/homer] - 10https://gerrit.wikimedia.org/r/584972 [16:24:05] <_joe_> Krinkle: merging the change you scheduled [16:24:06] addshore: you shouldn't need to sync -labs files. +2 and git fetch && git rebase should work [16:24:06] (03PS1) 10Volans: plugins: initial implementation for Netbox data [software/homer] - 10https://gerrit.wikimedia.org/r/584973 [16:24:06] cause i merged a thing changing it and historically have always synced the thing too [16:24:23] I'll just pull it for you. [16:24:25] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 00s) [16:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:33] (03CR) 10Giuseppe Lavagetto: [C: 03+2] jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - 10https://gerrit.wikimedia.org/r/582604 (https://phabricator.wikimedia.org/T245658) (owner: 10Brian Wolff) [16:24:36] _joe_: All seems well. 
[16:24:36] James_F: ack, also fine with that if that is the currently "done" thing :) [16:26:20] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:28:56] (03CR) 10Addshore: [C: 03+2] Beta cluster: remove custom wmgWikibaseClientRepositories settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569208 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [16:30:21] !log 1.35.0-wmf.26 was branched at bec758b668aaa57fc259a1d0ecf3b35340d2661b for T247773 [16:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:27] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [16:30:30] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2056 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:32:00] (03Merged) 10jenkins-bot: Beta cluster: remove custom wmgWikibaseClientRepositories settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/569208 (https://phabricator.wikimedia.org/T242087) (owner: 10WMDE-leszek) [16:32:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:32:43] James_F: ^^ also that one if you are still sshed in :D [16:32:52] Done. [16:33:02] ty [16:33:05] (My secret is that I'm always sshed in. 
;-)) [16:33:28] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:33:33] mrumhff [16:33:53] I'm currently stood in a living room with my laptop on a game box, on a chair, on a table :P [16:34:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:34:10] I'll be back at my / a desk one day! [16:34:29] * James_F has now got a studio ring light so that he's better illuminated for WFH video calls. [16:34:42] Different levels of coping. ;-) [16:37:13] addshore: Next up is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/569209/ ? :-) [16:37:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:36] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:38:54] James_F: if you're happy to rattle through them during the cautious covid time I'm okay with it (gotta wait for me to finish testing this on beta first though) [16:39:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:39:20] _joe_: thanks [16:39:54] addshore: Oh sure.
I'm going to go for a walk right now, but ping me if you could do with help/support. [16:40:07] James_F: ack, ! [16:40:17] Divergence between master code and production config is just as much a risk as switching new things on. [16:40:22] The trick is finding the balance, [16:40:50] PROBLEM - PHP opcache health on mw2178 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:41:48] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:41:54] _joe_: fyi ^ that's the opcache alert we just tuned, right? [16:42:06] (03PS3) 10Jgreen: nsca_frack_cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/583732 (https://phabricator.wikimedia.org/T247855) [16:42:08] (03PS1) 10Jgreen: nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584977 (https://phabricator.wikimedia.org/T242270) [16:42:18] <_joe_> rlazarus: yes [16:42:24] <_joe_> sigh [16:42:34] <_joe_> i'll look later [16:42:42] ack [16:44:06] (03CR) 10jerkins-bot: [V: 04-1] nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584977 (https://phabricator.wikimedia.org/T242270) (owner: 10Jgreen) [16:46:37] (03CR) 10Dwisehaupt: [C: 03+1] "Looks good. Shipit." 
[puppet] - 10https://gerrit.wikimedia.org/r/584977 (https://phabricator.wikimedia.org/T242270) (owner: 10Jgreen) [16:56:29] (03PS2) 10Jgreen: nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584977 (https://phabricator.wikimedia.org/T247855) [16:58:52] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584977 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [16:59:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:00:04] halfak and accraze: May I have your attention please! Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1700) [17:01:40] (03Abandoned) 10Jgreen: nsca_frack_cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/583732 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [17:01:50] (03Abandoned) 10Jgreen: nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584977 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [17:06:24] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:10:02] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:10:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:10:54] RECOVERY - PHP opcache health on mw2178 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:11:33] (03PS1) 10Dduvall: Group0 to 1.35.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584981 [17:18:55] (03PS1) 10Jgreen: redo of nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584985 (https://phabricator.wikimedia.org/T247855) [17:20:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:20:15] (03CR) 10Holger Knust: [C: 03+1] admin: Add holger to restricted group to run maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/584932 (https://phabricator.wikimedia.org/T248922) (owner: 10Jcrespo) [17:22:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:34:14] (03PS1) 10CRusnov: interface_automation: Make device selection a text field [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 [17:34:35] (03CR) 10jerkins-bot: [V: 04-1] interface_automation: Make device selection a text field [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 (owner: 10CRusnov) [17:38:00] !log restart elasticsearch_6@cloudelastic-chi-eqiad.service on cloudelastic1001 to see if it recovers from a trashing/gc state - T231517 [17:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:07] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [17:40:36] !log dduvall@deploy1001 Pruned MediaWiki: 1.35.0-wmf.23 (duration: 26m 51s) [17:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:24] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [17:41:56] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [17:42:09] !log dduvall@deploy1001 Started scap: testwiki to php-1.35.0-wmf.26 and rebuild l10n cache [17:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:38] (03CR) 10Dwisehaupt: [C: 03+1] "Looks ok to me. 
Looking forward to the alpha reorg." [puppet] - 10https://gerrit.wikimedia.org/r/584985 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [17:43:09] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for aaron, dpifke, phedenskog - https://phabricator.wikimedia.org/T248797 (10dpifke) Data access guidelines read and acknowledged. [17:43:35] (03CR) 10Jgreen: [C: 03+2] redo of nsca_frack.cfg.erb updates [puppet] - 10https://gerrit.wikimedia.org/r/584985 (https://phabricator.wikimedia.org/T247855) (owner: 10Jgreen) [17:50:15] (03PS2) 10CRusnov: interface_automation: Make device selection a text field [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 [17:51:27] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 (owner: 10CRusnov) [17:55:00] (03Abandoned) 10CDanis: Prepped depool of esams (just in case) [dns] - 10https://gerrit.wikimedia.org/r/583760 (owner: 10CDanis) [17:58:26] (03PS3) 10DannyS712: Don't try to grant `oathauth-enable` to `*` (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/582615 (https://phabricator.wikimedia.org/T248282) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1800) [18:01:15] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10Bstorm) [18:02:49] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 38.64 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [18:04:07] (03PS1) 10ArielGlenn: add snapshot1001 to dumps scap targets 
[dumps/scap] - 10https://gerrit.wikimedia.org/r/585004 [18:06:24] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] add snapshot1001 to dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/585004 (owner: 10ArielGlenn) [18:08:02] !log ariel@deploy1001 Started deploy [dumps/dumps@8376c62]: bring snapshot1010 up to date [18:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:02] !log ariel@deploy1001 Finished deploy [dumps/dumps@8376c62]: bring snapshot1010 up to date (duration: 00m 05s) [18:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:15] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (10Krinkle) Not sure if the same issue or not, but I'm seeing storms of "shards fail... [18:22:02] (03CR) 10Volans: [C: 03+1] "I didn't test it but it looks sane. One optional nit inline." (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 (owner: 10CRusnov) [18:24:05] (03CR) 10CRusnov: "I have tested this and it works, for what it's worth." 
(031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 (owner: 10CRusnov) [18:26:17] (03PS3) 10CRusnov: interface_automation: Make device selection a text field [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 [18:26:36] (03CR) 10CRusnov: interface_automation: Make device selection a text field (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 (owner: 10CRusnov) [18:27:30] (03CR) 10CRusnov: [C: 03+2] interface_automation: Make device selection a text field [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/584991 (owner: 10CRusnov) [18:29:23] 10Operations, 10Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10Krinkle) First impressions of the new Logstash/Kibana based on using Firefox 74 for macOS on an idle high-end MacBook Pro using a fast WiFi connection. * It is even slower to load. Hust to have the UI appear i... [18:32:32] jouncebot: next [18:32:32] In 0 hour(s) and 27 minute(s): Mediawiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1900) [18:32:37] * addshore looks forward to it [18:33:19] (03CR) 10CRusnov: "Looks good to me, not tested but seems logical." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:33:23] (03CR) 10CRusnov: [C: 03+1] sre.dns.netbox: pull the specific SHA1 [cookbooks] - 10https://gerrit.wikimedia.org/r/583676 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:40:43] (03PS4) 10ArielGlenn: add more public tables for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/527505 (https://phabricator.wikimedia.org/T226167) [18:42:48] (03CR) 10ArielGlenn: [C: 03+2] add more public tables for xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/527505 (https://phabricator.wikimedia.org/T226167) (owner: 10ArielGlenn) [18:42:49] 10Operations, 10LDAP-Access-Requests: Remove the account "Matthias Geisler" from the wmde LDAP group - https://phabricator.wikimedia.org/T248949 (10KFrancis) >>! In T248949#6014119, @jcrespo wrote: > @KFrancis I am not sure if this is the correct communication method: But would you to need to update "Matthias... [18:43:20] 10Operations, 10Cloud-Services, 10Traffic: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10MusikAnimal) [18:48:34] 10Operations, 10Cloud-Services, 10Traffic: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10MusikAnimal) [19:00:04] dduvall and longma: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T1900). [19:03:06] * James_F is around, too. 
[19:07:06] (03PS1) 10ArielGlenn: add output filename to prefetch message displayed to console [dumps] - 10https://gerrit.wikimedia.org/r/585021 [19:07:08] (03PS1) 10ArielGlenn: sort possible dump files for prefetch properly [dumps] - 10https://gerrit.wikimedia.org/r/585022 [19:07:36] (03CR) 10jerkins-bot: [V: 04-1] sort possible dump files for prefetch properly [dumps] - 10https://gerrit.wikimedia.org/r/585022 (owner: 10ArielGlenn) [19:08:41] (03PS1) 10Ssingh: Add HTTPS proxy support to cescout role [puppet] - 10https://gerrit.wikimedia.org/r/585024 (https://phabricator.wikimedia.org/T247273) [19:08:53] (03PS2) 10ArielGlenn: sort possible dump files for prefetch properly [dumps] - 10https://gerrit.wikimedia.org/r/585022 [19:13:38] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/21640/cescout1001.eqiad.wmnet/change.cescout1001.eqiad.wmnet.pson" [puppet] - 10https://gerrit.wikimedia.org/r/585024 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:14:13] (03CR) 10ArielGlenn: [C: 03+2] rename 'parts' attribute of Dump subclasses to something more accurate [dumps] - 10https://gerrit.wikimedia.org/r/577225 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [19:14:43] (03CR) 10ArielGlenn: [C: 03+2] make value of 'parts' in the file listing methods be None or a list [dumps] - 10https://gerrit.wikimedia.org/r/577226 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [19:15:02] (03CR) 10ArielGlenn: [C: 03+2] New class for output file listing methods to move them out of jobs code [dumps] - 10https://gerrit.wikimedia.org/r/577228 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [19:15:21] (03CR) 10ArielGlenn: [C: 03+2] clean up file list method docs, tighten up code [dumps] - 10https://gerrit.wikimedia.org/r/578477 (https://phabricator.wikimedia.org/T246465) (owner: 10ArielGlenn) [19:15:48] (03CR) 10ArielGlenn: [C: 03+2] fix name of option in a fixup script help message [dumps] - 
10https://gerrit.wikimedia.org/r/583289 (owner: 10ArielGlenn) [19:16:43] (03PS2) 10Ssingh: cescout: add HTTPS proxy support [puppet] - 10https://gerrit.wikimedia.org/r/585024 (https://phabricator.wikimedia.org/T247273) [19:17:16] (03CR) 10ArielGlenn: [C: 03+2] add output filename to prefetch message displayed to console [dumps] - 10https://gerrit.wikimedia.org/r/585021 (owner: 10ArielGlenn) [19:18:32] (03CR) 10ArielGlenn: [C: 03+2] sort possible dump files for prefetch properly [dumps] - 10https://gerrit.wikimedia.org/r/585022 (owner: 10ArielGlenn) [19:20:11] !log ariel@deploy1001 Started deploy [dumps/dumps@713c297]: more filelist methods cleanup, sort prefetch possible files properly [19:20:15] !log ariel@deploy1001 Finished deploy [dumps/dumps@713c297]: more filelist methods cleanup, sort prefetch possible files properly (duration: 00m 04s) [19:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:15] (03CR) 10Ssingh: [C: 03+2] cescout: add HTTPS proxy support [puppet] - 10https://gerrit.wikimedia.org/r/585024 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:22:45] (03CR) 10Ssingh: [C: 03+2] "Already reviewed by dzahn and only the commit message was changed." [puppet] - 10https://gerrit.wikimedia.org/r/585024 (https://phabricator.wikimedia.org/T247273) (owner: 10Ssingh) [19:24:41] 10Operations, 10Cloud-Services, 10Traffic: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10CDanis) I did some quick looking at analytics webrequest data and I don't see a marked increase in >29 second TTFB responses nor a marked increase... [19:30:11] (03CR) 10Jcrespo: "Will deploy tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/584894 (https://phabricator.wikimedia.org/T248482) (owner: 10Volans) [19:42:44] (03CR) 10Acamicamacaraca: [C: 03+1] "Per consensus, LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/584615 (https://phabricator.wikimedia.org/T248860) (owner: 10Zoranzoki21) [19:59:04] Lydia_WMDE: Can you look at https://gerrit.wikimedia.org/r/#/c/wikidata/query/deploy/+/585028/ ? Not sure how to get that deployed [20:04:58] !log dduvall@deploy1001 Finished scap: testwiki to php-1.35.0-wmf.26 and rebuild l10n cache (duration: 142m 48s) [20:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:14] !log a slew of "ErrorException from line 334 of /srv/mediawiki/php-1.35.0-wmf.26/includes/context/RequestContext.php: PHP Warning: Recursion detected in RequestContext::getLanguage" after group0 deployment (cc T247773) [20:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:20] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [20:09:52] marxarelli: interesting [20:11:16] 10Operations, 10Cloud-Services, 10Traffic: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10MusikAnimal) >>! In T249035#6016229, @CDanis wrote: > Would it be possible to get some packet captures of requests that failed? `tcpdump` has sup... [20:13:28] sorry, correction: that's on testwiki only. not group0 [20:14:32] !log correction: RequestContext::getLanguage errors are for testwiki deployment, pre group0 [20:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:57] (03PS1) 10Andrew Bogott: neutron: enable l3_agent_only_dmz_cidr_hack in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/585031 (https://phabricator.wikimedia.org/T247505) [20:15:57] (03CR) 10Andrew Bogott: "This needs to be announced before merging and monitored after." 
[puppet] - 10https://gerrit.wikimedia.org/r/585031 (https://phabricator.wikimedia.org/T247505) (owner: 10Andrew Bogott) [20:18:25] addshore: they all appear to be from requests made to /w/index.php?title=User:SR5/test.js&action=raw&ctype=text/javascript [20:20:20] should i rollback or leave wmf.26 on testwiki for debugging? [20:22:44] i'll rollback to lessen the log spam [20:25:10] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: ELK7 shards failed errors when loading saved objects, e.g. "field expansion matches too many fields, limit: 1024, got: 1726" - https://phabricator.wikimedia.org/T247014 (10herron) >>! In T247014#6016026, @Krinkle wrote: > Not sure if the same issue or n... [20:26:01] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: rolling back 1.35.0-wmf.26 testwiki deployment following significant increase in error rate (cc T247773) [20:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:07] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [20:26:07] (03PS4) 10CDanis: completed rollout of sensible flow-table-sizes [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) [20:28:42] (03CR) 10CDanis: completed rollout of sensible flow-table-sizes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/583740 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [20:32:42] 10Operations, 10ops-eqiad: eqiad: Re-connect cage cameras - https://phabricator.wikimedia.org/T207965 (10wiki_willy) 05Open→03Resolved Resolving this task. It's been over a year, since there was any progress on this, plus we can utilize the onsite Equinix cameras if something were to ever come up. Thanks... 
[20:47:26] 10Operations, 10Cloud-Services, 10Traffic: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10bd808) I would expect traffic from a VPS instance to be routed something like: instance local interface -> {neutron overlay network} -> cloudnet e... [20:47:59] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/577687/ should unstall the train [20:48:50] DannyS712: right on. thank you! [20:49:36] Its probably larger than most quick-fixes, because it was just a tweak to an existing patch to create a MessageFactory service [20:50:14] jouncebot: now [20:50:14] No deployments scheduled for the next 2 hour(s) and 9 minute(s) [20:50:26] I am going to restart the CI Jenkins soonish [20:51:00] would it be possible to wait a bit - I have another fix for the train that addresses James_F's concern at https://phabricator.wikimedia.org/T249045#6016579 [20:51:31] Thanks, DannyS712. [20:51:53] !log Restarting Jenkins for new CSP rules # T245658 [20:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:59] T245658: .mp4 build artifacts not viewable due to CSP in Chrome - https://phabricator.wikimedia.org/T245658 [20:53:24] DannyS712: i can wait a bit, yeah [20:53:34] of course jenkins crashes [20:53:39] (03PS2) 10Andrew Bogott: neutron: enable l3_agent_only_dmz_cidr_hack in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/585031 (https://phabricator.wikimedia.org/T247505) [20:53:41] (03PS1) 10Andrew Bogott: Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) [20:53:45] hashar: Oh no. [20:53:56] Can I help [20:54:25] https://gerrit.wikimedia.org/r/#/c/582604/ [20:54:32] too many single / double quotes of doom bah [20:54:45] Meh. [20:55:10] I note that gerrit is confused by the quoting in trying to do syntaxhighlighting. 
[20:55:20] Which is generally a sign that your code is too complex. ;-) [20:56:28] Error: Could not find or load main class default-src [20:56:29] bah [20:57:01] DannyS712: Looks plausible as a quick fix. [20:57:29] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:35] PROBLEM - jenkins_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:57:40] I am crafting a revert commit [20:57:59] James_F it might not pass the tests though, since they specify specific returns for the MessageLocalizer [20:58:02] (03PS1) 10Hashar: Revert "jenkins: Adjust CSP header to allow inline CSS and video playback" [puppet] - 10https://gerrit.wikimedia.org/r/585035 (https://phabricator.wikimedia.org/T245658) [20:58:31] DannyS712: Well, given that jenkins is down, no tests are running right now… ;-) [20:58:52] and I'm not setup to run tests locally, unfortunately - nothing to do but wait? [20:59:16] anomie: If you think it's safe enough to revert, happy to do that. 
[20:59:17] !log contint1001: manually reverted /lib/systemd/system/jenkins.service [20:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:23] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:24] jenkins is starting [20:59:29] RECOVERY - jenkins_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:59:48] (03CR) 10jerkins-bot: [V: 04-1] Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) (owner: 10Andrew Bogott) [21:00:27] James_F: I don't see anything that makes me think it wouldn't be safe. Codesearch turns up no references other than those added by the patch itself. [21:01:23] * James_F nods. [21:01:27] OK, let's try that out. [21:03:33] we would need https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/585035/ to be merged by sre/ops :/ [21:03:37] that is the revert [21:04:19] James_F if its reverted then https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/579236/ should also be reverted [21:04:47] Right. [21:05:14] (03PS2) 10Andrew Bogott: Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) [21:05:16] 10Operations, 10LDAP-Access-Requests, 10serviceops: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10Reedy) [21:06:01] Though https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/585033/ is probably simpler than a full revert [21:07:34] Yes, but without a test to demonstrate that we've fixed things, it's just assumption (likely, but…) that it fixes it. 
[21:08:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:08:15] the tests all still pass, and it should be impossible to have a call to RequestContext... if the use is removed [21:08:24] but makes sense to be better safe than sorry [21:08:44] How exactly would one test for this? [21:09:32] (03CR) 10jerkins-bot: [V: 04-1] Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - 10https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) (owner: 10Andrew Bogott) [21:09:59] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:10:21] DannyS712: That's what Daniel asked too. I guess it's early in the load from sign-in? [21:10:42] yes, looking at the logs its before the session is fully available [21:13:35] marxarelli: Your call, but I'd recommend trying the quick patch before the revert. [21:13:50] 10Operations, 10Anti-Harassment, 10SRE-Access-Requests: Grant analytics access to Anti Harassment Tools engineers - https://phabricator.wikimedia.org/T249059 (10Mooeypoo) [21:14:12] (03CR) 10Cwhite: [C: 03+2] Revert "jenkins: Adjust CSP header to allow inline CSS and video playback" [puppet] - 10https://gerrit.wikimedia.org/r/585035 (https://phabricator.wikimedia.org/T245658) (owner: 10Hashar) [21:14:40] sure, i can redeploy to testwiki after we merge and cherry-pick https://gerrit.wikimedia.org/r/c/mediawiki/core/+/585033/ [21:14:47] James_F, DannyS712 ^ [21:14:47] Kk. [21:15:12] ty; I'm working on adding tests for the MessageLocalizer service [21:15:33] Excellent. [21:17:15] James_F: the puppet change got reverted/merged/deployed :] [21:17:21] Excellent. 
[21:18:49] (PS1) Hashar: jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) [21:20:02] (CR) jerkins-bot: [V: -1] jenkins: Adjust CSP header to allow inline CSS and video playback [puppet] - https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) (owner: Hashar) [21:21:09] (CR) Hashar: [C: -1] "systemd strips some single quotes unless it has a space before or after. It seems to work on Stretch, but not on Jessie :/" [puppet] - https://gerrit.wikimedia.org/r/585038 (https://phabricator.wikimedia.org/T245658) (owner: Hashar) [21:21:37] (PS3) Andrew Bogott: Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) [21:22:57] Jenkins is all fine. I am out to bed [21:23:14] Bye hashar. [21:26:00] (CR) jerkins-bot: [V: -1] Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) (owner: Andrew Bogott) [21:29:53] James_F: you are quicker on the cherry-pick draw than i :) [21:30:00] thanks for taking care of tht [21:30:01] that [21:30:13] * James_F grins. [21:30:14] Happy to help. [21:30:18] So i take it the fix worked? [21:30:35] DannyS712: Seems like. The final proof is in production, of course. [21:30:51] marxarelli: I'll deploy the fix and then hold back for you to roll to testwiki? [21:31:13] sounds good [21:31:19] Cool. [21:43:42] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (RobH) [21:43:59] Operations, ops-eqiad: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (RobH) [21:45:41] marxarelli: Syncing now; over to you.
[21:45:42] marxarelli James_F cherry pick at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/585040/ is merged [21:45:56] James_F: great [21:46:26] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.26/includes/user/UserNameUtils.php: T249045 Use wfMessage in UserNameUtils::isUsable for now (duration: 00m 58s) [21:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:33] T249045: PHP Warning: Recursion detected in RequestContext::getLanguage - https://phabricator.wikimedia.org/T249045 [21:47:07] !log dduvall@deploy1001 Started scap: testwiki to php-1.35.0-wmf.26 (T247773) [21:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:13] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [21:52:06] oh crap. why did i do a full sync... [21:54:38] !log dduvall@deploy1001 sync aborted: testwiki to php-1.35.0-wmf.26 (T247773) (duration: 07m 31s) [21:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:44] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [21:55:37] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:00:15] marxarelli: I mean, it wouldn't break anything, but yeah. [22:00:51] (CR) Bstorm: [C: +2] Add support for redirecting to toolforge.org [software/tools-webservice] - https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis) [22:01:32] (Merged) jenkins-bot: Add support for redirecting to toolforge.org [software/tools-webservice] - https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis) [22:06:08] James_F: sync-wikiversions isn't saving much time either.
syncs are taking way too long in general these days [22:07:27] * James_F sighs. [22:07:29] Yeah. [22:07:46] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: testwiki to php-1.35.0-wmf.26 (T247773) [22:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:52] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [22:08:38] so far so good [22:08:43] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1075 is OK: HTTP OK: HTTP/1.0 200 OK - 22306 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [22:09:13] (CR) Dduvall: [C: +2] Group0 to 1.35.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/584981 (owner: Dduvall) [22:10:42] (Merged) jenkins-bot: Group0 to 1.35.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/584981 (owner: Dduvall) [22:12:30] addshore: Heads-up that we're about to deploy to group0 if you want to test T249018.
[22:12:31] T249018: https://wikidata.beta.wmflabs.org/wiki/Q64 consitantly doesn't load on beta wikidata - https://phabricator.wikimedia.org/T249018 [22:13:08] (PS10) Mstyles: kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [22:13:31] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.26 [22:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:46] (CR) jerkins-bot: [V: -1] kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles) [22:17:30] (PS1) Bstorm: d/changelog: prepare for 0.65 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/585045 [22:18:39] !log group0 to 1.35.0-wmf.26 (T247773); no rise in error rates following redeployment [22:18:56] (PS2) Bstorm: d/changelog: prepare for 0.65 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/585045 [22:19:10] marxarelli: Failed to log message to wiki. Somebody should check the error logs. [22:19:54] !log group0 to 1.35.0-wmf.26 (T247773); no rise in error rates following redeployment [22:23:08] T247773: 1.35.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T247773 [22:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:11] (PS1) Bstorm: toolforge: update the package_builder role so it can be neatly used [puppet] - https://gerrit.wikimedia.org/r/585047 [22:36:56] (PS4) Andrew Bogott: Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) [22:36:57] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old.
https://wikitech.wikimedia.org/wiki/Mirrors [22:39:07] (CR) Bstorm: [C: +1] "Looks good! PCC results https://puppet-compiler.wmflabs.org/compiler1003/21642/" [puppet] - https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) (owner: Arturo Borrero Gonzalez) [22:42:19] (CR) jerkins-bot: [V: -1] Openstack Neutron: add neutron l3 hacks for Rocky [puppet] - https://gerrit.wikimedia.org/r/585034 (https://phabricator.wikimedia.org/T248635) (owner: Andrew Bogott) [22:43:25] (CR) Bstorm: [C: +2] "Cutting the release to test in toolsbeta" [software/tools-webservice] - https://gerrit.wikimedia.org/r/585045 (owner: Bstorm) [22:46:34] (CR) Bstorm: "This will be replacing https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/master/tools/tools-package-builder.roles wit" [puppet] - https://gerrit.wikimedia.org/r/585047 (owner: Bstorm) [22:46:46] (Merged) jenkins-bot: d/changelog: prepare for 0.65 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/585045 (owner: Bstorm) [22:56:56] (PS2) Ppchelko: Remove outdated PCS endpoint references [deployment-charts] - https://gerrit.wikimedia.org/r/584660 [22:57:11] (CR) Ppchelko: [C: +2] "needed a rebase I guess" [deployment-charts] - https://gerrit.wikimedia.org/r/584660 (owner: Ppchelko) [22:57:28] (Merged) jenkins-bot: Remove outdated PCS endpoint references [deployment-charts] - https://gerrit.wikimedia.org/r/584660 (owner: Ppchelko) [23:00:05] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200331T2300). [23:00:05] DannyS712: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:55] I'll do the SWAT [23:01:21] (PS11) Mstyles: kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) [23:03:08] hmm DannyS712 isn't here [23:03:19] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:05:09] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:05:27] (CR) jerkins-bot: [V: -1] kibana: move httpd proxy authentication to a separate profile [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles) [23:05:53] (CR) Mstyles: "> Patch Set 11: Verified-1" [puppet] - https://gerrit.wikimedia.org/r/583414 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles) [23:06:28] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Bstorm) @Krenair puppet is broken in toolsbeta with ` Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed... [23:07:43] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Bstorm) Ah puppetdb service isn't working. That I might be able to fix. [23:09:53] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Bstorm) restarting the service got it running, checking some things.
[23:16:58] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Bstorm) ` Mar 31 19:53:09 toolsbeta-puppetdb-02 kernel: [1624858.205612] oom_reaper: reaped process 17294 (java), now anon-rss:0kB, file-rss:0kB, shme... [23:19:09] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Krenair) This is probably the OOM problem that's been affecting deployment-prep. I think I made a task for that somewhere... [23:20:22] Operations, cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (Bstorm) This is an m1.small. Maybe the instance size is just too low for recent versions?