[00:00:07] (CR) Catrope: [C: +2] Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) (owner: RhinosF1) [00:01:01] (Merged) jenkins-bot: Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) (owner: RhinosF1) [00:01:21] RoanKattouw: shout when it's somewhere! [00:02:47] Syncing now [00:03:47] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set sitename and meta namespace localizations for tiwiki and tiwiktionary (T251287) (duration: 01m 06s) [00:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:50] T251287: Localised sitenames/namespaces for ti.wikipedia and ti.wiktionary - https://phabricator.wikimedia.org/T251287 [00:04:21] RoanKattouw: once ns dupes is ran - {{done}} [00:04:48] @James_F i can slim them down to LESS, JS and mustache if that works for you - there's an rfc for that! https://phabricator.wikimedia.org/T249673 :) [00:05:01] * James_F grins. [00:05:08] Shockingly they both found 0 [00:05:23] !log Ran namespaceDupes.php on tiwiki and tiwiktionary for T251287 [00:05:25] RoanKattouw: ha! [00:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:42] Alright, well, that only took two hours :) [00:06:46] RoanKattouw: That wasn't too bad, We'll see what I say in the morning when I realise what 6 hours sleep does to me! [00:07:02] James_F, Jdlrobson: sorry for any delay I caused, we're done here. [00:07:04] * RhinosF1 to bed [00:16:40] RhinosF1: goodnight! [00:30:26] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:59] (PS1) Papaul: DNS: Add mgmt and production DNS for kubestage200[1-2], kubernetes200[7-14] [dns] - https://gerrit.wikimedia.org/r/597403 [00:42:13] Question regarding potential security bug in abusefilter (either a bug or me missing something really obvious, I'm fairly sure it's the latter) - anyone familiar with the extension enough to help? [00:47:18] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:59:13] DannyS712: You say that like a lot of people know lots about AF ;) [00:59:44] yeah, I realized I was being dump [00:59:47] *dumb [01:00:26] qqx revealed that "Revoke the user's autoconfirmed status" isn't "degroup" but "blockautopromote" which is confusing (I thought the bug was that degroup was available) [01:01:13] but I found another bug that is confirmed [01:18:41] Report security bugs in Phabricator please [01:18:48] don't tell it here [01:19:44] Yeah, I know, I wanted someone to check if it was indeed a bug or not (it wasn't, like I suspected I missed something) [01:19:59] but also https://phabricator.wikimedia.org/T253181 [02:34:05] (PS3) Herron: lvs::monitor: expand icinga service descriptions [puppet] - https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) [02:52:27] (PS1) Andrew Bogott: Remove most (but not all) nfs mounts from the wikidata-dev project. [puppet] - https://gerrit.wikimedia.org/r/597413 (https://phabricator.wikimedia.org/T208416) [03:40:59] Operations, LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (Nuria) Is @dcipoletti a WMF employee?
[04:09:16] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:09:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:00:30] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:06:22] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 45 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [05:06:26] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [05:18:08] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:31:42] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (hashar) We now have Jenkins pinned to java8 so...
[05:44:32] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:51:55] (CR) Hashar: "Nowadays 'jq' is provided everywhere through base::standard_packages so we can safely slice it out of this manifest." [puppet] - https://gerrit.wikimedia.org/r/596833 (https://phabricator.wikimedia.org/T252955) (owner: Krinkle) [06:21:01] Operations, LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (Dzahn) a:Dzahn [06:24:51] Operations, LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (Dzahn) Open→Resolved Thanks @KFrancis @QChris I added you to the nda group. You should now be able to login to Icinga. [06:29:28] (CR) Dzahn: [C: +1] lvs::monitor: expand icinga service descriptions [puppet] - https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: Herron) [06:30:22] (PS8) Dzahn: monitoring: add data types to monitoring::service [puppet] - https://gerrit.wikimedia.org/r/551882 [06:31:30] (CR) jerkins-bot: [V: -1] monitoring: add data types to monitoring::service [puppet] - https://gerrit.wikimedia.org/r/551882 (owner: Dzahn) [06:33:46] (CR) Giuseppe Lavagetto: monitoring: add data types to monitoring::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551882 (owner: Dzahn) [07:03:20] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 64 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:04:57] (CR) Muehlenhoff: "It's a fair point, we can probably in fact simply do it fleet-wide, this will always fail on
HDFS and the performance impact is virtually " [puppet] - https://gerrit.wikimedia.org/r/597298 (owner: Muehlenhoff) [07:06:17] Operations, netops: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (Joe) Looking at kafka, it seems there is a bizarre pattern in producing the data to the "netflow" topic: https://grafana.wikimedia.org/d/000000234/kafka-by-topic?panelId=34&fullscree... [07:06:40] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 54 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:06:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:08:39] (PS1) Marostegui: mariadb: Place db1142 into s4 [puppet] - https://gerrit.wikimedia.org/r/597468 (https://phabricator.wikimedia.org/T252512) [07:09:33] (CR) Marostegui: [C: +2] mariadb: Place db1142 into s4 [puppet] - https://gerrit.wikimedia.org/r/597468 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui) [07:10:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1084 to clone db1142 T252512', diff saved to https://phabricator.wikimedia.org/P11241 and previous config saved to /var/cache/conftool/dbconfig/20200520-071010-marostegui.json [07:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:15] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [07:21:51] <_joe_> XioNoX: have you seen the ripe atlas issues to eqiad/codfw?
[07:22:05] nop, looking [07:22:30] <_joe_> it looks like quite a few locations in APAC can't reach codfw or eqiad [07:23:13] it's visible there too https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?panelId=4&fullscreen&orgId=1&from=now-3h&to=now [07:24:24] not much we can do though, it's a good FYI in case there is something worse happening [07:24:49] !log install systemd security updates [07:24:50] those alerts have always been on the line between useful and not useful... [07:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:53] Operations, ops-codfw: BBU faulty on ms-be2016 - https://phabricator.wikimedia.org/T252851 (fgiunchedi) @Papaul I see, please let's procure a new BBU, we can't really decom the server yet [07:31:13] Operations, Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Dzahn) >>! In T252932#6150527, @MBeat33 wrote: Hi @MBeat33, i will reply inline. > The issue is related to large fundraising email sends that originate from the jimmy@wikimedia.org Google Group. That t...
[07:37:51] (PS9) Dzahn: monitoring: add data types to monitoring::service [puppet] - https://gerrit.wikimedia.org/r/551882 [07:38:54] (CR) Dzahn: monitoring: add data types to monitoring::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551882 (owner: Dzahn) [07:38:56] (CR) jerkins-bot: [V: -1] monitoring: add data types to monitoring::service [puppet] - https://gerrit.wikimedia.org/r/551882 (owner: Dzahn) [07:41:03] !log alter table categorylinks engine=Innodb ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8,force on all labsdb1011 wikis - T249188 [07:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:07] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 [07:43:34] (Abandoned) Apakhomov: Resolved merge conflicts in several files [deployment-charts] - https://gerrit.wikimedia.org/r/595177 (owner: Apakhomov) [07:46:15] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (Dzahn) >>! In T224591#6150943, @hashar wrote:... [07:46:41] (CR) Giuseppe Lavagetto: [C: +1] "Thanks a lot, I somehow never got around to fix that post-transition!"
[puppet] - https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: Herron) [07:57:41] Operations, Traffic, netops: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi) p:Triage→Medium [07:58:50] (PS3) JMeybohm: tls_helper: fix the envoy config configmap [deployment-charts] - https://gerrit.wikimedia.org/r/597303 (https://phabricator.wikimedia.org/T235411) [08:03:35] (PS1) Ema: varnish: new stap script post_body.stp [puppet] - https://gerrit.wikimedia.org/r/597471 [08:05:22] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1010.eqiad.wmnet ` The log can be found in `/var/lo... [08:11:39] (CR) Giuseppe Lavagetto: [C: -1] "I left a comment in the code, but after thinking a bit about it, I think we're just perpetrating a wrong approach:" (1 comment) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/597322 (owner: Cwhite) [08:12:21] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [08:13:57] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [08:14:01] (CR) Giuseppe Lavagetto: [C: +1] envoy: Don't try to create a envoy config if it already exists [docker-images/production-images] - https://gerrit.wikimedia.org/r/597305 (https://phabricator.wikimedia.org/T244843) (owner: JMeybohm) [08:18:19] (CR) Giuseppe Lavagetto: [C: +1] tls_helper: fix the envoy config configmap [deployment-charts] - https://gerrit.wikimedia.org/r/597303
(https://phabricator.wikimedia.org/T235411) (owner: JMeybohm) [08:18:51] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1011.eqiad.wmnet ` The log can be found in `/var/lo... [08:19:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [08:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:02] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1012.eqiad.wmnet ` The log can be found in `/var/lo... [08:21:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:24] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1010.eqiad.wmnet'] ` and were **ALL** successful. [08:28:30] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) @RobH Remote IPMI was disabled on these hosts which popped up when i tried to run the reimage cookbook (to change software RAID level from 1 to 5) and it... [08:30:17] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1013.eqiad.wmnet ` The log can be found in `/var/lo...
[08:31:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [08:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:57] Operations, Traffic, netops: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (ayounsi) [08:32:04] Operations, Traffic, netops, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi) [08:33:05] Operations, Traffic: Implement a prometheus exporter for rdkafka in golang - https://phabricator.wikimedia.org/T253197 (ema) [08:33:18] <_joe_> !log disabling puppet on mw1266-1275 for migration to envoy [08:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [08:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:29] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1011.eqiad.wmnet'] ` Of which those **FAILED**: ` ['ganeti1011.eqiad.wmnet'] ` [08:35:25] Operations, Analytics, Traffic, Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (ema) Open→Resolved a:ema Closing this task now given that an initial version of `atskafka` has been created and deployed. Further improvements such as T2...
[08:36:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:57] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1014.eqiad.wmnet ` The log can be found in `/var/lo... [08:39:03] (PS1) Vgutierrez: ATS: Disable KA for POST requests on esams [puppet] - https://gerrit.wikimedia.org/r/597473 (https://phabricator.wikimedia.org/T249335) [08:41:32] (PS2) JMeybohm: envoy: Don't try to create a envoy config if it already exists [docker-images/production-images] - https://gerrit.wikimedia.org/r/597305 (https://phabricator.wikimedia.org/T244843) [08:41:45] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1012.eqiad.wmnet'] ` and were **ALL** successful.
[08:42:17] (CR) JMeybohm: [V: +2 C: +2] envoy: Don't try to create a envoy config if it already exists [docker-images/production-images] - https://gerrit.wikimedia.org/r/597305 (https://phabricator.wikimedia.org/T244843) (owner: JMeybohm) [08:42:23] !log Remove bogons4 for policy options on all routers - gerrit 597272 [08:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:27] (CR) Ayounsi: [C: +2] Remove bogons4 for policy options [homer/public] - https://gerrit.wikimedia.org/r/597272 (owner: Ayounsi) [08:42:48] (Merged) jenkins-bot: Remove bogons4 for policy options [homer/public] - https://gerrit.wikimedia.org/r/597272 (owner: Ayounsi) [08:43:30] (CR) Vgutierrez: "pcc seems happy: https://puppet-compiler.wmflabs.org/compiler1002/22612/" [puppet] - https://gerrit.wikimedia.org/r/597473 (https://phabricator.wikimedia.org/T249335) (owner: Vgutierrez) [08:43:53] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1015.eqiad.wmnet ` The log can be found in `/var/lo... [08:44:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [08:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:18] (CR) JMeybohm: [C: +2] tls_helper: fix the envoy config configmap [deployment-charts] - https://gerrit.wikimedia.org/r/597303 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm) [08:44:45] (Merged) jenkins-bot: tls_helper: fix the envoy config configmap [deployment-charts] - https://gerrit.wikimedia.org/r/597303 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm) [08:45:03] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/22613/ looks good, merging."
[puppet] - https://gerrit.wikimedia.org/r/597243 (https://phabricator.wikimedia.org/T247389) (owner: Giuseppe Lavagetto) [08:45:38] Operations: Integrate Buster 10.4 point update - https://phabricator.wikimedia.org/T252394 (MoritzMuehlenhoff) [08:46:20] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1016.eqiad.wmnet ` The log can be found in `/var/lo... [08:46:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:48] (CR) Dzahn: [C: +2] add malmok.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/597295 (https://phabricator.wikimedia.org/T253024) (owner: Dzahn) [08:48:51] (PS2) Dzahn: add malmok.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/597295 (https://phabricator.wikimedia.org/T253024) [08:49:33] <_joe_> !log converting mw1266-1275 to use envoy T247389 [08:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:36] T247389: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 [08:51:23] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1013.eqiad.wmnet'] ` and were **ALL** successful. [08:51:41] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1017.eqiad.wmnet ` The log can be found in `/var/lo...
[08:52:38] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [08:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [08:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:22] Operations, Traffic, vm-requests, Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (Dzahn) ` Ready to create Ganeti VM malmok.codfw.wmnet in the ganeti01.svc.codfw.wmnet cluster on row A with 2 vCPUs, 8GB of RAM, 30GB of disk in the private network. ` [08:55:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [08:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1142 with minimum weight for the first time T252512', diff saved to https://phabricator.wikimedia.org/P11245 and previous config saved to /var/cache/conftool/dbconfig/20200520-085757-marostegui.json [08:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:02] T252512: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 [08:58:10] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 61 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:59:05] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1018.eqiad.wmnet ` The log can
be found in `/var/lo... [08:59:08] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1014.eqiad.wmnet'] ` and were **ALL** successful. [09:00:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:28] (PS1) Marostegui: db1142: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/597475 (https://phabricator.wikimedia.org/T252512) [09:01:18] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [09:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:35] (CR) Marostegui: [C: +2] db1142: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/597475 (https://phabricator.wikimedia.org/T252512) (owner: Marostegui) [09:02:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:52] (PS3) Muehlenhoff: Add a define for creating a system user using systemd-sysusers [puppet] - https://gerrit.wikimedia.org/r/597265 [09:03:56] (CR) jerkins-bot: [V: -1] Add a define for creating a system user using systemd-sysusers [puppet] - https://gerrit.wikimedia.org/r/597265 (owner: Muehlenhoff) [09:04:26] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Dzahn) @RobH @cmjohnson I noticed by chance there are more ganeti machines beyond ganeti1018. ganeti1019-ganeti1022 are in netbox but i don't see a racking tick...
[09:04:55] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 78 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:05:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:05] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:06:17] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1015.eqiad.wmnet'] ` and were **ALL** successful. [09:06:32] (PS4) Muehlenhoff: Add a define for creating a system user using systemd-sysusers [puppet] - https://gerrit.wikimedia.org/r/597265 [09:06:45] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1016.eqiad.wmnet'] ` and were **ALL** successful.
[09:07:05] (PS1) Jcrespo: mariadb: Temporarilly add db1141 to the list of special hosts [puppet] - https://gerrit.wikimedia.org/r/597476 (https://phabricator.wikimedia.org/T249188) [09:07:12] (CR) jerkins-bot: [V: -1] Add a define for creating a system user using systemd-sysusers [puppet] - https://gerrit.wikimedia.org/r/597265 (owner: Muehlenhoff) [09:08:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:08:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [09:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:37] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 85, down: 6, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:08:38] Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Dzahn) Enabled remote IPMI on these machines which was disabled but is needed. ([[ https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_remote_IPMI_enabled?
| wikitech how to ]] [09:08:39] (CR) Kormat: [C: +1] mariadb: Temporarilly add db1141 to the list of special hosts [puppet] - https://gerrit.wikimedia.org/r/597476 (https://phabricator.wikimedia.org/T249188) (owner: Jcrespo) [09:09:27] (PS2) Jcrespo: mariadb: Temporarilly add db1141 to the list of special hosts [puppet] - https://gerrit.wikimedia.org/r/597476 (https://phabricator.wikimedia.org/T249188) [09:09:48] good boi [09:10:07] (CR) Jcrespo: [C: +2] mariadb: Temporarilly add db1141 to the list of special hosts [puppet] - https://gerrit.wikimedia.org/r/597476 (https://phabricator.wikimedia.org/T249188) (owner: Jcrespo) [09:10:16] Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1019.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202005200910_dzahn_257748_ganeti10... [09:10:17] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:10:45] it seems wack a mole this morning :D [09:11:25] Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1020.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202005200911_dzahn_257903_ganeti10...
[09:11:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for new host db1142 and start to repool db1084', diff saved to https://phabricator.wikimedia.org/P11246 and previous config saved to /var/cache/conftool/dbconfig/20200520-091153-marostegui.json [09:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:08] Operations, serviceops, Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1017.eqiad.wmnet'] ` and were **ALL** successful. [09:12:45] Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1021.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202005200912_dzahn_258005_ganeti10... [09:12:50] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:34] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 200, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:00] Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1022.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202005200914_dzahn_258218_ganeti10...
[09:15:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:38] (03PS1) 10Dzahn: DHCP: add malmok.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/597477 (https://phabricator.wikimedia.org/T253024) [09:15:58] (03CR) 10Dzahn: [C: 03+2] DHCP: add malmok.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/597477 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [09:20:34] (03PS1) 10Dzahn: install_server: add malmok partman recipe line [puppet] - 10https://gerrit.wikimedia.org/r/597479 (https://phabricator.wikimedia.org/T253024) [09:21:17] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1018.eqiad.wmnet'] ` and were **ALL** successful. [09:22:16] (03PS5) 10Muehlenhoff: Add a define for creating a system user using systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/597265 [09:23:31] (03PS2) 10Dzahn: install_server: add malmok partman recipe line [puppet] - 10https://gerrit.wikimedia.org/r/597479 (https://phabricator.wikimedia.org/T253024) [09:24:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:58] (03CR) 10Dzahn: [C: 03+2] install_server: add malmok partman recipe line [puppet] - 10https://gerrit.wikimedia.org/r/597479 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [09:26:16] !log Upgrade db1083 (s1 master) to 10.1.43-2 without restarting T251982 [09:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:20] T251982: Upgrade and restart s1 (enwiki) primary 
database master: Thu 21th May - https://phabricator.wikimedia.org/T251982 [09:26:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:49] (03CR) 10Addshore: [C: 03+1] "+1 to go ahead and do this." [puppet] - 10https://gerrit.wikimedia.org/r/597413 (https://phabricator.wikimedia.org/T208416) (owner: 10Andrew Bogott) [09:26:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:01] (03PS3) 10Arturo Borrero Gonzalez: apt-upgrade: give support to understand dist/component [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) [09:28:07] !log create ARIN inetnum 198.35.27.0/24 and route 198.35.26.0/24 + 198.35.27.0/24 - T253196 [09:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:11] T253196: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 [09:28:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [09:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:29:19] elukey: hahah yeah, as long as they're not all at the same time :) [09:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:38] (03PS1) 10Filippo Giunchedi: profile: open port 80 for thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597481 (https://phabricator.wikimedia.org/T233956) [09:29:53] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Use envoy for TLS termination on the appservers - https://phabricator.wikimedia.org/T247389 (10Joe) Status update: we've deployed envoy on all mediawiki servers with the exception of: - jobrunners (where we still... 
[09:30:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:42] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1019.eqiad.wmnet'] ` and were **ALL** successful. [09:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1142 and db1084 on s4', diff saved to https://phabricator.wikimedia.org/P11247 and previous config saved to /var/cache/conftool/dbconfig/20200520-093141-marostegui.json [09:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:16] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Dzahn) @akosiaris All of these hosts have RAID5 now: ` ===== NODE GROUP ===== (10) ganeti[1009-... [09:32:19] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: open port 80 for thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597481 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:33:07] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Dzahn) a:05Dzahn→03akosiaris Handing back over for the next "init" command steps you have mentioned are needed next. 
[09:33:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] apt-upgrade: give support to understand dist/component [puppet] - 10https://gerrit.wikimedia.org/r/596483 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [09:33:56] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:28] (03PS2) 10Muehlenhoff: Exclude /mnt/hfds from all debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/597298 [09:34:29] arturo: merging your change too [09:34:37] godog: ACK, thanks [09:34:44] np [09:35:15] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1021.eqiad.wmnet'] ` and were **ALL** successful. [09:36:05] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1020.eqiad.wmnet'] ` and were **ALL** successful. [09:36:45] !log create ROAs for 198.35.26.0/24 and 198.35.27.0/24 - T253196 [09:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:49] T253196: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 [09:37:38] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1022.eqiad.wmnet'] ` and were **ALL** successful. 
[09:38:01] (03PS1) 10Filippo Giunchedi: profile: fix thanos-query httpd proxypass [puppet] - 10https://gerrit.wikimedia.org/r/597482 (https://phabricator.wikimedia.org/T233956) [09:39:00] (03CR) 10Ema: [C: 03+1] ATS: Disable KA for POST requests on esams [puppet] - 10https://gerrit.wikimedia.org/r/597473 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [09:43:07] (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable KA for POST requests on esams [puppet] - 10https://gerrit.wikimedia.org/r/597473 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [09:43:55] !log disable KA for POST/PUT requests on esams - T249335 [09:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:59] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [09:44:28] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: fix thanos-query httpd proxypass [puppet] - 10https://gerrit.wikimedia.org/r/597482 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:45:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:46:02] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10dr0ptp4kt) Yes - thanks for checking. 
[09:46:35] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: drop package_from_component define [puppet] - 10https://gerrit.wikimedia.org/r/597484 [09:49:24] 10Operations: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (10Dzahn) These 4 hosts have been reimaged and now have RAID5 instead of RAID1 after [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/597261 | gerrit:597261 ]] [09:49:28] (03PS1) 10Filippo Giunchedi: hieradata: move thanos-query service to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/597485 (https://phabricator.wikimedia.org/T233956) [09:50:43] (03CR) 10Filippo Giunchedi: "Service is up already:" [puppet] - 10https://gerrit.wikimedia.org/r/597485 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:51:13] (03CR) 10Gehel: "This looks much better than what we have now! A few more comments inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [09:51:41] (03CR) 10Muehlenhoff: [C: 04-1] profile::url_downloader: Add types and switch to lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [09:51:59] (03PS8) 10Muehlenhoff: profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 [09:52:43] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:33] (03PS9) 10Muehlenhoff: profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 [09:55:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:57:53] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 199, down: 3, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:58:05] (03CR) 10Jbond: "lgtm but the type definition is in the incorrect location" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [09:58:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/597298 (owner: 10Muehlenhoff) [10:05:30] (03CR) 10Muehlenhoff: [C: 03+2] Exclude /mnt/hfds from all debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/597298 (owner: 10Muehlenhoff) [10:07:26] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool es1018, es1015 at 50% weight', diff saved to https://phabricator.wikimedia.org/P11249 and previous config saved to /var/cache/conftool/dbconfig/20200520-100726-jynus.json [10:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:05] (03CR) 10Filippo Giunchedi: [C: 03+1] lvs::monitor: expand icinga service descriptions [puppet] - 10https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: 10Herron) [10:15:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:19:16] (03CR) 10Elukey: "Thanks for the review!" 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [10:19:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase weight for db1142 and db1084 on s4', diff saved to https://phabricator.wikimedia.org/P11250 and previous config saved to /var/cache/conftool/dbconfig/20200520-101928-marostegui.json [10:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:11] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 198, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:21:48] (03CR) 10Filippo Giunchedi: "I'm not familiar enough with docker-pkg to give a meaningful vote, but LGTM to my untrained eye!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 (owner: 10Cwhite) [10:23:11] (03CR) 10Muehlenhoff: profile::java: one profile to rule them all (openjdk-x versions) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [10:24:49] (03CR) 10Elukey: [C: 04-1] "setting this WIP again, going to try to refactor the code to include all the suggestions :)" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [10:25:21] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 49 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:25:28] (03CR) 10Filippo Giunchedi: "Ditto as Id7cfe3790b0, not familiar enough with docker-pkg to give a meaningful vote, although a question: is the configuration shipped he" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [10:28:00] !log rolling restart of ats-tls in text@esams - T249335 [10:28:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: drop package_from_component define [puppet] - 
10https://gerrit.wikimedia.org/r/597484 (owner: 10Arturo Borrero Gonzalez) [10:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:04] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [10:28:25] 10Operations, 10Traffic, 10netops: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (10ayounsi) [10:29:51] (03PS1) 10Ayounsi: ulsfo: shrink 198.35.26.0/23 to 198.35.26.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/597486 (https://phabricator.wikimedia.org/T253196) [10:30:12] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10JjELT) I need to move a child project - not a top... [10:36:35] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:56] (03CR) 10Muehlenhoff: Add a define for creating a system user using systemd-sysusers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [10:37:04] (03PS6) 10Muehlenhoff: Add a define for creating a system user using systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/597265 [10:40:00] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Aklapper) @JjELT: This task is about documenting t... 
[10:46:18] !log installing 4.19.118 Linux packages on Buster hosts [10:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:14] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Feature Request: Extend move_project.php to allow moving child projects - https://phabricator.wikimedia.org/T253214 (10JjELT) [10:56:47] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:56:53] 10Operations, 10Phabricator, 10Security-Team, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10Aklapper) >>! In T245771#6022035, @jbond wrote: > The script has been updated please feel free to test it furth... [10:58:37] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 87, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200520T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. 
[11:00:20] looks like nothing to SWAT indeed [11:03:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:16] !log roll out update or exim4 [11:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:47] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 198, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:06:04] (03CR) 10Muehlenhoff: docker build: update the build process to us docker (033 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) (owner: 10Jbond) [11:06:33] (03CR) 10Ema: [C: 03+1] hieradata: move thanos-query service to port 80 [puppet] - 10https://gerrit.wikimedia.org/r/597485 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [11:07:33] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool es1018, es1015 fully', diff saved to https://phabricator.wikimedia.org/P11252 and previous config saved to /var/cache/conftool/dbconfig/20200520-110732-jynus.json [11:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool new host db1142 and db1084', diff saved to https://phabricator.wikimedia.org/P11253 and previous config saved to /var/cache/conftool/dbconfig/20200520-111013-marostegui.json [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:47] 10Operations, 10observability, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348 (10Dzahn) >>! In T165348#6144076, @jcrespo wrote: > I would like to separate better "things that can wait" from "outages or potential outages" on different dashboards. I ag... 
[11:11:56] (03PS1) 10Ayounsi: Accept 198.35.27.0/24 from anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/597507 (https://phabricator.wikimedia.org/T253196) [11:16:14] (03CR) 10Ayounsi: [C: 03+1] "LGTM! PCC too https://puppet-compiler.wmflabs.org/compiler1001/22617/" [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [11:18:05] (03PS1) 10Dzahn: site: add malmok.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/597512 (https://phabricator.wikimedia.org/T253024) [11:18:27] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (10ayounsi) [11:24:34] (03CR) 10Dzahn: [C: 03+2] site: add malmok.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/597512 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [11:25:57] (03PS1) 10Giuseppe Lavagetto: jobrunner: add code to switch to envoy, switch codfw [puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) [11:26:07] (03PS1) 10Arturo Borrero Gonzalez: templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) [11:26:26] 10Operations, 10observability, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348 (10jcrespo) I don't want to write more here because it is out of topic- I agree with everything you say, but let me go in a different direction: The main issue is that curre... 
[11:26:35] (03CR) 10jerkins-bot: [V: 04-1] templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [11:28:41] (03PS2) 10Arturo Borrero Gonzalez: templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) [11:29:04] (03CR) 10jerkins-bot: [V: 04-1] templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [11:32:38] (03CR) 10Hnowlan: [C: 03+2] mediawiki:jobrunner_tls: Remove runjobs monitoring [puppet] - 10https://gerrit.wikimedia.org/r/592631 (https://phabricator.wikimedia.org/T243096) (owner: 10Hnowlan) [11:33:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:35:35] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 197, down: 5, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:37:33] !log rebooting ganeti1009 and ganeti1011 to hopefully clear icinga alerts about microcode mitigations [11:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:05] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Aklapper) [11:51:11] (03PS3) 10Arturo Borrero Gonzalez: templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) [11:53:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [11:56:10] 10Operations, 10Phabricator, 10Security-Team, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10MoritzMuehlenhoff) I think we can close the task, it'll be implicitly tested the next time we offboard someone... [11:58:11] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [12:00:05] Amir1 and Urbanecm: May I have your attention please! Creating two wikis. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200520T1200) [12:00:21] o/ [12:01:43] (03PS11) 10Ladsgroup: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) [12:02:34] (03CR) 10Ladsgroup: [C: 03+2] "I'll add wikiversions later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) [12:03:25] (03Merged) 10jenkins-bot: Initial config for awawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/593743 (https://phabricator.wikimedia.org/T251371) (owner: 10Zoranzoki21) [12:04:12] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10soworu) >>! In T252703#6142940, @Dzahn wrote: > @soworu Use the same user/password you used on https://wikitech.wikimed... [12:06:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) >>! In T252703#6152022, @soworu wrote: >>>! In T252703#6142940, @Dzahn wrote: >> @soworu Use the same user/passw... 
[12:07:05] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 01m 08s) [12:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:18] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) Please try the following variations of the username including the capitalization: soworu-01 and SOworu [12:08:21] (03PS1) 10Ladsgroup: Add awawiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597520 (https://phabricator.wikimedia.org/T251371) [12:08:24] PROBLEM - Disk space on mx2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mx2001&var-datasource=codfw+prometheus/ops [12:08:55] (03CR) 10Ladsgroup: [C: 03+2] Add awawiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597520 (https://phabricator.wikimedia.org/T251371) (owner: 10Ladsgroup) [12:09:40] (03Merged) 10jenkins-bot: Add awawiki to wikiversions.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597520 (https://phabricator.wikimedia.org/T251371) (owner: 10Ladsgroup) [12:10:08] im looking at the exim issue [12:10:47] ACKNOWLEDGEMENT - Disk space on mx2001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied John Bond investigating https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mx2001&var-datasource=codfw+prometheus/ops [12:11:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:20] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: Create 
Awadhi Wikipedia (awawiki) - T251371 [12:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:23] T251371: Create Awadhi Wikipedia - https://phabricator.wikimedia.org/T251371 [12:12:48] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 198, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:50] RECOVERY - Disk space on mx2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mx2001&var-datasource=codfw+prometheus/ops [12:13:12] (03PS1) 10Dzahn: convert malmok from private to public IPs [dns] - 10https://gerrit.wikimedia.org/r/597523 (https://phabricator.wikimedia.org/T253024) [12:13:38] (03CR) 10jerkins-bot: [V: 04-1] convert malmok from private to public IPs [dns] - 10https://gerrit.wikimedia.org/r/597523 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [12:13:46] (03PS2) 10Dzahn: convert malmok from private to public IPs [dns] - 10https://gerrit.wikimedia.org/r/597523 (https://phabricator.wikimedia.org/T253024) [12:14:07] urgh, I didn't need to sync that file [12:14:10] (03CR) 10jerkins-bot: [V: 04-1] convert malmok from private to public IPs [dns] - 10https://gerrit.wikimedia.org/r/597523 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [12:14:16] (03CR) 10Jbond: [C: 03+1] ulsfo: shrink 198.35.26.0/23 to 198.35.26.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/597486 (https://phabricator.wikimedia.org/T253196) (owner: 10Ayounsi) [12:14:33] btw. 
awa.wikipedia.org works fine in mwdebug1001 [12:14:45] !log ladsgroup@deploy1001 Synchronized multiversion/MWMultiVersion.php: Create Awadhi Wikipedia (awawiki) - T251371 (duration: 01m 06s) [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:53] (03PS1) 10Kormat: mariadb: Add db2137 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) [12:16:38] !log ladsgroup@deploy1001 Synchronized static/images/project-logos: Create Awadhi Wikipedia (awawiki) - T251371 (duration: 01m 06s) [12:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:41] marostegui: ^ [12:17:16] (03CR) 10Marostegui: "which other host will this replace?" [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [12:18:09] (03PS2) 10Kormat: mariadb: Add db2137 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) [12:18:21] !log ladsgroup@deploy1001 Synchronized langlist: Create Awadhi Wikipedia (awawiki) - T251371 (duration: 01m 06s) [12:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:24] T251371: Create Awadhi Wikipedia - https://phabricator.wikimedia.org/T251371 [12:19:24] (03CR) 10Kormat: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [12:19:40] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597530 [12:19:41] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597530 (owner: 10Ladsgroup) [12:19:44] (03CR) 10Marostegui: "Will you patch instances.yaml separately or was it forgotten on this patch?" 
[puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [12:20:16] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597529 (owner: 10Ladsgroup) [12:20:22] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597530 (owner: 10Ladsgroup) [12:20:39] (03CR) 10Dzahn: [C: 03+2] convert malmok from private to public IP [dns] - 10https://gerrit.wikimedia.org/r/597528 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [12:21:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [12:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:13] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 03m 01s) [12:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:20] (03PS1) 10Muehlenhoff: Change email address for Jason [puppet] - 10https://gerrit.wikimedia.org/r/597532 [12:22:20] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:26] Okay, one done, one to go [12:22:27] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `malmok.codfw.wmnet` - malmok.codfw.wmnet (**FAIL**) - Failed downtime h... 
[12:22:41] (03PS3) 10Kormat: mariadb: Add db2137 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) [12:22:43] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.53:80]) https://wikitech.wikimedia.org/wiki/PyBal [12:23:02] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10soworu) >>! In T252703#6152029, @Dzahn wrote: > Please try the following variations of the username including the capit... [12:23:04] (03CR) 10Kormat: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [12:24:48] I hate wikiversions.json with every cell of my body, fixing merge conflicts [12:26:17] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.53:80]) https://wikitech.wikimedia.org/wiki/PyBal [12:26:28] (03PS3) 10Ladsgroup: Initial configuration for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591084 (https://phabricator.wikimedia.org/T249506) (owner: 10Urbanecm) [12:26:37] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [12:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:22] (03CR) 10Marostegui: [C: 03+1] mariadb: Add db2137 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [12:27:33] (03CR) 10Ladsgroup: [C: 03+2] Initial configuration for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591084 (https://phabricator.wikimedia.org/T249506) (owner: 10Urbanecm) [12:28:04] !log roll-restart pybal on codfw low-traffic - T233956 [12:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:07] T233956: Deploy 
Thanos (long-term storage) stateless components: sidecar and query - https://phabricator.wikimedia.org/T233956 [12:28:24] (03CR) 10Kormat: [C: 03+2] mariadb: Add db2137 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) (owner: 10Kormat) [12:28:26] (03Merged) 10jenkins-bot: Initial configuration for gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591084 (https://phabricator.wikimedia.org/T249506) (owner: 10Urbanecm) [12:28:57] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) @soworu Try again now, please. Looks like i forgot to add the -01 part when adding you the right group. Sorry ab... [12:29:08] (03PS4) 10Kormat: mariadb: Add db2137 to s4+s5 [puppet] - 10https://gerrit.wikimedia.org/r/597524 (https://phabricator.wikimedia.org/T252987) [12:32:44] (03PS1) 10Marostegui: dashboard: Change tendril_purge_global_status_log_5m [software/tendril] - 10https://gerrit.wikimedia.org/r/597535 (https://phabricator.wikimedia.org/T252331) [12:33:14] !log ladsgroup@deploy1001 Synchronized dblists: Creating Wiktionary Konkani (gomwiktionary) - T249506 (duration: 01m 06s) [12:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:17] T249506: Create Wiktionary Konkani - https://phabricator.wikimedia.org/T249506 [12:33:33] (03PS1) 10Dzahn: admins: fix uid of Segun Oworu [puppet] - 10https://gerrit.wikimedia.org/r/597536 (https://phabricator.wikimedia.org/T252703) [12:34:56] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) @soworu Looks like it works now, i think i just saw you login in the log files, am i right? 
[12:34:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10soworu) >>! In T252703#6152084, @Dzahn wrote: > @soworu Try again now, please. Looks like i forgo... [12:35:24] !log ladsgroup@deploy1001 rebuilt and synchronized wikiversions files: Creating Wiktionary Konkani (gomwiktionary) - T249506 [12:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:51] (03CR) 10Muehlenhoff: [C: 03+2] Change email address for Jason [puppet] - 10https://gerrit.wikimedia.org/r/597532 (owner: 10Muehlenhoff) [12:38:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating Wiktionary Konkani (gomwiktionary) - T249506 (duration: 01m 05s) [12:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:59] T249506: Create Wiktionary Konkani - https://phabricator.wikimedia.org/T249506 [12:39:01] (03PS2) 10Dzahn: admins: fix uid of Segun Oworu [puppet] - 10https://gerrit.wikimedia.org/r/597536 (https://phabricator.wikimedia.org/T252703) [12:40:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:14] !log ladsgroup@deploy1001 Synchronized static/images/project-logos/: Creating Wiktionary Konkani (gomwiktionary) - T249506 (duration: 01m 06s) [12:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:42:01] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597538 [12:42:03] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/597538 (owner: 10Ladsgroup) [12:42:50] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597538 (owner: 10Ladsgroup) [12:43:02] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 199, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:43:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10soworu) >>! In T252703#6152094, @Dzahn wrote: > @soworu Looks like it works now, i think i just s... [12:44:30] "Error: 1176 Key 'usertext_timestamp' doesn't exist in table 'revision' (10.64.16.7)" [12:44:41] Amir1: where's that? [12:44:50] db1112 [12:44:51] checking [12:45:01] Amir1: which wiki? [12:45:04] !log milimetric@deploy1001 Started deploy [analytics/refinery@a891999]: Regular analytics weekly train [analytics/refinery@a891999] [12:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:14] marostegui: gomwiktionary [12:45:19] newly created wiki [12:45:22] that sounds like a new wiki [12:45:23] yeah [12:45:24] that [12:45:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) @soworu I found the following comment in the Apache config of superset: ` 31 #... 
[12:45:58] marostegui: I'm creating it :D [12:46:12] ah cool [12:46:12] Weirdly it only fatals on my private IP [12:46:17] (03PS4) 10Filippo Giunchedi: thanos: add Envoy TLS terminator [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) [12:46:19] (03PS4) 10Filippo Giunchedi: thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) [12:46:21] (03PS3) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [12:46:23] (03PS4) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [12:46:23] which table is that? [12:46:55] marostegui: it says revision table [12:47:34] (03CR) 10jerkins-bot: [V: 04-1] thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:47:34] Amir1: that looks like it is coming from this patch: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/537676/ [12:47:54] but that is for archive [12:47:56] from what I can see [12:47:56] (03CR) 10jerkins-bot: [V: 04-1] thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:48:05] (03PS7) 10ZPapierski: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [12:48:14] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10soworu) >>! In T252703#6152109, @Dzahn wrote: > @soworu I found the following comment in the Apac... 
[12:48:15] but it says a column is missing [12:48:23] Amir1: revision table on enwiki does have KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`), [12:48:26] (03CR) 10CDanis: [C: 03+2] dbctl: make 'yes' equivalent to 'y' in confirmations [software/conftool] - 10https://gerrit.wikimedia.org/r/597318 (owner: 10CDanis) [12:48:33] it's index, sorry [12:48:42] sorry for the upcoming gerrit spam folks [12:48:42] yes, it is an index [12:48:46] (03PS5) 10Filippo Giunchedi: thanos: add Envoy TLS terminator [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) [12:48:48] (03PS5) 10Filippo Giunchedi: thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) [12:48:50] (03PS4) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [12:48:52] (03PS5) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [12:49:13] (03CR) 10jerkins-bot: [V: 04-1] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [12:49:46] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Dzahn) >>! In T252703#6152113, @soworu wrote: > SOworu worked. Now in Turnilo. Thanks. Great! E... 
[12:49:57] Amir1: found it: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/552339/19/maintenance/archives/patch-revision-actor-comment-MCR.sql [12:50:00] (03CR) 10jerkins-bot: [V: 04-1] thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:50:08] (03CR) 10Dzahn: [C: 03+2] admins: fix uid of Segun Oworu [puppet] - 10https://gerrit.wikimedia.org/r/597536 (https://phabricator.wikimedia.org/T252703) (owner: 10Dzahn) [12:50:09] That is the new patch for revision that daniel merged a few days ago [12:50:11] so probably related [12:50:23] (03PS1) 10Dzahn: DHCP: update MAC and FQDN for malmok [puppet] - 10https://gerrit.wikimedia.org/r/597540 (https://phabricator.wikimedia.org/T252703) [12:50:27] https://gom.wiktionary.org/wiki/%E0%A4%B5%E0%A4%BF%E0%A4%B6%E0%A5%87%E0%A4%B6:Contributions/8.8.8.8 for any ip that doesn't have a contribution [12:50:30] (03CR) 10jerkins-bot: [V: 04-1] thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:50:36] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/22619/thanos-fe2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:51:02] (03Merged) 10jenkins-bot: dbctl: make 'yes' equivalent to 'y' in confirmations [software/conftool] - 10https://gerrit.wikimedia.org/r/597318 (owner: 10CDanis) [12:51:04] marostegui: cool, how are we going to fix this :( [12:52:09] (03CR) 10Dzahn: [C: 03+2] DHCP: update MAC and FQDN for malmok [puppet] - 10https://gerrit.wikimedia.org/r/597540 (https://phabricator.wikimedia.org/T252703) (owner: 10Dzahn) [12:52:26] (03PS2) 10Dzahn: DHCP: update MAC and FQDN for malmok [puppet] - 10https://gerrit.wikimedia.org/r/597540 (https://phabricator.wikimedia.org/T252703) [12:52:29] !log 
ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 10m 49s) [12:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:50] log - rebase race started - godog currently ahead of mutante in round one (jk) [12:53:30] Amir1: Why is the index breaking everything? [12:53:38] (03PS8) 10ZPapierski: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [12:53:47] Amir1: I mean, I can create it now, it is an empty table, but this needs a proper fix [12:53:54] as the new wiki creations will fail [12:53:58] marostegui: haha! I should be done soon [12:53:58] (03PS6) 10Filippo Giunchedi: thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) [12:54:00] (03PS5) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [12:54:03] (03PS6) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [12:54:15] it has force index ugh [12:54:15] godog: i don't think i have gotten this one before: problems merging LABS [12:54:50] mutante: oh yeah I forgot, puppet-merge'ing now [12:54:51] (03CR) 10jerkins-bot: [V: 04-1] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [12:55:01] mutante: {{done}} [12:55:07] Amir1: That is a good catch, we need to remove that force index before we can proceed with the schema change [12:55:14] godog: thanks, just most of the time it works to say "no" and then it goes to the next one [12:55:20] yeah, trying to find it [12:56:08] (03PS7) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 
[12:56:25] Amir1: can you comment on: https://phabricator.wikimedia.org/T238966 [12:57:01] Amir1: I am going to a meeting now, ping me if it is really breaking anything else apart from this new wiki [12:57:16] (03CR) 10jerkins-bot: [V: 04-1] profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [12:57:18] marostegui: thanks. Don't worry [12:57:43] (03PS2) 10Giuseppe Lavagetto: jobrunner: add code to switch to envoy, switch codfw [puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) [12:59:15] what has the force index, for those of us who want to follow along? [12:59:29] Special:Contribs [12:59:48] but it can be more [12:59:50] ohsigh [12:59:54] right [13:00:03] this means we have to hunt around [13:00:19] !log creating two wikis are done [13:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:12] (03CR) 10Elukey: [C: 03+1] Enable base::service_auto_restart for Apache on Superset [puppet] - 10https://gerrit.wikimedia.org/r/597296 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:02:47] (03PS1) 10Dzahn: site: change FQDN of malmok to public IP [puppet] - 10https://gerrit.wikimedia.org/r/597542 (https://phabricator.wikimedia.org/T253024) [13:03:28] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache on Superset [puppet] - 10https://gerrit.wikimedia.org/r/597296 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:05:22] (03CR) 10Dzahn: [C: 03+2] site: change FQDN of malmok to public IP [puppet] - 10https://gerrit.wikimedia.org/r/597542 (https://phabricator.wikimedia.org/T253024) (owner: 10Dzahn) [13:05:30] (03PS2) 10Dzahn: site: change FQDN of malmok to public IP [puppet] - 10https://gerrit.wikimedia.org/r/597542 (https://phabricator.wikimedia.org/T253024) [13:05:42] (03PS3) 10Giuseppe Lavagetto: jobrunner: add code to switch to envoy, switch codfw 
[puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) [13:07:11] that's the only place in core with user forced as an index on revision [13:07:18] the rest looks like pretty standard stuff [13:07:32] there's all the extensions to check though [13:08:12] (03PS8) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [13:08:14] (03PS9) 10ZPapierski: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [13:08:21] codesearch.wmflabs.org? [13:09:18] (03CR) 10jerkins-bot: [V: 04-1] profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:12:13] ./ContributionsList/includes/ContributionsList.php has it also [13:12:14] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:33] nah I just did a grep across the extensions in master on a deployment-prep instance [13:12:49] I didn't see anything problematic anywhere else [13:12:54] (03CR) 10Muehlenhoff: [C: 03+2] Add a define for creating a system user using systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [13:13:12] but I was only looking at the revision table [13:13:24] (03CR) 10Elukey: [C: 04-1] profile::java: one profile to rule them all (openjdk-x versions) (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:14:05] (03PS4) 10Giuseppe Lavagetto: jobrunner: add code to switch to envoy, switch codfw [puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) [13:14:11] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Gehel) >>! In T243701#5985617, @Ladsgroup wrote: > I think this would be a decision by @Lydia_Pintscher... [13:15:32] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10Dzahn) @ssingh The VM has been created (now with public IP). It has been added to site.pp with the role(insetup) and the first puppet ran that creates users and insta... [13:15:41] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10Dzahn) 05Open→03Resolved [13:16:29] 10Operations, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10Dzahn) A VM called malmok.wikimedia.org has been created and can be used now. Currently it has the "insetup" role in site.pp. 
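[editor's note] The search described at 13:12:33 (looking across the extensions for code that still forces the `usertext_timestamp` index on `revision`) could be sketched roughly as below. This is only an illustration, not what was actually run in the channel: the helper name `find_index_references`, its parameters, and the choice of `.php`/`.sql` file suffixes are assumptions made for the example, and a plain text match will also report harmless mentions such as comments.

```python
import os
import re


# Hypothetical helper for illustration; not part of MediaWiki or Wikimedia tooling.
def find_index_references(root, index_name, extensions=(".php", ".sql")):
    """Return (path, line_number, line) for each line mentioning index_name."""
    # Word boundaries avoid matching longer identifiers that merely contain the name.
    pattern = re.compile(r"\b" + re.escape(index_name) + r"\b")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            # Only look at source files with the assumed suffixes.
            if not filename.endswith(extensions):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        hits.append((path, number, line.strip()))
    return hits
```

Calling `find_index_references("extensions", "usertext_timestamp")` on a checkout (the directory name is an assumption) would list candidate files to review by hand before an index is dropped or renamed.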
[13:16:57] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10Dzahn) ` root@malmok:~# gen_fingerprints +---------+---------+-----------------------------------------------------+ | Cipher | Algo | Fingerprint... [13:19:49] (03CR) 10Ottomata: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [13:22:31] (03PS9) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [13:23:03] (03CR) 10Muehlenhoff: profile::java: one profile to rule them all (openjdk-x versions) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:23:10] !log remove stale tcp service on lvs codfw low-traffic 10.2.1.53:10902 [13:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:37] !log milimetric@deploy1001 Finished deploy [analytics/refinery@a891999]: Regular analytics weekly train [analytics/refinery@a891999] (duration: 38m 33s) [13:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:59] !log milimetric@deploy1001 Started deploy [analytics/refinery@a891999] (thin): Regular analytics weekly train THIN [analytics/refinery@a891999] [13:23:59] moritzm: updated cr :) [13:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:09] !log milimetric@deploy1001 Finished deploy [analytics/refinery@a891999] (thin): Regular analytics weekly train THIN [analytics/refinery@a891999] (duration: 00m 10s) [13:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:37] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Ottomata) Hue allows you to access Hive and 
files in HDFS, for which you need a shell account and membership in the ana... [13:25:39] (03PS1) 10Aklapper: phabricator weekly changes email: List only open tasks by new contributors [puppet] - 10https://gerrit.wikimedia.org/r/597545 [13:25:49] having a look [13:26:06] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:26:25] (03CR) 10Filippo Giunchedi: "One nit and comment on debian/control, LGTM overall" (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [13:28:03] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for kubestage200[1-2], kubernetes200[7-14] [dns] - 10https://gerrit.wikimedia.org/r/597403 (owner: 10Papaul) [13:28:10] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for kubestage200[1-2], kubernetes200[7-14] [dns] - 10https://gerrit.wikimedia.org/r/597403 [13:28:14] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Add mgmt and production DNS for kubestage200[1-2], kubernetes200[7-14] [dns] - 10https://gerrit.wikimedia.org/r/597403 (owner: 10Papaul) [13:29:17] (03CR) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:30:02] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:30:04] (03PS10) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [13:32:02] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:32:42] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 198, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:40:42] (03CR) 
10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:43:21] (03CR) 10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:44:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:48:52] (03PS7) 10Muehlenhoff: Add debian/ directory to the build overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) [13:49:34] (03CR) 10Muehlenhoff: Add debian/ directory to the build overlay (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [13:51:27] !log authdns1001 - downtimed for physical work - T241770 [13:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:49] !log cr[12]-eqiad - re-routing ns[01] public IPs from authdns1001 (going offline for hw work) to dns1002 - T241770 [13:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:47] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [13:55:34] (03CR) 10Giuseppe Lavagetto: [C: 03+1] thanos: add Envoy TLS terminator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:57:29] (03PS1) 10QChris: Add .gitreview [software/locust] - 10https://gerrit.wikimedia.org/r/597547 [13:57:31] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [software/locust] - 10https://gerrit.wikimedia.org/r/597547 (owner: 10QChris) [14:00:52] !log cr2-eqiad - re-routing ns[01] public IPs from authdns1001 
(going offline for hw work) to dns1002 - T241770 (redo from earlier, commit didn't take for whatever reason) [14:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:25] (03CR) 10Filippo Giunchedi: thanos: add Envoy TLS terminator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:01:53] (03CR) 10Jbond: profile::java: one profile to rule them all (openjdk-x versions) (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:02:21] (03CR) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:03:00] (03PS6) 10Filippo Giunchedi: thanos: add Envoy TLS terminator [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) [14:03:02] (03PS7) 10Filippo Giunchedi: thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) [14:03:04] (03PS6) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [14:03:06] (03PS7) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [14:03:49] !log authdns1001 - poweroff for T241770 [14:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:29] (03PS1) 10Ayounsi: special-ranges6, remove 4000::/2 and 8000::/1 [homer/public] - 10https://gerrit.wikimedia.org/r/597551 [14:04:48] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add Envoy TLS terminator [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [14:05:18] (03CR) 10Elukey: [C: 04-1] profile::java: one profile to rule them all (openjdk-x 
versions) (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:05:40] (03PS1) 10Vgutierrez: Release 8.0.7-1wm10 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/597552 [14:06:57] !log special-ranges6, remove 4000::/2 and 8000::/1 [14:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:34] (03CR) 10Hnowlan: [C: 03+2] restrouter: release new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/597274 (https://phabricator.wikimedia.org/T252865) (owner: 10Hnowlan) [14:07:40] (03CR) 10jerkins-bot: [V: 04-1] restrouter: release new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/597274 (https://phabricator.wikimedia.org/T252865) (owner: 10Hnowlan) [14:07:43] (03PS2) 10Vgutierrez: Release 8.0.7-1wm10 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/597552 [14:08:03] (03CR) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:08:44] (03CR) 10CDanis: [C: 03+1] special-ranges6, remove 4000::/2 and 8000::/1 [homer/public] - 10https://gerrit.wikimedia.org/r/597551 (owner: 10Ayounsi) [14:09:04] (03PS2) 10Hnowlan: restrouter: release new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/597274 (https://phabricator.wikimedia.org/T252865) [14:09:11] 10Operations, 10observability, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10akosiaris) For what is worth and for kubernetes deploys specifically, we have in grafana an annotation that is working most of the times, but can easily fail... 
[14:10:05] (03CR) 10Ayounsi: [C: 03+2] special-ranges6, remove 4000::/2 and 8000::/1 [homer/public] - 10https://gerrit.wikimedia.org/r/597551 (owner: 10Ayounsi) [14:10:09] (03CR) 10Hnowlan: [C: 03+2] restrouter: release new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/597274 (https://phabricator.wikimedia.org/T252865) (owner: 10Hnowlan) [14:10:30] (03Merged) 10jenkins-bot: special-ranges6, remove 4000::/2 and 8000::/1 [homer/public] - 10https://gerrit.wikimedia.org/r/597551 (owner: 10Ayounsi) [14:10:31] (03Merged) 10jenkins-bot: restrouter: release new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/597274 (https://phabricator.wikimedia.org/T252865) (owner: 10Hnowlan) [14:11:11] (03PS11) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [14:11:32] (03CR) 10Jbond: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:13:16] (03CR) 10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:14:43] (03PS12) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [14:15:00] (03CR) 10Elukey: [C: 04-1] profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:15:02] (03CR) 10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:15:03] !log hnowlan@deploy1001 Started deploy [restbase/deploy@6d2f88c]: Add awa.wikipedia.org to wikipedia list [14:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:16] (03CR) 10BBlack: [C: 03+1] ulsfo: 
shrink 198.35.26.0/23 to 198.35.26.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/597486 (https://phabricator.wikimedia.org/T253196) (owner: 10Ayounsi) [14:16:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 44 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:16:45] (03CR) 10BBlack: [C: 03+1] Accept 198.35.27.0/24 from anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/597507 (https://phabricator.wikimedia.org/T253196) (owner: 10Ayounsi) [14:19:08] (03PS1) 10Giuseppe Lavagetto: Add cert for combined jobrunner [labs/private] - 10https://gerrit.wikimedia.org/r/597555 [14:19:23] (03PS5) 10Giuseppe Lavagetto: jobrunner: add code to switch to envoy, switch codfw [puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) [14:19:52] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add cert for combined jobrunner [labs/private] - 10https://gerrit.wikimedia.org/r/597555 (owner: 10Giuseppe Lavagetto) [14:20:09] (03CR) 10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [14:23:23] (03PS1) 10Giuseppe Lavagetto: Move jobrunner cert to correct path [labs/private] - 10https://gerrit.wikimedia.org/r/597556 [14:24:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Move jobrunner cert to correct path [labs/private] - 10https://gerrit.wikimedia.org/r/597556 (owner: 10Giuseppe Lavagetto) [14:25:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:29:02] (03PS1) 10Filippo Giunchedi: profile: fix thanos swift 
healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/597557 (https://phabricator.wikimedia.org/T233956) [14:31:32] (03PS6) 10Giuseppe Lavagetto: jobrunner: add code to switch to envoy, switch codfw [puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) [14:32:00] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: fix thanos swift healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/597557 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [14:33:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:34:52] !log hnowlan@deploy1001 Finished deploy [restbase/deploy@6d2f88c]: Add awa.wikipedia.org to wikipedia list (duration: 19m 49s) [14:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 47 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:38:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:39:15] (03PS4) 10Cwhite: add ca_bundle configuration option [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 [14:39:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/22626/mw2150.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/597513 (https://phabricator.wikimedia.org/T247389) (owner: 
10Giuseppe Lavagetto) [14:40:15] 10Operations, 10observability, 10Patch-For-Review: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10JMeybohm) Downside of using helmfile hooks would be that we catch the trigger, not the actual event. So deploys triggered by rollbacks for example would not... [14:41:05] (03CR) 10Andrew Bogott: [C: 03+2] Remove most (but not all) nfs mounts from the wikidata-dev project. [puppet] - 10https://gerrit.wikimedia.org/r/597413 (https://phabricator.wikimedia.org/T208416) (owner: 10Andrew Bogott) [14:41:28] (03PS1) 10Cwhite: profile: add ca_bundle configuration option to docker-pkg configs [puppet] - 10https://gerrit.wikimedia.org/r/597559 [14:42:41] (03CR) 10Marostegui: [V: 03+2 C: 03+2] dashboard: Change tendril_purge_global_status_log_5m [software/tendril] - 10https://gerrit.wikimedia.org/r/597535 (https://phabricator.wikimedia.org/T252331) (owner: 10Marostegui) [14:42:56] (03CR) 10Cwhite: "Latest patchset will need Puppet changes first so as to not interfere with integration. https://gerrit.wikimedia.org/r/c/operations/puppe" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 (owner: 10Cwhite) [14:43:30] !log Replace tendril_purge_global_status_log_5m event with the new one (purging every 2d of data and with a higher limit of rows) - T252331 [14:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:35] T252331: tendril_purge_global_status_log_5m and global_status_log needs more frequent purging - https://phabricator.wikimedia.org/T252331 [14:43:37] (03PS1) 10Andrew Bogott: cloudnet1003 and cloudnet1004: next install with Buster [puppet] - 10https://gerrit.wikimedia.org/r/597560 (https://phabricator.wikimedia.org/T253124) [14:43:41] 10Operations, 10ops-codfw, 10DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10Papaul) Memory arrived since yesterday. 
[14:46:19] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 44 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:46:24] 10Operations, 10ops-codfw, 10DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo) No rush on our side, just the day before you are going to the DC for this, let us know so I can stop the server 24h in advance. [14:46:26] 10Operations, 10SRE-tools: Create cookbook to reboot hosts - https://phabricator.wikimedia.org/T252807 (10MoritzMuehlenhoff) >>! In T252807#6144715, @Volans wrote: >>>! In T252807#6144703, @jbond wrote: >> As the CR specificity mentions operating on a single host i don't think that LBRemoteCluster comes into... [14:47:20] (03CR) 10Cwhite: "> Patch Set 2:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [14:47:44] (03CR) 10Andrew Bogott: [C: 03+1] templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [14:51:51] (03PS1) 10Jforrester: [itwikivoyage] Undeploy Insider and Listings extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597561 (https://phabricator.wikimedia.org/T253096) [14:56:40] (03CR) 10Jforrester: [C: 03+2] [itwikivoyage] Undeploy Insider and Listings extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597561 (https://phabricator.wikimedia.org/T253096) (owner: 10Jforrester) [14:57:32] (03Merged) 10jenkins-bot: [itwikivoyage] Undeploy Insider and Listings extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597561 (https://phabricator.wikimedia.org/T253096) (owner: 10Jforrester) [14:57:36] (03CR) 10Cwhite: [C: 03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: 10Herron) [14:58:01] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001:9501 job=burrow partition={0,1} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+p [14:58:01] -cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:59:26] (03CR) 10Gehel: "A few more comments (all minor)" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [15:00:06] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T253096 [itwikivoyage] Undeploy Insider and Listings extensions (duration: 01m 08s) [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:11] T253096: Undeploy Insider and Listings extensions from Italian Wikivoyage - https://phabricator.wikimedia.org/T253096 [15:00:22] (03CR) 10Volans: [C: 04-1] "I think it's missing the installation of the package." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [15:03:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) (owner: 10Arturo Borrero Gonzalez) [15:04:02] (03PS4) 10Arturo Borrero Gonzalez: templates: add reverse zone for 185.12.57.0/24 including cloud delegation [dns] - 10https://gerrit.wikimedia.org/r/597514 (https://phabricator.wikimedia.org/T247972) [15:04:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/22624/icinga1001.wikimedia.org/index.html LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: 10Herron) [15:06:09] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001:9501 job=burrow partition={4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluste [15:06:09] ar-topic=All&var-consumer_group=All [15:08:14] oof, looks like logstash codfw (only) is lagging [15:08:51] (03PS6) 10Jbond: puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) [15:11:29] (03CR) 10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [15:12:17] (03CR) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [15:12:24] (03CR) 10Jbond: "updated thanks" (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [15:14:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Unfortunately, we can't just bump to buster as is and stop supporting stretch (which is what this change would do), as kask, the software " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 (owner: 10Cwhite) [15:15:03] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [15:15:03] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [15:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Would adding the files to .helmignore be a satisfactory midpoint between moving it entirely and keeping things neat from a package persp" [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:17:22] (03CR) 10Alexandros Kosiaris: [C: 03+1] update readme (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597304 (owner: 10Cwhite) [15:17:26] (03PS1) 10Privacybatm: transfer.py: Add information to --help option [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) [15:21:14] (03CR) 10Privacybatm: "I used the information from https://wikitech.wikimedia.org/wiki/Transfer.py :)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:21:40] !log installing libssh security updates [15:21:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 (owner: 10Muehlenhoff) [15:23:52] (03CR) 10Herron: "thx all for the quick reviews! I'll move forward with this shortly" [puppet] - 10https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: 10Herron) [15:25:11] (03CR) 10Jcrespo: "Sorry for the offtopic, but I see that you implement each patch in an independeny tree. Do you have any preference in which to merge them," [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:25:42] (03CR) 10Jcrespo: "> Sorry for the offtopic, but I see that you implement each patch in" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:25:50] (03PS2) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:26:07] (03CR) 10Jcrespo: "> > Sorry for the offtopic, but I see that you implement each patch" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:26:38] (03CR) 10jerkins-bot: [V: 04-1] hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 (owner: 10Jbond) [15:31:20] (03CR) 10Herron: [C: 03+2] lvs::monitor: expand icinga service descriptions [puppet] - 10https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692) (owner: 10Herron) [15:32:16] (03PS3) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:33:46] (03CR) 10Privacybatm: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:34:34] 10Operations, 10Readers-Web-Backlog: Mobile 
redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Jdlrobson) @Isaac thanks for flagging the redirecting of the desktop to the mobile site uses Varnish (last time i checked), so this is likely an ops task (with our input). [15:34:36] (03PS3) 10Cwhite: add golang14 builder image using golang 1.14 and based on buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 [15:34:43] (03CR) 10Privacybatm: "Thank you for asking! :-)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:36:24] (03PS4) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:37:11] (03PS1) 10Elukey: profile::druid::analytics::worker: tune historical daemon settings [puppet] - 10https://gerrit.wikimedia.org/r/597573 (https://phabricator.wikimedia.org/T252771) [15:37:24] (03CR) 10Jcrespo: "Your welcome." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [15:38:51] (03PS1) 10Ema: Pass JSON file and network address as CLI flags [software/prometheus-rdkafka-exporter] - 10https://gerrit.wikimedia.org/r/597574 (https://phabricator.wikimedia.org/T253197) [15:39:09] (03PS5) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:39:31] (03CR) 10Jcrespo: [C: 03+2] Add comments to Firewall, MariaDB and transfer modules [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597158 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [15:40:55] (03CR) 10Jcrespo: [C: 03+1] CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [15:41:08] (03PS6) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:41:43] (03PS2) 10Elukey: 
profile::druid::analytics::worker: tune historical daemon settings [puppet] - 10https://gerrit.wikimedia.org/r/597573 (https://phabricator.wikimedia.org/T252771) [15:42:03] (03CR) 10Jcrespo: "It doesn't allow me to rebase automatically, can you do it manually as per request order?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [15:42:53] (03CR) 10Privacybatm: "> Patch Set 10:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [15:42:55] !log update puppet compiler's facts [15:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:55] (03CR) 10Hnowlan: [C: 03+2] Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:44:38] (03PS7) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) [15:44:57] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:45:29] (03Merged) 10jenkins-bot: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:46:31] (03PS7) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:47:09] (03PS11) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) 
[15:47:40] (03PS3) 10Elukey: profile::druid::analytics::worker: tune historical daemon settings [puppet] - 10https://gerrit.wikimedia.org/r/597573 (https://phabricator.wikimedia.org/T252771) [15:47:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, thanks! A small nitpick inline, but apart from that it's good to merge." (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 (owner: 10Cwhite) [15:47:44] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [15:49:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please amend also modules/profile/templates/docker/production-images-config.yaml.erb" [puppet] - 10https://gerrit.wikimedia.org/r/597559 (owner: 10Cwhite) [15:50:30] (03PS8) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [15:52:35] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Nuria) Is LDAP access sufficient (superset, turnilo) or access to raw data is needed? [15:52:38] (03CR) 10Jcrespo: [C: 03+2] transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [15:53:31] (03CR) 10Privacybatm: "I am working on this now." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [15:53:34] (03CR) 10Jcrespo: [C: 03+2] "This can be merged as is (it is better than before), but we should not close the ticket attached until we work a bit more on the TODOs and" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [15:55:56] (03CR) 10Privacybatm: "> Patch Set 11:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [15:56:37] (03PS7) 10Privacybatm: CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) [15:56:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [15:58:23] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:07] 10Operations, 10observability, 10Patch-For-Review: Duplicate definitions found in Icinga configuration - https://phabricator.wikimedia.org/T211692 (10herron) I think we're good here! ` Running pre-flight check on configuration data... Checking services... Checked 47497 services. Checking hosts... Checked... [16:01:34] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' . [16:01:34] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[16:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:10] (03PS1) 10Elukey: Add fake keytabs for an-druid* [labs/private] - 10https://gerrit.wikimedia.org/r/597578 [16:02:34] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytabs for an-druid* [labs/private] - 10https://gerrit.wikimedia.org/r/597578 (owner: 10Elukey) [16:02:52] (03PS9) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [16:06:27] (03PS1) 10Elukey: Fix typo in an-druid100* fake kerberos path [labs/private] - 10https://gerrit.wikimedia.org/r/597579 [16:06:40] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix typo in an-druid100* fake kerberos path [labs/private] - 10https://gerrit.wikimedia.org/r/597579 (owner: 10Elukey) [16:11:10] (03CR) 10Joal: [C: 03+1] "LGTM! Let's see if it works :)" [puppet] - 10https://gerrit.wikimedia.org/r/597573 (https://phabricator.wikimedia.org/T252771) (owner: 10Elukey) [16:15:04] (03PS1) 10Ssingh: wikidough: add role and profile (first commit) [puppet] - 10https://gerrit.wikimedia.org/r/597580 (https://phabricator.wikimedia.org/T252132) [16:23:16] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:14] (03CR) 10Jcrespo: [C: 03+2] CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [16:30:07] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10dr0ptp4kt) Raw data, please. 
[16:37:28] (03CR) 10Jcrespo: "> I used the information from https://wikitech.wikimedia.org/wiki/Transfer.py" (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [16:38:27] (03PS1) 10Aklapper: phabricator weekly changes email: List users with URLs in profile desc [puppet] - 10https://gerrit.wikimedia.org/r/597582 [16:38:45] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10Nuria) Approve on my end, please provide ssh keys [16:39:31] (03CR) 10jerkins-bot: [V: 04-1] phabricator weekly changes email: List users with URLs in profile desc [puppet] - 10https://gerrit.wikimedia.org/r/597582 (owner: 10Aklapper) [16:41:17] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) [16:44:32] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10wiki_willy) Chatted with @wkandek today on the proposed B3 or C3 racks, along with the June 9 or 11th dates/times for the mw servers. He'll check with his team on Monday, and confirm with us af... [16:44:37] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10dcipoletti) Public SSH Key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDW9Ves95wovQePlNYA+bBvGT/NjhxzWBuYP5xbDaAqOZ34CscdII8LZY3HfH2Og6TO0qxxaxXGLuVIpDlfDdOWqMzNbTEko1RO1gy7fZqHY... 
[16:48:08] 10Operations, 10ops-eqsin, 10Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (10RobH) [16:48:09] 10Operations, 10ops-eqsin: eqsin ganeti cable IDs - https://phabricator.wikimedia.org/T250369 (10RobH) [16:48:11] 10Operations, 10ops-eqsin: apply asset tags to s[12]-60[34]-eqsin - https://phabricator.wikimedia.org/T244900 (10RobH) [16:48:15] 10Operations, 10ops-codfw, 10ops-eqiad, 10ops-eqsin, and 2 others: Audit & update spares part tracking for all sites - https://phabricator.wikimedia.org/T243450 (10RobH) [16:49:39] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) TLS enabled mathoid is corrently deployed in staging and codfw k8s clusters but not in eqiad. CPU throttling has increased a lot (due... [16:53:05] !log kraz.wikimedia.org ( https://wikitech.wikimedia.org/wiki/IRCD ) - stopping ircecho then ircd, then restarting them in reverse order - T239993 [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:09] T239993: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 [16:54:47] [done] [16:56:21] (03CR) 10Privacybatm: "> Patch Set 1:" (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [16:57:37] 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. 
- https://phabricator.wikimedia.org/T252185 (10Papaul) [16:57:46] 10Operations, 10ops-eqsin, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T253246 (10RobH) [16:58:02] 10Operations, 10ops-eqsin, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install cr3-eqsin.wikimedia.org - https://phabricator.wikimedia.org/T253246 (10RobH) [16:58:42] (03CR) 10Jcrespo: "> > Patch Set 1:" (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597569 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm) [17:02:23] !log dns* + authdns* - disabling puppet to test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597311/ [17:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:29] (03CR) 10BBlack: [C: 03+2] Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [17:07:21] dns4002 may have some service alerts; known, looking [17:09:24] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:09:26] (03PS1) 10BBlack: Revert "Add testable anycast authdns address and bird stuff" [puppet] - 10https://gerrit.wikimedia.org/r/597586 [17:09:38] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns4002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:09:50] PROBLEM - Bird Internet Routing Daemon on dns4002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:09:55] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Add testable anycast authdns address and bird stuff" [puppet] - 
10https://gerrit.wikimedia.org/r/597586 (owner: 10BBlack) [17:09:57] (03CR) 10Elukey: [C: 03+2] profile::druid::analytics::worker: tune historical daemon settings [puppet] - 10https://gerrit.wikimedia.org/r/597573 (https://phabricator.wikimedia.org/T252771) (owner: 10Elukey) [17:10:28] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:29] bblack: can I merge? [17:10:30] PROBLEM - Recursive DNS on 2620:0:863:1:198:35:26:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:10:32] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:10:34] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:34] elukey: yes, please [17:10:42] PROBLEM - Recursive DNS on 198.35.26.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:10:59] bblack: done [17:11:40] all the alertspam there is from dns4002 fallout. nothing should actually be production-broken, sorry for the noise! 
[17:12:19] (03PS1) 10Alexandros Kosiaris: mathoid: Uninstall mathoid canary in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/597587 [17:12:20] RECOVERY - Recursive DNS on 2620:0:863:1:198:35:26:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:12:21] (03PS1) 10Alexandros Kosiaris: Revert "mathoid: Test canary functionality in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/597588 [17:12:22] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 22, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:26] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:12:28] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 89, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:32] RECOVERY - Recursive DNS on 198.35.26.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [17:13:08] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:22] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns4002 is OK: OK: UP (pid=27644) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:13:34] RECOVERY - Bird Internet Routing Daemon on dns4002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:14:32] (03PS2) 10Cwhite: profile: add ca_bundle configuration option to docker-pkg configs [puppet] - 10https://gerrit.wikimedia.org/r/597559 [17:16:03] (03PS1) 10Elukey: Revert "profile::druid::analytics::worker: tune historical daemon settings" [puppet] - 10https://gerrit.wikimedia.org/r/597589 [17:16:29] (03CR) 10Elukey: [C: 03+2] "Needs a bit more tuning since direct memory needs to 
be adjusted as well" [puppet] - 10https://gerrit.wikimedia.org/r/597589 (owner: 10Elukey) [17:16:51] (03PS4) 10Cwhite: add golang13 and golang14 builder images using based on wikimedia-buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 [17:17:50] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:06] PROBLEM - Druid historical on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args io.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:18:40] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (10RLazarus) 05Open→03Resolved a:05KFrancis→03RLazarus Thanks Katie! @diego @Rvvalentim Rodolfo is now a member of the `wmf` LDAP group -... 
[17:18:44] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10RLazarus) [17:19:42] RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:58] RECOVERY - Druid historical on an-druid1001 is OK: PROCS OK: 1 process with command name java, args io.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [17:20:37] this was me --^ [17:21:55] (03CR) 10Cwhite: "> Patch Set 2: Code-Review-1" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 (owner: 10Cwhite) [17:25:56] (03PS1) 10BBlack: Add testable anycast authdns address (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/597590 (https://phabricator.wikimedia.org/T98006) [17:30:52] (03PS2) 10BBlack: Add testable anycast authdns address (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/597590 (https://phabricator.wikimedia.org/T98006) [17:32:39] (03PS3) 10Cwhite: add loki 1.4.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) [17:32:58] 10Operations, 10ops-eqsin, 10Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (10RobH) Ok, for memory tests we need to clear the SEL, so just dumping its output here for easy review later (its stored in the server still but not readable without a data dump and sorting): ` admin1->... [17:34:56] (03PS1) 10Bstorm: paws-kubeadm: Add option for stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/597591 (https://phabricator.wikimedia.org/T211096) [17:37:54] (03CR) 10Bstorm: "Verified this is a noop using toolsbeta. It should just drop the necessary bits where needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/597591 (https://phabricator.wikimedia.org/T211096) (owner: 10Bstorm) [17:38:14] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10RLazarus) @dcipoletti Thanks, looks good! All I need you to do now is: * sign L3 (there should be an option to sign it right on that page) and * let me know that you've read... [17:39:53] (03CR) 10BBlack: [C: 03+2] Add testable anycast authdns address (attempt 2) [puppet] - 10https://gerrit.wikimedia.org/r/597590 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [17:44:15] !log cp5012 rebooting for troubleshooting [17:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:17] (03PS2) 10Bstorm: paws-kubeadm: Add option for stacked control plane [puppet] - 10https://gerrit.wikimedia.org/r/597591 (https://phabricator.wikimedia.org/T211096) [17:49:17] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (10Rvvalentim) Thanks, @RLazarus! Where do I find the password? [17:49:36] PROBLEM - Host nsa-v4 is DOWN: PING CRITICAL - Packet loss = 100% [17:49:58] (03PS1) 10BBlack: bird anycast config: filter new anycast network [puppet] - 10https://gerrit.wikimedia.org/r/597594 (https://phabricator.wikimedia.org/T98006) [17:51:44] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/22646/toolsbeta-test-k8s-control-1.toolsbeta.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/597591 (https://phabricator.wikimedia.org/T211096) (owner: 10Bstorm) [17:52:03] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (10RLazarus) This uses your LDAP password -- the same one you use to log into Wikitech. 
The username should be either your Wikitech username (`Rod... [17:52:24] (03CR) 10Bstorm: "So with the notion this is a noop for purposes of existing VMs, let this be a question if everything thinks the style and notion of doing " [puppet] - 10https://gerrit.wikimedia.org/r/597591 (https://phabricator.wikimedia.org/T211096) (owner: 10Bstorm) [17:52:26] ACKNOWLEDGEMENT - Host nsa-v4 is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black T98006 ongoing work [17:53:06] oh no they're really named nsa and nsb [17:53:17] (03PS1) 10Papaul: Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) [17:53:44] (03CR) 10jerkins-bot: [V: 04-1] Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp with role insetup [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) (owner: 10Papaul) [17:57:17] !log accept 198.35.27.0/24 from Anycast peers on cr3-ulsfo - T253196 [17:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:20] T253196: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 [18:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200520T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:00:04] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200520T1800) [18:00:12] RECOVERY - Host nsa-v4 is UP: PING OK - Packet loss = 0%, RTA = 74.64 ms [18:00:20] bblack: ^ [18:00:29] nice [18:00:44] I guess it's hitting ulsfo from eqiad for now heh [18:00:54] yep [18:01:19] I need to update some filters later on so it only does POP->core and never core->POP [18:01:22] !log add BGP between authdns2001 and cr1-codfw - T253196 [18:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:24] well really, in this case I don't know that we need to filter that [18:02:33] so long as no router advertises it to the public without a direct nearby peer [18:03:06] (03CR) 10BBlack: [C: 03+2] bird anycast config: filter new anycast network [puppet] - 10https://gerrit.wikimedia.org/r/597594 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [18:03:24] bblack: in the current state it will be cleaner, the only reason it makes it to the core is because it's still part of the ulsfo IP space [18:03:45] right [18:04:17] in the recdns case, the reason we cared is that internal recdns latency can be a critical thing, and we didn't want the scenario where e.g.
eqiad recdnses are all-dead, and instead of falling back to codfw it falls back to like eqsin or something [18:04:41] (03PS2) 10Papaul: Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) [18:04:43] but the latency of this isn't critical internally, all we really care about is the net effect on public routing [18:05:00] (if no local advertisers, don't advertise to the world, let them make their own path to one of our other edges) [18:05:08] (03CR) 10jerkins-bot: [V: 04-1] Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) (owner: 10Papaul) [18:05:16] yep, we can do that easily and automatically [18:08:39] 10Operations, 10Jupyter-Hub, 10LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (10RLazarus) After chatting with @colewhite (thanks!) I've moved Rodolfo from the `wmf` group over to the `nda` group, which should still provide... 
[18:10:52] !log accept 198.35.27.0/24 from Anycast peers on all routers - T253196 [18:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:59] T253196: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 [18:12:26] (03PS3) 10Papaul: Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) [18:12:54] (03CR) 10jerkins-bot: [V: 04-1] Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) (owner: 10Papaul) [18:21:07] 10Operations, 10Traffic, 10netops, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) Update: the `nsa` authdns IP at `198.35.27.27` is live internally everywhere and monitored and working. There's some stuff to finish up later this week for the public side i... [18:23:33] 10Operations, 10Traffic, 10netops, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) (correction - it's also internet-reachable via ulsfo only for now, in this interim state, just by chance because it's still advertising the whole original /23) [18:24:05] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Addshore) Indeed, currently I would only see wdqs inclusion in maxlag as a bandaid waiting for a proper... 
[18:29:59] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (10ayounsi) [18:30:15] (03CR) 10Ayounsi: [C: 03+2] Accept 198.35.27.0/24 from anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/597507 (https://phabricator.wikimedia.org/T253196) (owner: 10Ayounsi) [18:30:35] (03Merged) 10jenkins-bot: Accept 198.35.27.0/24 from anycast peers [homer/public] - 10https://gerrit.wikimedia.org/r/597507 (https://phabricator.wikimedia.org/T253196) (owner: 10Ayounsi) [18:30:50] (03PS5) 10Cwhite: add ca_bundle configuration option [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 [18:32:34] (03PS1) 10Ayounsi: Add 198.35.27.27 to border-in4 term authdns [homer/public] - 10https://gerrit.wikimedia.org/r/597598 (https://phabricator.wikimedia.org/T253196) [18:37:54] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (10ayounsi) [18:38:34] (03PS4) 10Papaul: Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) [18:39:03] (03CR) 10jerkins-bot: [V: 04-1] Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) (owner: 10Papaul) [18:54:23] 10Operations, 10Analytics, 10Traffic: Publishing project anomaly data for censorship researchers. 
Evaluate privacy threats - https://phabricator.wikimedia.org/T183990 (10Nuria) a:05Nuria→03None [18:55:02] (03PS5) 10Papaul: Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) [18:55:33] (03CR) 10jerkins-bot: [V: 04-1] Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) (owner: 10Papaul) [18:55:54] (03PS1) 10Bstorm: paws: Add a profile to provide some special config for paws [puppet] - 10https://gerrit.wikimedia.org/r/597602 (https://phabricator.wikimedia.org/T211096) [19:00:01] 10Operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132 (10dr0ptp4kt) a:05dr0ptp4kt→03None [19:03:09] 10Operations, 10Graphoid, 10Services (watching): Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237 (10Yurik) a:05Yurik→03None [19:05:38] RoanKattouw: djellel mentioned that he had a conversation with you and marshall about how to surface the link recommendation results via VE. He referred to a staging environment where his api can expose the annotated wikitext. Can you point me to the right place for me to read about the staging environment? 
[19:08:30] leila: That's a bit of a misnomer, but let me explain [19:09:01] VE has a feature that lets you switch back and forth between visual and wikitext mode mid-edit [19:09:11] (03PS2) 10Ayounsi: ulsfo: shrink 198.35.26.0/23 to 198.35.26.0/24 [homer/public] - 10https://gerrit.wikimedia.org/r/597486 (https://phabricator.wikimedia.org/T253196) [19:09:13] (03PS1) 10Ayounsi: Advertise anycast 198.35.27.0/24 to the world [homer/public] - 10https://gerrit.wikimedia.org/r/597603 (https://phabricator.wikimedia.org/T253196) [19:09:57] This means that, when you open VE by switching from wikitext and you've made changes, the wikitext you're loading into the editor doesn't "exist" anywhere (it's not a saved version of a page for example) [19:10:44] That's not an enormous problem, but Parsoid does use the old wikitext for some optimizations in reducing dirty diffs. So to make that work better, there's some sort of ephemeral area (it used to be a RESTbase thing, not sure how it works now) where this is stored [19:10:57] In any case, the link recommendation stuff doesn't need to use this directly [19:11:24] It can just spit out wikitext, and our integration with it can take care of this magic (probably by piggybacking on VE and letting it take care of this magic) [19:12:25] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Advertise 198.35.27.0/24 as anycast prefix - https://phabricator.wikimedia.org/T253196 (10ayounsi) [19:14:28] (03PS6) 10Papaul: Add kubestage200[1-2] , kubernetes200[7-14] MAC entries and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/597595 (https://phabricator.wikimedia.org/T252185) [19:27:11] !log cp5012 still offline for mem tests, "fast" testing complete without errors and extended testing in progress. system firmware was updated before testing. 
T251219 [19:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:14] T251219: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 [19:27:25] (03PS1) 10RLazarus: admin: Create shell account for mbinder and add to phabricator-bulk-manager [puppet] - 10https://gerrit.wikimedia.org/r/597605 (https://phabricator.wikimedia.org/T251349) [19:31:29] (03CR) 10CDanis: [C: 03+1] admin: Create shell account for mbinder and add to phabricator-bulk-manager [puppet] - 10https://gerrit.wikimedia.org/r/597605 (https://phabricator.wikimedia.org/T251349) (owner: 10RLazarus) [19:31:47] (03CR) 10RLazarus: [C: 03+2] admin: Create shell account for mbinder and add to phabricator-bulk-manager [puppet] - 10https://gerrit.wikimedia.org/r/597605 (https://phabricator.wikimedia.org/T251349) (owner: 10RLazarus) [19:45:31] (03CR) 10Jbond: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [19:48:37] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10RLazarus) Hi @MBinder_WMF! I've created your shell account and added it to the group that @Dzahn crea... [19:52:02] (03PS1) 10CDanis: fnm: increase thresholds [puppet] - 10https://gerrit.wikimedia.org/r/597606 (https://phabricator.wikimedia.org/T249454) [19:55:48] 10Operations, 10LDAP-Access-Requests: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10RLazarus) Hi @MNoorWMF, just checking on this -- can you please confirm that the `mshaver` account is working for you? If it is, I'll remove access from `mnoor` so that we don't leave both l... [20:00:04] halfak and accraze: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Graphoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200520T2000). [20:05:59] (03PS10) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:07:53] (03PS1) 10Andrew Bogott: acme_chief: allow cloudservice2003-def to access the ldap=codfw1dev cert [puppet] - 10https://gerrit.wikimedia.org/r/597609 [20:10:32] (03PS11) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:13:24] (03PS12) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:16:56] !log logstash1011:~# kafka-preferred-replica-election --zookeeper conf1004.eqiad.wmnet,conf1005.eqiad.wmnet,conf1006.eqiad.wmnet/kafka/logging-eqiad [20:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:43] (03PS13) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:23:33] (03PS1) 10Andrew Bogott: codfw1dev: try to make cloudservices2002/2003 a mirrored ldap pair [puppet] - 10https://gerrit.wikimedia.org/r/597611 [20:24:33] (03CR) 10Andrew Bogott: [C: 03+2] acme_chief: allow cloudservice2003-def to access the ldap=codfw1dev cert [puppet] - 10https://gerrit.wikimedia.org/r/597609 (owner: 10Andrew Bogott) [20:24:46] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: try to make cloudservices2002/2003 a mirrored ldap pair [puppet] - 10https://gerrit.wikimedia.org/r/597611 (owner: 10Andrew Bogott) [20:24:59] 10Operations, 10LDAP-Access-Requests: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10MNoorWMF) Hi @RLazarus - Confirming that yes, this account is working and I'm able to access Superset. thank you for closing the loop here. 
[20:27:07] (03PS1) 10Andrew Bogott: role::wmcs::openstack::codfw1dev::services: added a comment explaining why openldap is here [puppet] - 10https://gerrit.wikimedia.org/r/597612 [20:27:55] (03CR) 10jerkins-bot: [V: 04-1] role::wmcs::openstack::codfw1dev::services: added a comment explaining why openldap is here [puppet] - 10https://gerrit.wikimedia.org/r/597612 (owner: 10Andrew Bogott) [20:28:55] (03PS2) 10Andrew Bogott: role::wmcs::openstack::codfw1dev::services: added a comment re: ldap [puppet] - 10https://gerrit.wikimedia.org/r/597612 [20:31:26] (03CR) 10Andrew Bogott: [C: 03+2] role::wmcs::openstack::codfw1dev::services: added a comment re: ldap [puppet] - 10https://gerrit.wikimedia.org/r/597612 (owner: 10Andrew Bogott) [20:33:33] (03PS14) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:39:59] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Aklapper) 05Stalled→03Open p:05Medium→03Low [20:40:20] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Aklapper) @Jidanni: Is this still a problem that you are facing? [20:41:10] (03PS15) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [20:46:30] 10Operations, 10LDAP-Access-Requests: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10RLazarus) a:05MNoorWMF→03RLazarus Thanks! I've removed `mnoor` but not `mshaver` from the `wmf` LDAP group, let me know if anything suddenly stops working. ` rzl@mwmaint1002:~$ ldapsear... [20:47:47] (03PS1) 10RLazarus: admin: Remove mnoor from data.yaml, account replaced by mshaver. 
[puppet] - 10https://gerrit.wikimedia.org/r/597618 (https://phabricator.wikimedia.org/T250430) [20:51:17] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Jclark-ctr) @CDanis https://phabricator.wikimedia.org/P11209 [20:59:51] 10Operations, 10Commons, 10Wikimedia-General-or-Unknown: Upload to Commons fails with a common ADSL connection in Taiwan - https://phabricator.wikimedia.org/T205619 (10Jidanni) Yes. I have to take my cellphone to the top of the mountain for a clear connection to upload, as your form requires high speeds or e... [21:10:31] (03PS1) 10Krinkle: db-labs: Enable 'useGTIDs' in Beta Cluster, same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597624 (https://phabricator.wikimedia.org/T139044) [21:11:00] (03CR) 10Krinkle: [C: 03+2] db-labs: Enable 'useGTIDs' in Beta Cluster, same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597624 (https://phabricator.wikimedia.org/T139044) (owner: 10Krinkle) [21:11:49] (03Merged) 10jenkins-bot: db-labs: Enable 'useGTIDs' in Beta Cluster, same as prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597624 (https://phabricator.wikimedia.org/T139044) (owner: 10Krinkle) [21:12:12] 10Operations, 10ops-eqsin, 10Traffic: cp5012 memory errors - https://phabricator.wikimedia.org/T251219 (10RobH) a:05RobH→03Vgutierrez So this ran the full suite of Dell tests, including extended memory testing, without failure. I did update the firmware before testing though. @Vgutierrez Can we return... [21:27:32] RoanKattouw: at a high level I get it now, thanks. 
[21:39:44] (03PS1) 10CDanis: dbctl: diffs: recurse into complicated sub-sections [software/conftool] - 10https://gerrit.wikimedia.org/r/597631 (https://phabricator.wikimedia.org/T253025) [21:40:15] 10Operations, 10ops-codfw, 10netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [21:43:47] (03PS1) 10CDanis: dbctl: diffs: cleanup return value accumulation [software/conftool] - 10https://gerrit.wikimedia.org/r/597634 [21:44:03] (03PS1) 10Andrew Bogott: Attempt to get a real sync_pass for clouddev ldap [puppet] - 10https://gerrit.wikimedia.org/r/597635 [21:47:01] (03CR) 10Andrew Bogott: [C: 03+2] Attempt to get a real sync_pass for clouddev ldap [puppet] - 10https://gerrit.wikimedia.org/r/597635 (owner: 10Andrew Bogott) [21:50:46] (03PS1) 10Andrew Bogott: profile::openldap_clouddev: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/597637 [21:53:11] (03CR) 10Andrew Bogott: [C: 03+2] profile::openldap_clouddev: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/597637 (owner: 10Andrew Bogott) [21:54:19] (03PS2) 10CDanis: dbctl: diffs: recurse into complicated sub-sections [software/conftool] - 10https://gerrit.wikimedia.org/r/597631 (https://phabricator.wikimedia.org/T253025) [21:54:20] (03PS2) 10CDanis: dbctl: diffs: cleanup return value accumulation [software/conftool] - 10https://gerrit.wikimedia.org/r/597634 [21:56:02] (03CR) 10CDanis: [C: 03+1] admin: Remove mnoor from data.yaml, account replaced by mshaver. [puppet] - 10https://gerrit.wikimedia.org/r/597618 (https://phabricator.wikimedia.org/T250430) (owner: 10RLazarus) [21:56:25] (03CR) 10RLazarus: [C: 03+2] admin: Remove mnoor from data.yaml, account replaced by mshaver. 
[puppet] - 10https://gerrit.wikimedia.org/r/597618 (https://phabricator.wikimedia.org/T250430) (owner: 10RLazarus) [21:58:32] (03PS1) 10Andrew Bogott: profile::openldap_clouddev: further attempt to get the sync_pass from hiera [puppet] - 10https://gerrit.wikimedia.org/r/597638 [21:59:01] (03PS16) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [21:59:45] (03CR) 10Andrew Bogott: [C: 03+2] profile::openldap_clouddev: further attempt to get the sync_pass from hiera [puppet] - 10https://gerrit.wikimedia.org/r/597638 (owner: 10Andrew Bogott) [22:00:03] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP/NDA Access Request for mshaver - https://phabricator.wikimedia.org/T250430 (10RLazarus) 05Open→03Resolved [22:06:12] (03PS17) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [22:08:52] (03PS3) 10CDanis: dbctl: diffs: recurse into complicated sub-sections [software/conftool] - 10https://gerrit.wikimedia.org/r/597631 (https://phabricator.wikimedia.org/T253025) [22:08:54] (03PS3) 10CDanis: dbctl: diffs: cleanup return value accumulation [software/conftool] - 10https://gerrit.wikimedia.org/r/597634 [22:11:13] (03CR) 10Cwhite: "> Patch Set 4: Code-Review+1" (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 (owner: 10Cwhite) [22:11:27] (03PS18) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [22:12:05] (03PS2) 10Cwhite: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597304 [22:13:04] 10Operations, 10Traffic, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BBlack) The `kraz` case is gone now (yay!) and hasn't recurred since the ircd restart above. What's left appears to be all infrastructure stuff: PDUs, switches, firewalls, etc. I've picked up quite... 
[22:15:29] (03PS3) 10Cwhite: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597304 [22:16:08] (03PS19) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [22:18:21] (03PS20) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [22:21:35] (03PS21) 10Jbond: hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 [22:23:12] (03CR) 10jerkins-bot: [V: 04-1] hiera5: upgrade to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/566559 (owner: 10Jbond) [22:27:20] (03PS1) 10Jforrester: contint: Fix comment claiming the file is remote [puppet] - 10https://gerrit.wikimedia.org/r/597643 [22:27:22] (03PS1) 10Jforrester: contint: Drop the slave_scripts manifest, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/597644 (https://phabricator.wikimedia.org/T252955) [22:30:29] (03CR) 10Hashar: [C: 03+1] "That indeed got migrated to puppet and in my 2016 patch I did not update the comment: https://gerrit.wikimedia.org/r/#/c/operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/597643 (owner: 10Jforrester) [22:33:11] (03CR) 10Hashar: [C: 03+1] "We would then need to delete the checkout from contint1001 / contint2001." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597644 (https://phabricator.wikimedia.org/T252955) (owner: 10Jforrester) [22:50:14] (03PS1) 10Wolfgang Kandek: Documentation for locust setup on AWS. Locust is a stress testing program. [software/locust] - 10https://gerrit.wikimedia.org/r/597646 [22:55:53] (03PS1) 10BrandonXLF: Drop enwiki mobile main page special casing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597647 (https://phabricator.wikimedia.org/T253268) [23:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Evening SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200520T2300). 
[23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:04:06] (03PS1) 10Wolfgang Kandek: Documentation for locust setup on AWS. Locust is a stress testing program. [software/locust] - 10https://gerrit.wikimedia.org/r/597649 [23:05:06] (03PS4) 10Cwhite: add loki 1.5.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) [23:12:26] (03Abandoned) 10Krinkle: contint: Remove mention of unused global agent script [puppet] - 10https://gerrit.wikimedia.org/r/596833 (https://phabricator.wikimedia.org/T252955) (owner: 10Krinkle) [23:12:35] (03CR) 10Krinkle: [C: 03+1] contint: Drop the slave_scripts manifest, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/597644 (https://phabricator.wikimedia.org/T252955) (owner: 10Jforrester) [23:49:07] (03PS1) 10Jeena Huneidi: [WIP] Automate deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/597653 (https://phabricator.wikimedia.org/T253264) [23:49:16] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10wiki_willy) Hi @faidon - one of the goals we have this quarter is to resolve all backlogged install tasks from q3 and earlier by end of June. With the limited nu... [23:56:10] (03PS1) 10Ori.livneh: wall-clock excimer profiling for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160)