[00:04:18] (03PS1) 10DannyS712: Enwiki config: Grant template editors `editcontentmodel` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 (https://phabricator.wikimedia.org/T253081) [00:05:14] (03PS2) 10DannyS712: Enwiki config: Grant template editors `editcontentmodel` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 (https://phabricator.wikimedia.org/T253081) [00:12:53] (03CR) 10Herron: "> see https://gerrit.wikimedia.org/r/c/operations/puppet/+/596787 for" [puppet] - 10https://gerrit.wikimedia.org/r/597165 (owner: 10Herron) [00:43:53] 10Operations, 10LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (10dcipoletti) [00:44:27] 10Operations, 10Research: Add Git LFS support for research/wikiworkshop - https://phabricator.wikimedia.org/T252956 (10bmansurov) @leila that sounds good. @Reedy there maybe many reasons. Files being changed is not a requirement to store them in Git. And I'm not proposing we store them in Git, I am asking for... [01:03:04] (03CR) 10Bmansurov: "@Alexandros, thanks for the review. I'll take a look at the issue you found." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [02:20:14] (03PS1) 10Dave Pifke: Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) [02:21:18] (03CR) 10jerkins-bot: [V: 04-1] Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [02:27:15] (03PS2) 10Dave Pifke: Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) [02:28:18] (03CR) 10jerkins-bot: [V: 04-1] Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: 10Dave Pifke) [02:36:24] (03PS3) 10Dave Pifke: Add check_prometheus rules for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) [03:28:55] !log volker-e@deploy1001 Started deploy [design/style-guide@4b4bc51]: Deploy design/style-guide: [03:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:29:03] !log volker-e@deploy1001 Finished deploy [design/style-guide@4b4bc51]: Deploy design/style-guide: (duration: 00m 07s) [03:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:51] (03PS1) 10Marostegui: Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/597183 [04:26:52] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/597183 (owner: 10Marostegui) [04:27:52] !log Repool labsdb1011 T249188 [04:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:55] T249188: Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 [04:46:04] (03PS1) 10Marostegui: Revert 
"Revert "dbproxy: Depool labsdb1011"" [puppet] - 10https://gerrit.wikimedia.org/r/597184 [04:46:41] (03CR) 10Marostegui: [C: 03+2] Revert "Revert "dbproxy: Depool labsdb1011"" [puppet] - 10https://gerrit.wikimedia.org/r/597184 (owner: 10Marostegui) [05:00:04] marostegui: It is that lovely time of the day again! You are hereby commanded to deploy s2 and s8 primary database master restart. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T0500). [05:00:12] jynus: let's start? [05:00:24] ok [05:00:34] will you do both at the same time? [05:00:37] yep [05:00:39] ok [05:00:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 and s8 as read-only for maintenance T251981', diff saved to https://phabricator.wikimedia.org/P11226 and previous config saved to /var/cache/conftool/dbconfig/20200519-050043-marostegui.json [05:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:47] T251981: Upgrade and restart s2 and s8 (wikidatawiki) primary database masters: Tue 19th May - https://phabricator.wikimedia.org/T251981 [05:01:00] RO confirmed [05:01:16] "Warning: The database has been locked for maintenance" [05:01:20] dropping wb_terms and retsarting mysql [05:03:02] s2 done, waiting for s8 [05:03:12] surprisingly, enwiktionary, not wikidata the one with most errors [05:03:20] they are gone now [05:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 and s8 as read-only=off for maintenance T251981', diff saved to https://phabricator.wikimedia.org/P11227 and previous config saved to /var/cache/conftool/dbconfig/20200519-050346-marostegui.json [05:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:49] RO off [05:04:15] I can edit [05:04:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:04:44] I can see my edit on zh [05:05:01] I can see recentchanges going fine on wikidata [05:05:14] also on wd [05:06:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:06:39] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [05:07:09] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [05:17:08] 10Operations, 10DBA, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10mmodell) Works for me. We could also do {T146055} at the same time. [05:29:14] 10Operations, 10DBA, 10Phabricator: replace phabricator db passwords with longer passwords - https://phabricator.wikimedia.org/T250361 (10jcrespo) @mmodell could we schedule a specific date for this, so it is not forgotten? How much time do you need to prepare T146055? Work on our side is not too time consum... [05:31:08] 10Operations, 10Core Platform Team, 10MediaWiki-General, 10serviceops, and 2 others: Revisit timeouts, concurrency limits in remote HTTP calls from MediaWiki - https://phabricator.wikimedia.org/T245170 (10tstarling) Plan: * Of the pseudo-libraries, EtcdConfig and RESTBagOStuff have short default timeouts a... 
[05:36:16] (03CR) 10Jcrespo: "> > The idea would be that, "if you run transfer.py --verbose" you" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [05:57:15] !log volker-e@deploy1001 Started deploy [design/style-guide@7bfbd2a]: Deploy design/style-guide: [05:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:21] !log volker-e@deploy1001 Finished deploy [design/style-guide@7bfbd2a]: Deploy design/style-guide: (duration: 00m 06s) [05:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:37] (03CR) 10Jcrespo: "A nitpick." (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [06:17:50] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [06:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:35] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [06:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:02] !log elukey@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper [06:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:31] the above two are the druid clusters --^ [06:35:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [06:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:38] (03PS6) 10Jcrespo: backups: Add backup1002 as the eqiad host for ES db backups [puppet] - 10https://gerrit.wikimedia.org/r/596255 (https://phabricator.wikimedia.org/T79922) [06:42:14] (03CR) 10Jcrespo: [C: 03+2] backups: Add backup1002 as the eqiad host for ES db backups [puppet] - 10https://gerrit.wikimedia.org/r/596255 (https://phabricator.wikimedia.org/T79922) (owner: 
10Jcrespo) [06:54:22] 10Operations, 10Research: Add Git LFS support for research/wikiworkshop - https://phabricator.wikimedia.org/T252956 (10Dzahn) I think both Commons and Youtube, depending on licenses, would be good places to store the actual video. The wikiworkshop site could then link to them or embed them directly. [07:09:16] !log starting es4 & es5 eqiad backups with low concurrency [07:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:47] feel free to tell me if you see any misbehaviour on mediawiki content requests or xmldumps [07:12:06] 10Operations, 10Traffic: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (10ema) [07:13:15] 10Operations, 10Traffic: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (10ema) p:05Triage→03Medium [07:13:32] (03PS3) 10Vgutierrez: ATS: Stop handling KA and WS on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/596445 [07:16:42] (03CR) 10Vgutierrez: "(updated) pcc is happy: https://puppet-compiler.wmflabs.org/compiler1001/22585/" [puppet] - 10https://gerrit.wikimedia.org/r/596445 (owner: 10Vgutierrez) [07:21:57] (03PS1) 10Marostegui: report_users: Whitelist mariadb.sys [software] - 10https://gerrit.wikimedia.org/r/597216 [07:23:55] (03CR) 10Marostegui: [C: 03+2] report_users: Whitelist mariadb.sys [software] - 10https://gerrit.wikimedia.org/r/597216 (owner: 10Marostegui) [07:24:16] (03Merged) 10jenkins-bot: report_users: Whitelist mariadb.sys [software] - 10https://gerrit.wikimedia.org/r/597216 (owner: 10Marostegui) [07:26:22] (03PS1) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [07:29:14] (03PS2) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [07:29:17] 10Operations, 10ops-eqiad, 10netops: 
Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10ayounsi) 05Open→03Stalled p:05Triage→03Low a:05Cmjohnson→03None Sounds good! This will have to wait for a time we for example do T196487. Outside of COVID t... [07:33:25] (03CR) 10Elukey: "Hello people, this is a proposal to have a single profile to include everywhere and set accordingly. It should avoid duplicate code/declar" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [07:36:17] (03PS4) 10Jon Harald Søby: Initial config for shnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597132 (https://phabricator.wikimedia.org/T253029) [07:37:19] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22586/" [puppet] - 10https://gerrit.wikimedia.org/r/597090 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [07:42:34] (03CR) 10Dzahn: "on contint2001: Package[default-jre-headless]/ensure: created" [puppet] - 10https://gerrit.wikimedia.org/r/597090 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar) [07:43:43] (03CR) 10Dzahn: [C: 03+1] "compiler output looks good, change lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/596201 (owner: 10Muehlenhoff) [07:44:05] (03CR) 10Ema: [C: 03+1] ATS: Stop handling KA and WS on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/596445 (owner: 10Vgutierrez) [07:44:18] mutante: and I once again forgot the puppet compiler :/ [07:44:49] hashar: no worries, i don't see any issues. do you? 
[07:44:58] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: bump object replicator concurrency for decom'ing hosts [puppet] - 10https://gerrit.wikimedia.org/r/597003 (https://phabricator.wikimedia.org/T252008) (owner: 10Filippo Giunchedi) [07:45:03] !log rolling upgrade to trafficserver 8.0.7-1wm10 with puppet disabled on cp hosts [07:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:25] mutante: checking [07:46:50] I went to drop the jenkins::commons classes as you proposed last week [07:47:02] (03CR) 10Vgutierrez: [C: 03+2] ATS: Stop handling KA and WS on tls.lua [puppet] - 10https://gerrit.wikimedia.org/r/596445 (owner: 10Vgutierrez) [07:47:05] yep, i saw that and liked it! [07:47:47] my previous patch has been done in several separate parts .. more or less [07:48:02] i also checked releases* [07:48:20] yeah that is similar to your large patch [07:48:26] the jenkins exec start commandline changed of course.. but that is just 2 symlinks away from before [07:48:32] but I also went cleaning up a bunch of confusing/legacy stuff that is no longer used [07:48:49] (03CR) 10Giuseppe Lavagetto: docker build: update the build process to us docker (035 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (owner: 10Jbond) [07:48:53] yep, much appreciated to delete the old roles [07:48:55] so the idea is to have /usr/bin/java use whatever the OS ships [07:49:07] and explicitly pin the master to java8 [07:49:14] !log Push 596597: BGP: standardize fixed part of IX4/IX6 groups - ulsfo [07:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:25] /usr/bin/java: symbolic link to /etc/alternatives/java [07:49:36] /etc/alternatives/java: symbolic link to /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java [07:49:40] I thought about having the releases Jenkins masters to use java 11 and the CI ones to use java 8 ... 
but I have deemed that to be too complicated [07:49:43] just to confirm it is all the same ^ [07:50:09] i agree. keeping it simple +1 [07:50:40] looks good [07:50:47] hashar: this is coincidence but also see https://gerrit.wikimedia.org/r/597219 from today [07:51:00] maybe for later :) [07:52:01] OH [07:52:32] so yeah elukey patch is exactly what I had in mind [07:52:35] !log Push 596597: BGP: standardize fixed part of IX4/IX6 groups - *dfw [07:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:44] pick the java version with a profile and apply that at the role level [07:52:52] yea, but i saw that a minute before merging yours and that still needs reviews etc [07:52:55] so that the ci::master role would have the java8 profile and java11 profile [07:52:55] didn't want to slow it down [07:53:10] so maybe in the future we will switch to it together with other services [07:53:11] and then we would pass to profile::jenkins $java = 8 [07:53:13] something like that [07:53:30] note that for now the default is it deploys 11 to buster [07:53:38] yea [07:54:14] ah didn't know that you needed it as well, feel free to chime in in the code review! [07:54:19] I haven't found out what was wrong with jenkins / java 11. But I am not literate in java so that does not help [07:54:22] BUT [07:54:43] I have found out the Gearman jenkins plugin got forked at some point and received a lot of updates :] [07:54:54] !log volker-e@deploy1001 Started deploy [design/style-guide@37c67dd]: Deploy design/style-guide: [07:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:56] elukey: yeah I am giving feedback right now [07:55:00] !log volker-e@deploy1001 Finished deploy [design/style-guide@37c67dd]: Deploy design/style-guide: (duration: 00m 06s) [07:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:34] is there also the "jdk vs jre" question in there? 
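Note on the symlink check quoted at 07:49: the two "symbolic link to" lines show /usr/bin/java resolving through /etc/alternatives/java to the Java 8 binary. A minimal Python sketch of walking such a chain one hop at a time follows. The helper name symlink_chain and the throwaway file names in demo() are illustrative assumptions, not something used in the channel, and the demo assumes a POSIX host where creating symlinks is permitted.

```python
import os
import tempfile


def symlink_chain(path):
    """Return every hop from path to its final target, following one symlink at a time."""
    chain = [path]
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        # A relative link target is resolved against the directory holding the link.
        if not os.path.isabs(target):
            target = os.path.join(os.path.dirname(path), target)
        path = os.path.normpath(target)
        if path in seen:
            raise OSError(f"symlink loop at {path}")
        seen.add(path)
        chain.append(path)
    return chain


def demo():
    # Hypothetical layout mimicking the two hops quoted in the log:
    # usr-bin-java -> alternatives-java -> jvm-java-8-bin-java
    with tempfile.TemporaryDirectory() as root:
        real = os.path.join(root, "jvm-java-8-bin-java")
        alt = os.path.join(root, "alternatives-java")
        usr = os.path.join(root, "usr-bin-java")
        open(real, "w").close()
        os.symlink(real, alt)
        os.symlink(alt, usr)
        return [os.path.basename(p) for p in symlink_chain(usr)]
```

On a real host, symlink_chain("/usr/bin/java") would list the same hops that the file output in the log shows, which is a quick way to confirm which JVM a service started via /usr/bin/java ends up running.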
[07:56:16] hashar: one sec that I am updating it, just realized a bug [07:56:36] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T237575 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:57:16] mutante: there is not, the profile assumes that you want to deploy the jdk [07:57:28] it is basically what most of the code that I have seen does [07:57:36] (03CR) 10Giuseppe Lavagetto: "Can we just rename the branch to "oldmaster" or something like that?" [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596778 (owner: 10Jbond) [07:58:02] jre => true [07:58:05] headless => true [07:58:05] ;) [07:58:28] I am not sure what those should mean hashar [07:58:32] deploy only headless? [07:58:47] I will write down the use case I have :] [07:59:10] please also keep in mind that we shouldn't create an overly complicated profile with 1000 tunables :) [07:59:16] ensure_packages('default-jre-headless') [07:59:22] (03PS3) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [07:59:33] oh 1000 tunables is not an issue really [07:59:36] (03CR) 10Muehlenhoff: "I like this approach! 
This can also become the central place where classes opt into our hardened java.security policy (separate patch, lat" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [07:59:50] next thing is we can just create a YAML DSL on top of it to abstract out the complexity [07:59:55] with a REST micro service [07:59:59] * hashar pauses [08:00:34] ahahahah [08:00:50] okok please add comments, I'll try to include everything [08:01:19] !log Push 596597: BGP: standardize fixed part of IX4/IX6 groups - eqiad [08:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "It's not clear to me, by just looking at the code, how we'll be shipping the wangle/folly/fbthrift libraries we build separately." [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (owner: 10Jbond) [08:04:11] !log Push 596597: BGP: standardize fixed part of IX4/IX6 groups - esams [08:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:25] (03CR) 10Ayounsi: [C: 03+2] BGP: standardize fixed part of IX4/IX6 groups [homer/public] - 10https://gerrit.wikimedia.org/r/596597 (owner: 10Ayounsi) [08:05:44] (03Merged) 10jenkins-bot: BGP: standardize fixed part of IX4/IX6 groups [homer/public] - 10https://gerrit.wikimedia.org/r/596597 (owner: 10Ayounsi) [08:06:55] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] tiller: Upgrade to v2.16.7 on buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597067 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [08:07:20] !log Push 596597: BGP: standardize fixed part of IX4/IX6 groups - eqsin [08:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:12] (03PS1) 10Kormat: mariadb: Remove db2136 from spare role. 
[puppet] - 10https://gerrit.wikimedia.org/r/597224 (https://phabricator.wikimedia.org/T252985) [08:10:25] marostegui: ^ [08:10:33] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove db2136 from spare role. [puppet] - 10https://gerrit.wikimedia.org/r/597224 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [08:11:52] 10Operations, 10Puppet, 10serviceops: delete the puppet module "apache" - https://phabricator.wikimedia.org/T252190 (10Dzahn) 05Open→03Resolved This has happened. The module is gone now. [08:15:00] (03CR) 10Kormat: [C: 03+2] mariadb: Remove db2136 from spare role. [puppet] - 10https://gerrit.wikimedia.org/r/597224 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [08:16:13] (03PS2) 10Dzahn: simplelamp2: do not purge unmanaged config files [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) [08:18:13] (03PS3) 10Dzahn: simplelamp2: do not purge unmanaged config files [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) [08:19:15] (03CR) 10jerkins-bot: [V: 04-1] simplelamp2: do not purge unmanaged config files [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) (owner: 10Dzahn) [08:20:48] (03PS4) 10Dzahn: simplelamp2: do not purge unmanaged config files [puppet] - 10https://gerrit.wikimedia.org/r/597052 (https://phabricator.wikimedia.org/T169368) [08:26:01] (03CR) 10Hashar: "I have a similar use cases for the CI Jenkins. We attempted to migrate it to Buster/java 11 but one of the key Jenkins plugin we use ends " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [08:27:17] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10Dzahn) Is the requested hostname "homer" a copy/paste error? Does this need to be in eqiad or would codfw work just as well? 
[08:27:33] (03PS1) 10Ema: varnishlog: exit if process terminates [puppet] - 10https://gerrit.wikimedia.org/r/597225 (https://phabricator.wikimedia.org/T253093) [08:28:51] hashar: one more to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/595525 [08:31:10] oh [08:31:29] (03CR) 10Vgutierrez: [C: 03+1] varnishlog: exit if process terminates [puppet] - 10https://gerrit.wikimedia.org/r/597225 (https://phabricator.wikimedia.org/T253093) (owner: 10Ema) [08:37:23] (03CR) 10Ema: [C: 03+2] varnishlog: exit if process terminates [puppet] - 10https://gerrit.wikimedia.org/r/597225 (https://phabricator.wikimedia.org/T253093) (owner: 10Ema) [08:39:52] (03CR) 10Hashar: [C: 04-1] "Definitely. The docroot git repository is used for both:" [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:40:20] (03CR) 10Hashar: [C: 04-1] contint: fix git cloning of docroot for integration.wm.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:40:39] mutante: so it is more complicated. 
integration/docroot is cloned at /srv/ :-\ [08:40:49] gotta move stuff under /srv/docroot [08:44:28] (03CR) 10Hashar: [C: 04-1] "The patch we did for doc.wikimedia.org docroot: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/480881/" [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [08:45:21] (03PS1) 10Muehlenhoff: Make CAS deployment via a deb toggleable [puppet] - 10https://gerrit.wikimedia.org/r/597228 [08:46:25] (03PS7) 10RhinosF1: Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) [08:46:27] (03CR) 10jerkins-bot: [V: 04-1] Make CAS deployment via a deb toggleable [puppet] - 10https://gerrit.wikimedia.org/r/597228 (owner: 10Muehlenhoff) [08:50:35] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:50:37] 10Operations, 10Traffic, 10Patch-For-Review: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (10ema) 05Open→03Resolved a:03ema Checking `poll()` and exiting when it returns something else than `None` seems to have fixed things. Tested... [08:57:39] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui Known https://wikitech.wikimedia.org/wiki/HAProxy [08:58:32] (03PS2) 10Muehlenhoff: Make CAS deployment via a deb toggleable [puppet] - 10https://gerrit.wikimedia.org/r/597228 (https://phabricator.wikimedia.org/T233947) [08:59:47] hashar: hmm. ack. thanks for the comments. 
back to a task from 2016, heh [08:59:50] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:01:39] (03PS3) 10Giuseppe Lavagetto: Make configuration of envoy a ConfigMap [deployment-charts] - 10https://gerrit.wikimedia.org/r/582777 (https://phabricator.wikimedia.org/T244843) [09:03:36] (03CR) 10Dzahn: contint: fix git cloning of docroot for integration.wm.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595525 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [09:05:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Make configuration of envoy a ConfigMap [deployment-charts] - 10https://gerrit.wikimedia.org/r/582777 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:05:45] (03Merged) 10jenkins-bot: Make configuration of envoy a ConfigMap [deployment-charts] - 10https://gerrit.wikimedia.org/r/582777 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [09:08:49] mutante: I can dig into that rabbit hole as well after lunch :] [09:09:17] it is probably all about moving files tracked by git from /srv/ to /srv/docroot + update the apache vhost + backup definition [09:09:46] yea.. hmm.. i don't like /srv/docroot/org/wikimedia i just want to use /srv/org/wikimedia/ like on other hosts [09:09:59] and instead fix the contents of the repo if possible [09:10:06] but that caused other troubles [09:10:14] cause /srv/ is also used for other stuff [09:10:25] if the repo is supposed to have the files of the document root.. 
why does it also have other stuff in it [09:10:26] !log eqiad-prod: decom ms-be101[678] - T252008 [09:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:30] T252008: Decom ms-be101[678] - https://phabricator.wikimedia.org/T252008 [09:10:56] cause /srv is also used for the zuul git repositories ( /srv/zuul/git) or the jenkins build artifacts ( /srv/jenkins/ ) [09:11:27] but why should files for zuul that are not in a docroot be in a repo called ..docroot [09:11:39] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:12:02] they will not? [09:12:04] if that repo really just had the files for the docroot.. we could just git clone into the location [09:12:18] what I mean is that integration/docroot is cloned at /srv [09:12:36] and the files tracked by that git repository should be moved under /srv/docroot [09:12:49] i don't think they should be moved to /srv/docroot. [09:13:01] but other files directly under /srv that are NOT tracked by git should stay where they are (eg /srv/jenkins or /srv/zuul ) [09:13:59] if that repo actually just had the files for the docroot.. in the root of the repo [09:14:02] all that would be solved [09:14:04] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [09:14:29] and my patch would work as it currently is [09:14:36] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/22588/" [puppet] - 10https://gerrit.wikimedia.org/r/597228 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [09:14:38] what do you mean? 
[09:15:28] there is shared code outside of /srv/org/wikimedia/integration [09:15:32] go into the repo called 'docroot' and mv org/wikimedia/integration/* to the root of the repo [09:15:38] under /srv/shared/ [09:15:45] get rid of the other stuff that is not supposed to be in the docroot [09:15:57] so files under /srv/org/wikimedia/integration require material that is in /srv/shared/ [09:16:15] (actually that uses relative path so something like ../../../shared/) [09:16:37] what is the point of doing that if the sites are not shared anymore [09:16:54] they are on different machines [09:17:05] the code is [09:17:42] it is a little web framework to generate both doc.wm.o and integration.wm.o [09:17:50] if i have a repo called "docroot" then i expect it to contain the files needed in the docroot and not have some extra stuff in it while other stuff is elsewhere [09:18:11] ah [09:18:19] yes so that is a different problem [09:18:31] (03PS1) 10JMeybohm: tls_helper: qoute idle_timeout default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/597229 (https://phabricator.wikimedia.org/T244843) [09:18:45] which is that all the CI generated material ends up being published under the integration/docroot.git checkout [09:18:49] which indeed is problematic [09:18:59] I guess we can have those published outside of the docroot tree [09:19:05] and use Apache to expose them [09:19:27] so that eg the coverage reports that are currently under the integration/docroot.git checkout at /srv/org/wikimedia/integration/cover/ [09:19:34] would instead be at /srv/cover [09:19:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tls_helper: qoute idle_timeout default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/597229 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [09:19:51] and some apache conf would make https://integration.wikimedia.org/cover/ point at /srv/cover/ [09:19:52] hmm [09:19:53] yeah [09:20:08] (03Merged) 10jenkins-bot: tls_helper: qoute 
idle_timeout default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/597229 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [09:20:10] (03CR) 10MarcoAurelio: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 (https://phabricator.wikimedia.org/T253081) (owner: 10DannyS712) [09:20:41] mutante: point taken :] [09:20:52] hashar: yea, that sounds better to me. more separate repos and separate apache sites [09:21:05] (03CR) 10MarcoAurelio: [C: 03+1] Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) (owner: 10RhinosF1) [09:21:25] so that is two tasks: first move the git repo and all the mess it has from /srv/ to /srv/docroot [09:21:39] then move the untracked material published by CI outside of the docroot to some other place [09:21:52] I will think about it during lunch break [09:22:04] alright, thanks! [09:22:44] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10faidon) Are there any updates to this task and any particular reasons it's been held up? While this was never super urgent, we're now at the ~one year mark since... 
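Note on the docroot split discussed between 09:18 and 09:21: the proposal is to keep CI-published material such as coverage reports outside the git-tracked docroot (for example at /srv/cover) and let the web server map https://integration.wikimedia.org/cover/ onto it. The Python sketch below only illustrates that URL-prefix-to-directory mapping and is not Apache configuration. The ALIASES table, the resolve helper, and the DOCROOT value are assumptions made for illustration; the channel did not settle on a final docroot path.

```python
import posixpath

# Assumed layout, for illustration only: the git-tracked site lives under
# DOCROOT, while CI-published coverage reports live outside it in /srv/cover.
DOCROOT = "/srv/docroot/org/wikimedia/integration"
ALIASES = [("/cover", "/srv/cover")]


def resolve(url_path, aliases=ALIASES, docroot=DOCROOT):
    """Map a request path to a filesystem path, checking alias prefixes first."""
    # Normalise so that '..' segments cannot hop from one tree into another.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    for prefix, target in aliases:
        # Match the prefix only on a path-segment boundary.
        if clean == prefix or clean.startswith(prefix + "/"):
            return target + clean[len(prefix):]
    return docroot + clean
```

With this table, a request for /cover/mediawiki/index.html is served from /srv/cover, while /coverage.txt still falls through to the docroot because prefixes match whole path segments only.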
[09:22:59] mutante: danke schon :] [09:24:51] (03CR) 10MarcoAurelio: [C: 03+1] Group CheckUser rights together in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594904 (owner: 10Tchanders) [09:27:59] (03PS1) 10JMeybohm: mathoid: switch to common_templates v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597230 (https://phabricator.wikimedia.org/T235411) [09:28:18] (03CR) 10jerkins-bot: [V: 04-1] mathoid: switch to common_templates v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597230 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [09:40:56] (03PS1) 10JMeybohm: tls_helper: fix typo in template reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597240 (https://phabricator.wikimedia.org/T244843) [09:44:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] tls_helper: fix typo in template reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597240 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [09:44:42] (03Merged) 10jenkins-bot: tls_helper: fix typo in template reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597240 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [09:45:19] (03PS1) 10Kormat: mariadb: Use db1141 as test for labsdb1011 issues. [puppet] - 10https://gerrit.wikimedia.org/r/597241 (https://phabricator.wikimedia.org/T249188) [09:45:29] jynus: ^ [09:47:06] will check now [09:47:42] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [09:48:49] (03PS2) 10JMeybohm: mathoid: switch to common_templates v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597230 (https://phabricator.wikimedia.org/T235411) [09:49:07] (03CR) 10Jcrespo: [C: 04-1] "Things may break here due to the special that is the role, but it should work good enough to test it." 
[puppet] - 10https://gerrit.wikimedia.org/r/597241 (https://phabricator.wikimedia.org/T249188) (owner: 10Kormat) [09:49:43] (03PS1) 10Giuseppe Lavagetto: appservers: switch to envoy in all of codfw [puppet] - 10https://gerrit.wikimedia.org/r/597242 (https://phabricator.wikimedia.org/T247389) [09:49:45] (03PS1) 10Giuseppe Lavagetto: appservers: convert mw1265-1275 to use envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/597243 (https://phabricator.wikimedia.org/T247389) [09:49:47] (03PS1) 10Giuseppe Lavagetto: appserver: use envoy everywhere [puppet] - 10https://gerrit.wikimedia.org/r/597244 (https://phabricator.wikimedia.org/T247389) [09:50:03] (03CR) 10Jcrespo: [C: 03+1] mariadb: Use db1141 as test for labsdb1011 issues. [puppet] - 10https://gerrit.wikimedia.org/r/597241 (https://phabricator.wikimedia.org/T249188) (owner: 10Kormat) [09:50:14] (03CR) 10Kormat: [C: 03+2] mariadb: Use db1141 as test for labsdb1011 issues. [puppet] - 10https://gerrit.wikimedia.org/r/597241 (https://phabricator.wikimedia.org/T249188) (owner: 10Kormat) [09:50:20] (03CR) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [09:52:13] (03PS1) 10Filippo Giunchedi: profile: add thanos::httpd to proxy thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597245 (https://phabricator.wikimedia.org/T233956) [09:52:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mathoid: switch to common_templates v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597230 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [09:52:35] (03CR) 10JMeybohm: [C: 03+2] mathoid: switch to common_templates v0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597230 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [09:53:15] (03CR) 10jerkins-bot: [V: 04-1] profile: add thanos::httpd to proxy thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597245 
(https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:54:24] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [09:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:10] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' . [09:55:10] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' . [09:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:44] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' . [09:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:07] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart [09:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:20] !log roll-restart of eqiad restbase hosts for java security updates [09:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:45] (03PS2) 10JMeybohm: mathoid: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/597032 (https://phabricator.wikimedia.org/T235411) [10:00:01] (03PS2) 10Filippo Giunchedi: profile: add thanos::httpd to proxy thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597245 (https://phabricator.wikimedia.org/T233956) [10:02:04] (03PS1) 10Alexandros Kosiaris: Bump to 1.8.4 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/597248 [10:03:58] (03CR) 10Filippo Giunchedi: "PCC as expected https://puppet-compiler.wmflabs.org/compiler1003/22590/thanos-fe2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/597245 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:04:57] (03CR) 10Giuseppe 
Lavagetto: [C: 03+1] mathoid: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/597032 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [10:10:38] (03PS1) 10JMeybohm: Remove verbose log statement for symlinks [debs/helm] - 10https://gerrit.wikimedia.org/r/597250 [10:12:56] (03CR) 10JMeybohm: [C: 03+2] mathoid: enable TLS with chart defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/597032 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [10:13:00] !log upgrade etherpad-lite to 1.8.4 on etherpad1002 [10:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:16] I'm about to restore a Netbox dump in a few minutes, there will be a very short window of unavailability of Netbox [10:14:33] this time around etherpad upgrade has gone fine. we also have a new skin! [10:14:52] super fancy! [10:16:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/597228 (https://phabricator.wikimedia.org/T233947) (owner: 10Muehlenhoff) [10:17:01] i see there is a file modules/stdlib/spec/acceptance/unsupported_spec.rb which includes "it 'should fail" and references class { 'mysql::server': } to check for unsupported platforms. would it fail to fail if i delete the mysql module ? [10:18:20] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [10:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:19:23] that's under stdlib? it shouldn't I guess? 
[10:19:25] <_joe_> let me see what's going on [10:19:44] <_joe_> wow we're in the middle of a spike of errors [10:19:54] mutante: it's an acceptance test, no way we run this during catalog compilations [10:20:08] the most that can fail is CI for stdlib, which we don't run IIRC [10:20:10] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:20:45] <_joe_> eventgate-main was unresponsive I would say [10:21:16] <_joe_> 90 errors trying to submit jobs [10:21:21] <_joe_> sigh [10:21:28] it was asked to handle almost 2.5 times the normal load https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=54&fullscreen&orgId=1&refresh=1m [10:21:56] <_joe_> akosiaris: look at 24 hours [10:21:58] we can add a bit more capacity to make events like this less painful, but what happened? [10:22:00] <_joe_> that's pretty common [10:22:10] <_joe_> that's elasticawrite, it runs periodically [10:22:12] ah, indeed, I take that back [10:24:49] <_joe_> and indeed, https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=70&fullscreen&orgId=1&refresh=1m [10:24:59] <_joe_> we hit our timeout in some cases it would seem [10:25:53] <_joe_> timeout: "25s" [10:25:59] <_joe_> so possibly no [10:26:55] <_joe_> all the errors are marked "503 URX" in the logs. What will that mean? 
[10:27:38] (03CR) 10Elukey: [C: 04-1] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [10:27:47] <_joe_> "rejected because of upstream retry limit or maximum connection attempts reached" [10:28:06] PROBLEM - jenkins_service_running on releases2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [10:29:13] !log start Netbox restore - T253091 [10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:17] T253091: Restore POPs server interfaces and cables - https://phabricator.wikimedia.org/T253091 [10:31:00] 10Operations, 10Discovery, 10Traffic: Search autocompletion broken for recent articles (after April 30?) for some users / browsers - https://phabricator.wikimedia.org/T253114 (10Tgr) [10:32:44] !log flushed all Netbox caches (manage.py invalidate all) - T253091 [10:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:19] !log releases2001 - Failed to restart jenkins.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files [10:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:47] 10Operations, 10Discovery, 10Traffic: Search autocompletion broken for recent articles (after April 30?) for some users / browsers - https://phabricator.wikimedia.org/T253114 (10Tgr) [10:34:15] !log releases2001 - restarted failed jenkins [10:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:34] (03PS1) 10Dzahn: jenkins/icinga: fix process monitoring after change in command line [puppet] - 10https://gerrit.wikimedia.org/r/597255 (https://phabricator.wikimedia.org/T224591) [10:40:50] _joe_: akosiaris: (As always ;)) trying to follow your tracks. I started by looking at https://logstash-next.wikimedia.org/app/kibana#/dashboard/mediawiki-errors filtering for exceptions. 
Top error there are "Could not enqueue jobs: Unable to deliver all events: 503: Service Unavailable". This then leads to eventgate-main (because one knows that JobQueueEventBus submits to it)? [10:41:13] <_joe_> yes [10:41:41] <_joe_> eventbus is the name of the extension of mediawiki that sends data to kafka via a rest gateway [10:41:58] (03CR) 10Jbond: "had stared reviewing this and ended up reviewing some of the ACL's, no issue with this CR however have left my comments, feel free to igno" (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/596597 (owner: 10Ayounsi) [10:41:58] <_joe_> it was called eventbus, and was renamed eventgate-main [10:42:17] Did you go the same route _joe_? Cuz your speed on that was quite impressive (I did not even managed to load kibana since then :-P) [10:42:20] <_joe_> eventgate-main is for internal messages, so jobs, events, etc [10:42:35] <_joe_> jayme: yeah it's just practice [10:43:47] Sure. I meant kibana was literally not loaded when you wrote "eventgate-main was unresponsive I would say" :) [10:43:53] <_joe_> hehe [10:44:22] <_joe_> yeah you were the second concurrent user, things were slowing down [10:44:28] hrhr [10:44:42] (03CR) 10Dzahn: [C: 03+2] "root@releases2001:/etc/nagios/nrpe.d# /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '.*/bin/java .*-jar /usr/sha" [puppet] - 10https://gerrit.wikimedia.org/r/597255 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn) [10:45:00] (03PS2) 10Dzahn: jenkins/icinga: fix process monitoring after change in command line [puppet] - 10https://gerrit.wikimedia.org/r/597255 (https://phabricator.wikimedia.org/T224591) [10:45:07] so kibana probably has some practice in preferring your connections as well ;) [10:46:41] <_joe_> ehe possibly [10:46:58] <_joe_> akosiaris: when did you swap our dear ole etherpad for that hipster horror? [10:47:35] <_joe_> jokes aside, I find it more readable [10:49:04] ok, clear so far. 
But then you've quoted what I assume is an eventgate error ("503 URX"). Where did you get that from? [10:49:12] envoy logs [10:49:33] and I am guessing mediawiki's from some mw server, right? [10:49:40] <_joe_> yes [10:49:56] RECOVERY - jenkins_service_running on releases2001 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/jenkins/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [10:50:00] <_joe_> I just went to mw1331 and did "tail /var/log/envoy/eventgate-main.log" [10:50:12] <_joe_> that only reports errors, so it's pretty terse [10:50:12] jayme: the ambassador log format is explained in some detail at https://blog.getambassador.io/understanding-envoy-proxy-and-ambassador-http-access-logs-fee7802a2ec5 [10:50:16] ^ that was not actually a jenkins failure, but the monitoring check got broken due to a path change [10:50:19] ah [10:50:34] I am wondering about the X though [10:50:36] <_joe_> that's the envoy log format ftr :P [10:50:47] UR is upstream remote reset [10:50:50] what's X ? [10:51:23] <_joe_> Upstream Retry eXceeded [10:51:50] ah, so it's not additive (as in UR+X) [10:51:59] it's just some "initials" [10:52:04] <_joe_> no I don't think so [10:52:06] and not even that in this case... 
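(Note on the "UR" vs "URX" exchange above: a minimal sketch, not anything that was run in the channel. It treats each response flag in an Envoy access log as a single opaque code, so "URX" is looked up on its own rather than read as "UR" plus "X". The table is a small subset with descriptions paraphrased from the Envoy access-log documentation. The helper name `explain_flags`, the comma-separated handling of multiple flags, and "-" meaning no flags are assumptions made for illustration.)

```python
# Subset of Envoy response-flag codes; descriptions are paraphrased.
# Each key is a whole code: "URX" is its own entry, not "UR" + "X".
RESPONSE_FLAGS = {
    "UH": "no healthy upstream hosts",
    "UF": "upstream connection failure",
    "UO": "upstream overflow (circuit breaking)",
    "NR": "no route configured",
    "UR": "upstream remote reset",
    "URX": "upstream retry limit or maximum connect attempts reached",
    "UT": "upstream request timeout",
    "UC": "upstream connection termination",
}


def explain_flags(field):
    """Return (flag, description) pairs for one response-flags field.

    Hypothetical helper: assumes multiple flags are comma-separated
    and that "-" or an empty string means no flags were set.
    """
    if field in ("", "-"):
        return []
    return [(flag, RESPONSE_FLAGS.get(flag, "unknown flag"))
            for flag in field.split(",")]


if __name__ == "__main__":
    # The errors discussed above were logged as "503 URX".
    for flag, meaning in explain_flags("URX"):
        print(f"{flag}: {meaning}")
```

(Looking up whole codes rather than scanning letter by letter matches the conclusion reached in the chat that the flags are not additive.)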
[10:52:14] <_joe_> but hey, we can write matt klein and ask :P [10:53:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce support for disabling TLS [puppet] - 10https://gerrit.wikimedia.org/r/597257 (https://phabricator.wikimedia.org/T120225) [10:54:04] (03PS6) 10Dzahn: Remove mysql module from WMF [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [10:57:43] (03CR) 10Dzahn: [C: 03+2] Remove mysql module from WMF [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [10:57:56] ^ another one from the past [10:59:23] puppet compiler with C:mysql showed nothing, grepping showed nothing, openstack-browser showed nothing. all converted to mariadb [10:59:53] em [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:00:10] is it different from the last patch? [11:00:53] jynus: no, why? [11:00:54] or just rebased? 
[11:00:59] yea, rebased [11:01:00] ah, you say just rebased [11:01:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/391849/5..6 [11:01:08] yep, no [11:01:10] np [11:01:13] diff between 5 and 6 [11:01:14] I thought you were asking [11:01:20] for +1 [11:01:23] np :-D [11:01:31] ah, no, i am just pointing it out in case something was missed [11:01:35] he he [11:01:48] that is a patch 3 years in the making [11:01:53] checked multiple times across the repo [11:01:53] congrats, mutante [11:01:54] heh, yea [11:01:58] thanks jynus [11:02:15] I think I don't recognize enough the hard "mining" work you do everyday [11:02:27] :-D [11:02:35] aww :) [11:02:38] to get us to a much healther state [11:02:45] that requires a lot of dedication, so thanks [11:03:05] you are welcome, thanks for that [11:03:55] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) 05Open→03Resolved done! the module has been removed [11:04:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: mailrelay: introduce support for disabling TLS [puppet] - 10https://gerrit.wikimedia.org/r/597257 (https://phabricator.wikimedia.org/T120225) (owner: 10Arturo Borrero Gonzalez) [11:09:35] (03PS4) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [11:11:39] (03PS5) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [11:12:17] !log Deploy schema change on db2124 (frwiki, jawiki, ruwiki) T238966 [11:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:20] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [11:15:08] (03CR) 10Elukey: "Ok this new version is more elaborate, but it should take into account all the use cases." [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [11:16:00] (03PS1) 10Hnowlan: changeprop: remove changeprop configuration from scb [puppet] - 10https://gerrit.wikimedia.org/r/597258 (https://phabricator.wikimedia.org/T248677) [11:17:41] (03CR) 10Muehlenhoff: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [11:20:52] (03CR) 10Muehlenhoff: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [11:22:02] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [11:38:24] (03CR) 10Hnowlan: "pcc output: https://puppet-compiler.wmflabs.org/compiler1002/22593/" [puppet] - 10https://gerrit.wikimedia.org/r/597258 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [11:39:07] (03CR) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [11:41:48] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool es1018, es1015, es1019', diff saved to https://phabricator.wikimedia.org/P11232 and previous config saved to /var/cache/conftool/dbconfig/20200519-114148-jynus.json [11:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:06] (03PS6) 10Elukey: profile::java: one profile to rule them all (openjdk-x versions) [puppet] - 10https://gerrit.wikimedia.org/r/597219 [11:49:54] PROBLEM - SSH on ms-be1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:51:16] !log starting backups of es1, es2, es3 on eqiad into backup1002 [11:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:38] RECOVERY - SSH on ms-be1026 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:54:31] (03PS1) 10Dzahn: ganeti/partman: fix recipe that claims to do RAID5 but does RAID1 [puppet] - 10https://gerrit.wikimedia.org/r/597261 [11:59:31] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump to 1.8.4 [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/597248 (owner: 10Alexandros Kosiaris) [12:00:49] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10ssingh) [12:01:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: remove changeprop configuration from scb [puppet] - 10https://gerrit.wikimedia.org/r/597258 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [12:01:53] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10ssingh) >>! In T253024#6147641, @Dzahn wrote: > Is the requested hostname "homer" a copy/paste error? (Updated so as not to confuse with the existing service) > Does this need to be in e... [12:02:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove verbose log statement for symlinks [debs/helm] - 10https://gerrit.wikimedia.org/r/597250 (owner: 10JMeybohm) [12:09:24] RECOVERY - mysqld processes on db2073 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:16:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 from me as well." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [12:17:11] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) [12:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:43] (03CR) 10CDanis: [C: 03+1] swift: enable s3api [puppet] - 10https://gerrit.wikimedia.org/r/596658 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:20:11] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10ssingh) [12:20:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] restrouter: Remove chart and namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/596141 (https://phabricator.wikimedia.org/T242461) (owner: 10JMeybohm) [12:22:18] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10Dzahn) a:03Dzahn [12:23:26] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10ssingh) Updated the task description as we have decided to go with codfw. For the disk, we can start with 30G and then add an additional one if required. (In case we decide to use the disk... [12:23:46] (03PS1) 10JMeybohm: tls_helper: fix typo configMap volume reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597264 (https://phabricator.wikimedia.org/T244843) [12:25:33] 10Operations, 10Analytics, 10serviceops, 10vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (10elukey) ` elukey@ganeti1003:~$ sudo gnt-group list Group Nodes Instances AllocPolicy NDParams row_A 4 44 preferred ovs=False, ssh_port=22, ovs_link=, spin... 
[12:26:20] (03PS1) 10Muehlenhoff: Add a define for creating a system user using systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/597265 [12:26:42] (03CR) 10jerkins-bot: [V: 04-1] Add a define for creating a system user using systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [12:28:40] (03PS2) 10Muehlenhoff: Add a define for creating a system user using systemd-sysusers [puppet] - 10https://gerrit.wikimedia.org/r/597265 [12:28:44] (03CR) 10JMeybohm: [C: 03+2] Remove verbose log statement for symlinks [debs/helm] - 10https://gerrit.wikimedia.org/r/597250 (owner: 10JMeybohm) [12:29:32] (03CR) 10Hnowlan: "> The only interesting thing is whether we want this to be shipped in the packaged helm chart (the .tgz file) or not." [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [12:33:33] (03PS6) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) [12:33:38] (03PS1) 10Elukey: Add A/AAAA/PTR records for matomo1002 [dns] - 10https://gerrit.wikimedia.org/r/597266 (https://phabricator.wikimedia.org/T252742) [12:34:04] (03CR) 10jerkins-bot: [V: 04-1] Add A/AAAA/PTR records for matomo1002 [dns] - 10https://gerrit.wikimedia.org/r/597266 (https://phabricator.wikimedia.org/T252742) (owner: 10Elukey) [12:35:48] mmm this doesn't seem to be a bug in my patch [12:36:21] tox locally works fine [12:36:28] and in jenkins [12:36:29] Exception: Command /usr/sbin/gdnsd -c /tmp/dns-check.z7e7j3i1 checkconf failed with exit code 42, stderr: [12:36:35] elukey: missing ; in templates/wmnet line 739 [12:36:37] before the comment [12:37:43] !imported helm 2.16.7-2 to main for buster-wikimedia, stretch-wikimedia, jessie-wikimedia [12:37:47] !log imported helm 2.16.7-2 to main for buster-wikimedia, stretch-wikimedia, jessie-wikimedia 
[12:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:05] mutante: ah thanks! I hoped for a more descriptive error msg :( [12:38:49] (03PS2) 10Elukey: Add A/AAAA/PTR records for matomo1002 [dns] - 10https://gerrit.wikimedia.org/r/597266 (https://phabricator.wikimedia.org/T252742) [12:39:26] yw, and yes [12:40:12] !log ariel@deploy1001 Started deploy [dumps/dumps@a329605]: make page content fixup script move inprog files into place if good [12:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:16] !log ariel@deploy1001 Finished deploy [dumps/dumps@a329605]: make page content fixup script move inprog files into place if good (duration: 00m 04s) [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:20] (03CR) 10Dzahn: [C: 03+2] "well.. more like "attempt to fix" but will try it with a random host and they are not in prod yet anyways" [puppet] - 10https://gerrit.wikimedia.org/r/597261 (owner: 10Dzahn) [12:44:20] 10Operations, 10Traffic, 10vm-requests: Create a Ganeti VM for Wikidough - https://phabricator.wikimedia.org/T253024 (10ssingh) [12:46:21] arturo: can i merge your mailrelay change for toolforge? [12:47:32] (03CR) 10Ayounsi: BGP: standardize fixed part of IX4/IX6 groups (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/596597 (owner: 10Ayounsi) [12:47:56] (03CR) 10Muehlenhoff: "This looks great, two nits and one suggestion inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [12:48:46] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tls_helper: fix typo configMap volume reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597264 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [12:48:54] (03PS3) 10Privacybatm: CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) [12:48:56] mutante: yes! sorry. I completely forgot about it [12:49:02] 10Operations, 10LDAP-Access-Requests: Add `dcipoletti` to `wmf` Access Group - https://phabricator.wikimedia.org/T252674 (10dcipoletti) Thanks, Reuven! Much appreciated :) [12:49:18] (03CR) 10jerkins-bot: [V: 04-1] CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [12:49:24] arturo: no problem. done [12:49:31] thanks [12:49:32] (03PS1) 10Ayounsi: Remove bogons4 for policy options [homer/public] - 10https://gerrit.wikimedia.org/r/597272 [12:50:19] (03PS1) 10Kormat: hieradata: enable notifications for db2136 [puppet] - 10https://gerrit.wikimedia.org/r/597273 (https://phabricator.wikimedia.org/T252985) [12:50:30] marostegui: ^ [12:50:40] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [12:51:27] (03PS2) 10Jbond: hiera backends: Comment out special Netbox case [puppet] - 10https://gerrit.wikimedia.org/r/596787 (owner: 10CRusnov) [12:51:56] (03PS3) 10Filippo Giunchedi: profile: initial tests for logstash filters [puppet] - 10https://gerrit.wikimedia.org/r/594460 (https://phabricator.wikimedia.org/T251869) [12:52:01] (03PS1) 10Hnowlan: restrouter: release new package [deployment-charts] - 10https://gerrit.wikimedia.org/r/597274 (https://phabricator.wikimedia.org/T252865) [12:53:22] (03CR) 10JMeybohm: [C: 03+2] tls_helper: fix typo configMap volume reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597264 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [12:53:42] (03Merged) 10jenkins-bot: tls_helper: fix typo configMap volume reference [deployment-charts] - 10https://gerrit.wikimedia.org/r/597264 (https://phabricator.wikimedia.org/T244843) (owner: 10JMeybohm) [12:54:01] (03CR) 10Jbond: "I have updated the PS and commit message" [puppet] - 10https://gerrit.wikimedia.org/r/596787 (owner: 10CRusnov) [12:54:06] (03PS2) 10Ayounsi: Remove bogons4 for policy options [homer/public] - 10https://gerrit.wikimedia.org/r/597272 [12:54:24] (03CR) 10Jbond: [C: 03+2] hiera backends: Comment out special Netbox case [puppet] - 10https://gerrit.wikimedia.org/r/596787 (owner: 10CRusnov) [12:55:21] (03PS3) 10Ayounsi: Remove bogons4 for policy options [homer/public] - 10https://gerrit.wikimedia.org/r/597272 [12:55:33] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/22595/ms-fe1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596658 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [12:56:04] mutante: if you have a min, may I ask for a quick sanity check for my dns change? [12:56:14] (03CR) 10Marostegui: "host all green?" 
[puppet] - 10https://gerrit.wikimedia.org/r/597273 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:56:16] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:56:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [homer/public] - 10https://gerrit.wikimedia.org/r/597272 (owner: 10Ayounsi) [12:57:22] (03PS1) 10JMeybohm: mathoid: bump version for fixed template [deployment-charts] - 10https://gerrit.wikimedia.org/r/597275 (https://phabricator.wikimedia.org/T235411) [12:57:42] (03CR) 10Kormat: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/597273 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:58:59] (03CR) 10Marostegui: [C: 03+1] hieradata: enable notifications for db2136 [puppet] - 10https://gerrit.wikimedia.org/r/597273 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [12:59:50] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:00:11] ^ that's me [13:00:38] (03CR) 10Dzahn: [C: 03+1] Add A/AAAA/PTR records for matomo1002 [dns] - 10https://gerrit.wikimedia.org/r/597266 (https://phabricator.wikimedia.org/T252742) (owner: 10Elukey) [13:00:43] elukey: ^ lgtm [13:01:28] thanksss [13:01:37] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat In progress. 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:01:37] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat In progress. https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:01:42] (03CR) 10Elukey: [C: 03+2] Add A/AAAA/PTR records for matomo1002 [dns] - 10https://gerrit.wikimedia.org/r/597266 (https://phabricator.wikimedia.org/T252742) (owner: 10Elukey) [13:01:59] (03PS4) 10Privacybatm: CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) [13:02:13] (03CR) 10Kormat: [C: 03+2] hieradata: enable notifications for db2136 [puppet] - 10https://gerrit.wikimedia.org/r/597273 (https://phabricator.wikimedia.org/T252985) (owner: 10Kormat) [13:03:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mathoid: bump version for fixed template [deployment-charts] - 10https://gerrit.wikimedia.org/r/597275 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [13:03:14] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db2136 into s4 T252985', diff saved to https://phabricator.wikimedia.org/P11233 and previous config saved to /var/cache/conftool/dbconfig/20200519-130313-kormat.json [13:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:17] T252985: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 [13:04:35] (03CR) 10Privacybatm: CuminExecution.py: Improve output message readabiliy of transfer.py (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [13:05:42] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:07:34] (03CR) 10JMeybohm: [C: 03+2] mathoid: bump version for fixed template [deployment-charts] - 10https://gerrit.wikimedia.org/r/597275 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [13:08:04] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:08:48] 10Operations, 10DBA, 10observability, 10Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473 (10Marostegui) 05Stalled→03Declined Closing this in favour of T253120 which has more concrete points of action [13:09:14] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [13:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:41] !log updated helm: 2.16.7-1 -> 2.16.7-2 on deploy[1,2]001 and contint[1,2]001 [13:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:43] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [13:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:20] elukey: eqiad is full [13:10:52] elukey: we will have to rack the new ganeti hardware, i am trying something to unblock that [13:11:14] i should have mentioned that when i saw the comments in DNS [13:11:51] ah snap I didn't know it, what command should I run to check that? I followed the wiki but didn't find anything related [13:12:03] (only the list command to check what row to pick) [13:12:20] it is not really urgent, I can postpone this [13:13:33] elukey: sudo gnt-node list and the MFree column [13:13:48] i _think_ it needs to have it on all nodes [13:15:18] elukey: motomo being a one off host can just as well run in codfw? 
we have space there [13:15:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] appservers: switch to envoy in all of codfw [puppet] - 10https://gerrit.wikimedia.org/r/597242 (https://phabricator.wikimedia.org/T247389) (owner: 10Giuseppe Lavagetto) [13:15:48] yea, i said the same to sukhe about his VM request [13:16:10] okok, I'll update the wiki to include the node list to check first [13:16:14] makes sense [13:16:22] thanks [13:16:22] (if it is not already there, I didn't see it) [13:16:45] i don't think it is, so that's cool if you add it [13:17:15] _joe_: please ping me when you're done in codfw, I have a fleetwide change I'd like to disable puppet to push [13:17:44] <_joe_> cdanis: it will take some time sadly [13:17:53] that is ok [13:17:58] (03CR) 10Jbond: Add a define for creating a system user using systemd-sysusers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [13:18:15] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [13:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:05] mutante: yes I think that matomo can live in codfw, it would be a little weird to remember since all the Analytics stuff is in eqiad, but not a big deal. If eqiad is full I can also wait until it is unblocked, not in a rush [13:19:42] elukey: ok, yea, i don't know if you would need an analytics vlan in codfw ? [13:21:30] failed to wmf-auto-reimage... remote IPMI failed.. we need something to check BIOS settings [13:21:59] we already check PXE override [13:22:09] and we have icinga checks [13:22:12] no no private vlan is fine, no need for analytics in this case [13:22:15] for remote IPMI [13:23:05] hmmm. 
notifications are all disabled but checks are also green [13:23:07] (03CR) 10Filippo Giunchedi: Add a define for creating a system user using systemd-sysusers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [13:23:39] (03PS19) 10CDanis: prometheus: export NIC firmware versions [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [13:24:27] going through the IPMI checks .. locally..then remote [13:25:28] yeah, follow https://wikitech.wikimedia.org/wiki/Ipmi [13:25:33] and let me know if you need a hand [13:27:31] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti1009.eqiad.wmnet ` The log can be found in `/var/lo... [13:27:32] volans: locally worked, remote failed. running the "diff" command showed a diff.. ran it again with --commit to fix it, then no more diff. now reimaging works [13:27:49] so the docs are great and work.. just was not caught by monitoring i guess [13:30:11] mutante: great. The icinga checks do not perform login both for security and for not harassing the mgmt cards that are not that reliable [13:30:31] (03CR) 10CDanis: [C: 03+1] profile: add thanos::swift::frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597017 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:30:41] volans: alright, gotcha [13:31:06] so can't catch all issues so far. 
was a trade off [13:31:25] ack, thx [13:33:00] PROBLEM - Nginx local proxy to apache on mw2165 is CRITICAL: connect to address 10.192.32.53 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [13:33:15] (03CR) 10CDanis: [C: 03+1] profile: add thanos::httpd to proxy thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597245 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:33:22] PROBLEM - Check systemd state on mw2165 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:23] (03PS5) 10Aaron Schulz: Set "coalesceKeys" to "non-global" for testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098 (https://phabricator.wikimedia.org/T252564) [13:34:32] (03CR) 10Muehlenhoff: Add a define for creating a system user using systemd-sysusers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597265 (owner: 10Muehlenhoff) [13:35:07] (03CR) 10CDanis: [C: 03+1] profile: add thanos::swift::backend [puppet] - 10https://gerrit.wikimedia.org/r/597073 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [13:36:46] RECOVERY - Nginx local proxy to apache on mw2165 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 662 bytes in 1.208 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:39:17] mutante: so I used gnt-node list -o +dfree to get also the disk free space, and I can see that some ganeti nodes in row c don't have space for matomo1002, but some do have.. 
I can't really find the error output from ganeti from the spicerack's output, I am wondering if the issue is something else [13:39:23] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti1009.eqiad.wmnet'] ` Of which those **FAILED**: ` ['ganeti1009.eqiad.wmnet'] ` [13:39:33] I'd expect Ganeti to pick the first node available with the resources needed [13:40:02] (-o +dfree is not needed, just realized it) [13:40:34] i was kind of expecting it needs it on all nodes because only then it could balance/migrate VMs between nodes [13:40:52] but none of them were over about 30G anyways, right [13:41:04] PROBLEM - Check systemd state on mw2172 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:27] and yea, i meant only the disk space [13:42:13] mutante: so all nodes need to have enough dfree and mfree [13:42:40] PROBLEM - Check systemd state on mw2178 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:41] it seems a big constraint though [13:42:54] elukey: 19 OpPrereqError: ("Can't compute nodes using iallocator 'hail': Request failed: Group row_C (preferred): No valid allocation solutions, failure reasons: FailMem: 9, FailDisk: 3", 'insufficient_resources') [13:43:14] PROBLEM - Check systemd state on mw2174 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:20] that is /var/log/ganeti/commands.log on ganeti1003 [13:43:46] PROBLEM - Check systemd state on mw2186 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:53] <_joe_> cdanis: I'm 30% done [13:45:00] mutante: in what log sorry? [13:45:06] <_joe_> don't worry about the systemd state alerts on codfw [13:45:07] so that is "nginx failed" and i assume it's known [13:45:08] PROBLEM - Check systemd state on mw2375 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:08] ack [13:45:10] PROBLEM - Check systemd state on mw2235 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:10] PROBLEM - Check systemd state on mw2277 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:12] PROBLEM - Check systemd state on mw2309 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:15] i was just looking, joe [13:45:15] ah ganeti, I was checking spicerack [13:45:16] uff [13:45:30] elukey: yes, ganeti logs on ganeti1003 [13:45:31] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add thanos::httpd to proxy thanos-query [puppet] - 10https://gerrit.wikimedia.org/r/597245 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [13:45:32] <_joe_> mutante: I'll fix all of them once the migration is done, but it's a false positive [13:45:38] <_joe_> and only happens on a few machines [13:45:44] ACK [13:45:58] PROBLEM - Check systemd state on mw2315 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:12] PROBLEM - Check systemd state on mw2353 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:34] PROBLEM - Check systemd state on mw2228 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:10] PROBLEM - Check systemd state on mw2227 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:12] PROBLEM - Check systemd state on mw2181 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:18] PROBLEM - Check systemd state on mw2301 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:44] PROBLEM - Check systemd state on mw2229 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:44] PROBLEM - Check systemd state on mw2367 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:00] mutante: https://wikitech.wikimedia.org/wiki/Ganeti#Verify_cluster_resource_availability [13:48:20] PROBLEM - Check systemd state on mw2327 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:04] PROBLEM - Check systemd state on mw2357 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:22] PROBLEM - Check systemd state on mw2193 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:22] PROBLEM - Check systemd state on mw2198 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:31] 10Operations, 10Analytics, 10serviceops, 10vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (10elukey) 05Open→03Stalled This is currently blocked due to resource constraints in row_c eqiad for Ganeti, see https://wikitech.wikimedia.org/wiki/Ganeti#Verify_clust... [13:49:56] (03CR) 10Ottomata: profile::java: one profile to rule them all (openjdk-x versions) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597219 (owner: 10Elukey) [13:50:10] PROBLEM - Check systemd state on mw2185 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:15] 10Operations, 10Analytics, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10elukey) [13:50:38] RECOVERY - Check systemd state on mw2353 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:46] RECOVERY - Check systemd state on mw2229 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:46] RECOVERY - Check systemd state on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:47] <_joe_> ok I sent a first set of resets [13:50:54] RECOVERY - Check systemd state on mw2178 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:00] RECOVERY - Check systemd state on mw2165 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:02] RECOVERY - Check systemd state on mw2228 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:06] RECOVERY - Check systemd state on mw2357 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:12] RECOVERY - Check systemd state on mw2185 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:12] RECOVERY - Check systemd state on mw2227 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:14] RECOVERY - Check systemd state on mw2181 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:22] 
RECOVERY - Check systemd state on mw2301 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:24] RECOVERY - Check systemd state on mw2327 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:24] RECOVERY - Check systemd state on mw2193 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:26] RECOVERY - Check systemd state on mw2198 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:26] RECOVERY - Check systemd state on mw2172 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:26] RECOVERY - Check systemd state on mw2315 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:30] RECOVERY - Check systemd state on mw2375 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:37] elukey: looks great. thanks, added https://wikitech.wikimedia.org/w/index.php?title=Ganeti&type=revision&diff=1866702&oldid=1866701 [13:53:06] !log configure new AMS-IX port as quarantine - T251121 [13:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:45] (03CR) 10Jcrespo: "Let me know if this answers your question..." 
(033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [13:55:16] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [13:55:48] RECOVERY - Check systemd state on mw2174 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:26] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [13:58:04] PROBLEM - Check systemd state on mw2310 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:08] PROBLEM - Check systemd state on mw2240 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:52] PROBLEM - Check systemd state on mw2275 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:16] RECOVERY - Check systemd state on mw2186 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:18] RECOVERY - Check systemd state on mw2277 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:18] RECOVERY - Check systemd state on mw2309 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:18] PROBLEM - Check systemd state on mw2355 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:38] PROBLEM - Check systemd state on mw2184 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:42] uh... [14:00:44] PROBLEM - Check systemd state on mw2236 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:36] PROBLEM - Check systemd state on mw2313 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:11] <_joe_> vgutierrez: again, it's the nginx/envoy swap [14:02:18] :) [14:02:25] <_joe_> sometimes nginx gets marked as failed as it won't stop in time for the package removal [14:02:28] RECOVERY - Check systemd state on mw2235 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:46] PROBLEM - Check systemd state on mw2363 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:12] PROBLEM - Check systemd state on mw2255 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:22] PROBLEM - Check systemd state on mw2359 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:50] PROBLEM - Check systemd state on mw2365 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:14] PROBLEM - Check systemd state on mw2361 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:50] RECOVERY - Check systemd state on mw2359 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:52] RECOVERY - Check systemd state on mw2236 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:52] RECOVERY - Check systemd state on mw2240 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:11] <_joe_> cdanis: I'm done [14:05:18] RECOVERY - Check systemd state on mw2365 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:40] RECOVERY - Check systemd state on mw2363 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:40] RECOVERY - Check systemd state on mw2275 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:40] RECOVERY - Check 
systemd state on mw2361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:41] (03PS10) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [14:05:52] RECOVERY - Check systemd state on mw2313 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:52] RECOVERY - Check systemd state on mw2355 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:08] RECOVERY - Check systemd state on mw2255 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:12] RECOVERY - Check systemd state on mw2184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:16] RECOVERY - Check systemd state on mw2310 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:46] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus200[34] - https://phabricator.wikimedia.org/T251622 (10Papaul) [14:08:21] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus200[34] - https://phabricator.wikimedia.org/T251622 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi All yours [14:08:43] (03CR) 10Privacybatm: "> Patch Set 4:" (033 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) (owner: 10Privacybatm) [14:08:53] 10Operations, 10netops: Set minimum-links 2 to AMS-IX LACP - https://phabricator.wikimedia.org/T253122 (10ayounsi) p:05Triage→03High [14:14:07] 10Operations, 10ops-codfw, 10serviceops: 
(Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Papaul) We have 5 mw servers left to be racked in row c rack c3 since we used 10 servers in T252185 [14:14:14] 10Operations, 10netops: Set minimum-links 2 to AMS-IX LACP - https://phabricator.wikimedia.org/T253122 (10CDanis) LGTM assuming we don't also configure `optimize-aggregate-frr`: https://kb.juniper.net/InfoCenter/index?page=content&id=KB34635&actp=METADATA [14:14:30] (03PS5) 10Privacybatm: CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) [14:19:18] (03PS6) 10Privacybatm: CuminExecution.py: Improve output message readabiliy of transfer.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/597069 (https://phabricator.wikimedia.org/T252802) [14:19:25] 10Operations, 10netops: Set minimum-links 2 to AMS-IX LACP - https://phabricator.wikimedia.org/T253122 (10jbond) LGTM [14:20:57] (03PS6) 10ZPapierski: Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (https://phabricator.wikimedia.org/T237089) (owner: 10EBernhardson) [14:24:39] (03PS2) 10Alexandros Kosiaris: Added support egress rules for blubberoid chart. Added egress template in common _helpers.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/596181 (owner: 10Apakhomov) [14:25:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] "This is pretty nice. Thanks (and sorry for the late review). I 'll merge as is, followed by a change to enable this across the 3 clusters" [deployment-charts] - 10https://gerrit.wikimedia.org/r/596181 (owner: 10Apakhomov) [14:25:51] (03Merged) 10jenkins-bot: Added support egress rules for blubberoid chart. 
Added egress template in common _helpers.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/596181 (owner: 10Apakhomov) [14:26:03] !log Set minimum-links 2 to AMS-IX LACP - T253122 [14:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:07] T253122: Set minimum-links 2 to AMS-IX LACP - https://phabricator.wikimedia.org/T253122 [14:28:41] 10Operations, 10netops: Set minimum-links 2 to AMS-IX LACP - https://phabricator.wikimedia.org/T253122 (10ayounsi) 05Open→03Resolved Thanks! This will also help in case the wrong cable gets bumped into during the new link provisioning. [14:28:44] (03PS1) 10Ssingh: cescout: enable restart for the Postgres service (finalize f3a35978) [puppet] - 10https://gerrit.wikimedia.org/r/597288 [14:30:56] (03CR) 10Ssingh: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/22596/cescout1001.eqiad.wmnet/index.html merging trivial cescout change." [puppet] - 10https://gerrit.wikimedia.org/r/597288 (owner: 10Ssingh) [14:31:20] (03PS1) 10Alexandros Kosiaris: blubberoid: Add a networkpolicy fixture for CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/597289 (https://phabricator.wikimedia.org/T249927) [14:32:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] "CI exercised those code paths as well, great!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/597289 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [14:32:42] (03Merged) 10jenkins-bot: blubberoid: Add a networkpolicy fixture for CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/597289 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [14:33:29] (03PS1) 10Alexandros Kosiaris: blubberoid: Bump version to 0.0.22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597290 (https://phabricator.wikimedia.org/T249927) [14:34:25] (03PS2) 10Alexandros Kosiaris: blubberoid: Bump version to 0.0.22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597290 (https://phabricator.wikimedia.org/T249927) [14:36:38] (03PS1) 10Alexandros Kosiaris: blubberoid: Enable egress networkpolicy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/597291 (https://phabricator.wikimedia.org/T249927) [14:36:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] blubberoid: Bump version to 0.0.22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597290 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [14:37:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] blubberoid: Enable egress networkpolicy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/597291 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [14:37:25] (03Merged) 10jenkins-bot: blubberoid: Bump version to 0.0.22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/597290 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [14:37:25] (03Merged) 10jenkins-bot: blubberoid: Enable egress networkpolicy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/597291 (https://phabricator.wikimedia.org/T249927) (owner: 10Alexandros Kosiaris) [14:38:27] in a few minutes I'll be disabling puppet on all physical hosts so I can do a quick canary of https://gerrit.wikimedia.org/r/c/operations/puppet/+/549683 on a handful of 
hosts before letting a wider deployment happen [14:38:53] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [14:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:39] (03PS5) 10Jbond: puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) [14:40:07] (03CR) 10Jbond: "thanks updated" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [14:41:48] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review, and 2 others: Support kubernetes Egress networkpolicies in our helm charts - https://phabricator.wikimedia.org/T249927 (10akosiaris) blubberoid in staging switch to have the new policy in the Networkpolicy. Verified with `kubectl describe netpol... [14:46:33] (03CR) 10CDanis: [C: 03+2] prometheus: export NIC firmware versions [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [14:47:03] !log disabling puppet on all physical hosts ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥☕ sudo cumin 'F:virtual = physical' 'disable-puppet "cdanis deploying I68c97d5"' [14:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:57] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Apache on Superset [puppet] - 10https://gerrit.wikimedia.org/r/597296 (https://phabricator.wikimedia.org/T135991) [15:14:47] PROBLEM - Check systemd state on cescout1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:15] ^ expected [15:16:39] RECOVERY - Check systemd state on cescout1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:47] !log canary on ~150 hosts looks great, re-enabling puppet on all physical hosts ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥☕ sudo cumin 'F:virtual = physical' 'enable-puppet "cdanis deploying I68c97d5"' [15:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:23] (03PS1) 10Muehlenhoff: Exclude /mnt/hfds on labstore1006/1007 for debdeploy restart checks [puppet] - 10https://gerrit.wikimedia.org/r/597298 [15:20:35] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime [15:20:36] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:53] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:23:05] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:23:12] ^ taking care of it [15:23:41] !log kormat@cumin1001 dbctl commit (dc=all): 'Repool db2073 into s4 T252985', diff saved to https://phabricator.wikimedia.org/P11236 and previous config saved to /var/cache/conftool/dbconfig/20200519-152340-kormat.json [15:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:48] T252985: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 
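Editor's note: the two `dbctl commit` entries logged above (13:03:14 and 15:23:41) share the same shape. The following is a minimal sketch, in Python, of pulling the commit message, diff paste URL and saved-config path out of such an entry. The regular expression is inferred only from the two entries in this log and is an assumption, not a documented dbctl or SAL format; `parse_dbctl_commit` is a hypothetical helper name.

```python
import re

# Pattern inferred from the "dbctl commit" entries seen in this log (assumption,
# not an official format): user@host, dc selector, quoted message, paste URL,
# and the path where the previous config was saved.
DBCTL_COMMIT_RE = re.compile(
    r"!log (?P<user>[^@\s]+)@(?P<host>\S+) dbctl commit \(dc=(?P<dc>[^)]+)\): "
    r"'(?P<message>.*)', diff saved to (?P<diff_url>\S+) "
    r"and previous config saved to (?P<backup>\S+)"
)


def parse_dbctl_commit(line):
    """Return the fields of a dbctl commit log entry, or None if no match."""
    match = DBCTL_COMMIT_RE.search(line)
    return match.groupdict() if match else None


# Entry copied from the 15:23:41 message above.
example = (
    "[15:23:41] !log kormat@cumin1001 dbctl commit (dc=all): "
    "'Repool db2073 into s4 T252985', diff saved to "
    "https://phabricator.wikimedia.org/P11236 and previous config saved to "
    "/var/cache/conftool/dbconfig/20200519-152340-kormat.json"
)
entry = parse_dbctl_commit(example)
print(entry["message"], entry["diff_url"], entry["backup"])
```

Lines that are not dbctl commits return None, so the helper can be run over every message in the log and the non-matches skipped.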
[15:23:49] (03CR) 10Filippo Giunchedi: profile: add thanos::swift::frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/597017 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:23:53] I think we should increase the soft time of that [15:23:58] it alerts too quickly [15:24:07] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add thanos::swift::frontend [puppet] - 10https://gerrit.wikimedia.org/r/597017 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:24:11] jynus: agreed [15:24:29] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat I was slow. https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:24:29] ACKNOWLEDGEMENT - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff Kormat I was slow. 
https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:26:44] (03PS3) 10Filippo Giunchedi: profile: add thanos::swift::frontend [puppet] - 10https://gerrit.wikimedia.org/r/597017 (https://phabricator.wikimedia.org/T252186) [15:26:46] (03PS3) 10Filippo Giunchedi: thanos: add Envoy TLS terminator [puppet] - 10https://gerrit.wikimedia.org/r/597018 (https://phabricator.wikimedia.org/T252186) [15:26:48] (03PS3) 10Filippo Giunchedi: thanos: add Store Gateway [puppet] - 10https://gerrit.wikimedia.org/r/597019 (https://phabricator.wikimedia.org/T252186) [15:26:50] (03PS2) 10Filippo Giunchedi: thanos: add objstore support to sidecar [puppet] - 10https://gerrit.wikimedia.org/r/597071 (https://phabricator.wikimedia.org/T252186) [15:26:52] (03PS3) 10Filippo Giunchedi: thanos: add thanos::compact [puppet] - 10https://gerrit.wikimedia.org/r/597072 (https://phabricator.wikimedia.org/T252186) [15:26:54] (03PS3) 10Filippo Giunchedi: profile: add thanos::swift::backend [puppet] - 10https://gerrit.wikimedia.org/r/597073 (https://phabricator.wikimedia.org/T252186) [15:28:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:28:55] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [15:32:24] (03PS1) 10Cwhite: upgrade golang seed image to buster and upgrade golang to 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 [15:37:04] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add thanos::swift::frontend [puppet] - 10https://gerrit.wikimedia.org/r/597017 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:37:28] (03CR) 10Jbond: "> Patch Set 3: Code-Review-1" (034 comments) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (owner: 10Jbond) [15:41:13] 
PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:42:01] XioNoX: ^^^ [15:42:19] first time I see that one [15:43:33] we lost all of the lacnic ROAs [15:44:15] LMK if you need a hand [15:45:47] volans: nothing urgent, worse case we don't enforce RPKI for that region, no risk of null routing legit traffic [15:46:02] k [15:49:40] looks like lacnic is back as of 15:45 [15:49:47] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add thanos::swift::backend [puppet] - 10https://gerrit.wikimedia.org/r/597073 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [15:49:55] (03PS4) 10Filippo Giunchedi: profile: add thanos::swift::backend [puppet] - 10https://gerrit.wikimedia.org/r/597073 (https://phabricator.wikimedia.org/T252186) [15:49:59] volans: "job: yeah seems they nuke themselves 3 times a day" [15:49:59] https://mail.lacnic.net/pipermail/lacnog/2020-May/008015.html [15:50:16] paravoid: ^ [15:50:17] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:50:53] jeez [15:51:55] PROBLEM - Memcached on thanos-fe2002 is CRITICAL: connect to address 10.192.16.133 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [15:52:18] that's me ^ [15:52:36] cdanis: also answers your question on the review ^ [15:52:49] (03PS4) 10Jbond: docker build: update the build process to us docker [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/596779 (https://phabricator.wikimedia.org/T251574) [15:53:09] godog: great :D [15:54:09] PROBLEM - Thanos swift https backend on thanos-fe2002 is CRITICAL: connect to address 10.192.16.133 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Thanos [15:54:28] (03CR) 10Jbond: [C: 03+1] Enable base::service_auto_restart for Apache on Superset [puppet] - 10https://gerrit.wikimedia.org/r/597296 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:56:44] 10Operations, 10Analytics, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) [15:57:23] (03CR) 10Jbond: "LGTM, although i do wonder why we not just add /mnt/hdfs as a default. of the top of my head i dont think it causes an issue if its not a" [puppet] - 10https://gerrit.wikimedia.org/r/597298 (owner: 10Muehlenhoff) [15:58:22] (03PS1) 10Filippo Giunchedi: site: add thanos::backend role to thanos-be2* [puppet] - 10https://gerrit.wikimedia.org/r/597301 (https://phabricator.wikimedia.org/T252186) [15:59:50] 10Operations, 10Traffic: track NIC firmware version numbers across the fleet - https://phabricator.wikimedia.org/T236744 (10CDanis) 05Open→03Resolved Prometheus metrics now exist, via a textfile exporter installed by Puppet on every physical host. Sample output for the metric on an LVS machine: ` # HELP... [16:00:04] godog and _joe_: It is that lovely time of the day again! 
You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:39] (03CR) 10Filippo Giunchedi: [C: 03+2] site: add thanos::backend role to thanos-be2* [puppet] - 10https://gerrit.wikimedia.org/r/597301 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:01:48] 10Operations, 10Analytics, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10herron) Thanks for creating the task @elukey, working together on this SGTM I added a step to the description to help avoid introducing duplicate metrics while both prod and buster... [16:03:11] 10Puppet, 10Phabricator: Create puppet role for Phabricator hosted repo testing - https://phabricator.wikimedia.org/T104827 (10Aklapper) 05Stalled→03Declined Boldly declining as `phab-02` does not exist anymore and as it seems that nobody is interested in this anymore. If I misunderstood, please reopen. [16:05:13] (03Abandoned) 10Herron: netbox::frontend: log to localhost udp rsyslog listener (json compat) [puppet] - 10https://gerrit.wikimedia.org/r/597165 (owner: 10Herron) [16:05:58] 10Operations, 10Analytics, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10elukey) Thanks @herron! I had a chat with Cole on IRC to establish a clear ownership for these hosts, I am wondering if Observability could be a better owner than Analytics nowadays? [16:08:42] (03PS1) 10Filippo Giunchedi: hieradata: add memcached_servers for thanos backend [puppet] - 10https://gerrit.wikimedia.org/r/597302 (https://phabricator.wikimedia.org/T252186) [16:09:21] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add memcached_servers for thanos backend [puppet] - 10https://gerrit.wikimedia.org/r/597302 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:10:30] godog: nice! 
are these running on buster? [16:10:40] if so I am curious about seeing their metrics [16:10:59] elukey: yeah all buster [16:11:17] very nice, so they'll get memcached 1.5.x [16:11:21] elukey: none of it is in production yet, feel free to poke around! [16:11:57] godog: let me know when it is ready, just to double check the memcached settings and metrics (curious) [16:12:42] (03PS1) 10JMeybohm: tls_helper: fix the envoy config configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/597303 (https://phabricator.wikimedia.org/T244843) [16:12:48] elukey: for sure! will do [16:15:52] <3 [16:20:38] (03PS1) 10Cwhite: update readme [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597304 [16:20:46] (03PS1) 10JMeybohm: envoy: Don't try to create a envoy config if it already exists [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597305 (https://phabricator.wikimedia.org/T244843) [16:24:27] !log power cycle thanos-fe* / thanos-be* [16:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:48] RECOVERY - Memcached on thanos-fe2002 is OK: TCP OK - 0.036 second response time on 10.192.16.133 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [16:26:48] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_10902: Servers thanos-fe2002.codfw.wmnet, thanos-fe2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:27:11] should recover soon ^ [16:28:11] 10Operations, 10netops: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (10CDanis) [16:28:18] 10Operations, 10netops: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (10CDanis) p:05Triage→03Low [16:29:22] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:30:09] 10Operations, 
10Traffic: OCSP Stapling for Intermediates - https://phabricator.wikimedia.org/T148134 (10Aklapper) >>! In T148134#2760707, @BBlack wrote: > Stalling this on further progress in the rest of the world (browsers implementations, TLS Cached Info, etc). 3.5 years after stalling this task, it seems t... [16:32:29] (03PS2) 10JMeybohm: tls_helper: fix the envoy config configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/597303 (https://phabricator.wikimedia.org/T235411) [16:35:33] (03PS1) 10Filippo Giunchedi: hieradata: fix thanos-fe host list [puppet] - 10https://gerrit.wikimedia.org/r/597308 (https://phabricator.wikimedia.org/T252186) [16:38:02] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix thanos-fe host list [puppet] - 10https://gerrit.wikimedia.org/r/597308 (https://phabricator.wikimedia.org/T252186) (owner: 10Filippo Giunchedi) [16:38:59] 10Operations, 10docker-pkg: docker-pkg update cli renders unclear guidance - https://phabricator.wikimedia.org/T253131 (10colewhite) [16:39:25] 10Operations, 10docker-pkg, 10serviceops: docker-pkg update cli renders unclear guidance - https://phabricator.wikimedia.org/T253131 (10colewhite) [16:43:54] 10Operations, 10Traffic: Secure shared ticket key rotation for anycast authdns - https://phabricator.wikimedia.org/T240863 (10BBlack) 05Open→03Declined There's not much DoTLS adoption so far, and really our primary HTTPS termination needs this more than AuthDNS does, at which point we can just copy whateve... [16:44:00] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10BBlack) [16:47:40] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10RKemper) Data transfer is done across all instances as of last friday, along with a wdqs-categories reload that we tacked on. Circling back to mark this ti... 
[16:49:09] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10RKemper) 05Open→03Resolved Still need to circle back to dependent tickets and verify that those problems are solved [16:51:12] (03PS1) 10Arturo Borrero Gonzalez: aptrepo: add stretch-wikimedia component/tesseract-410-bpo [puppet] - 10https://gerrit.wikimedia.org/r/597309 (https://phabricator.wikimedia.org/T247422) [16:52:38] (03PS1) 10BBlack: Add anycast authdns IP address [dns] - 10https://gerrit.wikimedia.org/r/597310 (https://phabricator.wikimedia.org/T98006) [16:52:58] (03PS1) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [16:53:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add stretch-wikimedia component/tesseract-410-bpo [puppet] - 10https://gerrit.wikimedia.org/r/597309 (https://phabricator.wikimedia.org/T247422) (owner: 10Arturo Borrero Gonzalez) [16:56:43] (03PS2) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [17:00:04] halfak and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T1700). [17:00:53] PROBLEM - Stale file for node-exporter textfile in eqiad on icinga1001 is CRITICAL: cluster=mysql file=device_smart.prom instance=db1140:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:03:37] 10Operations: Change main branch of puppet repository to be 'master' instead of production - https://phabricator.wikimedia.org/T101632 (10Aklapper) 05Stalled→03Open >>! 
In T101632#1346509, @chasemp wrote: > to be discussed with Opsen Then opsen should discuss it. :) Resetting task status. [17:09:22] !log added tesseract suite to stretch-wikimedia component/tesseract-410-bpo (T247422) [17:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:26] T247422: Update Tesseract on Toolforge to v4.1.0 - https://phabricator.wikimedia.org/T247422 [17:15:10] (03PS3) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [17:15:12] (03PS1) 10BBlack: check_dns_query: add localonly option [puppet] - 10https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) [17:16:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: exec environment: use newer tesseract suite [puppet] - 10https://gerrit.wikimedia.org/r/597316 (https://phabricator.wikimedia.org/T247422) [17:20:55] 10Operations, 10Research: Add Git LFS support for research/wikiworkshop - https://phabricator.wikimedia.org/T252956 (10leila) Ok. I sent the email to office IT and requested that they put them on YT. I'll share the links with you, @bmansurov, when I have them. [17:21:44] RECOVERY - Stale file for node-exporter textfile in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:26:10] 10Operations, 10Research: Add Git LFS support for research/wikiworkshop - https://phabricator.wikimedia.org/T252956 (10RLazarus) p:05Triage→03Medium [17:27:04] 10Operations, 10conftool: dbctl gives user-hostile diffs - https://phabricator.wikimedia.org/T253025 (10RLazarus) p:05Triage→03Medium [17:31:19] 10Operations, 10observability: Better manage java updates for ELK7 - https://phabricator.wikimedia.org/T252913 (10RLazarus) p:05Triage→03Medium [17:32:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: exec environment: use newer tesseract suite [puppet] - 10https://gerrit.wikimedia.org/r/597316 (https://phabricator.wikimedia.org/T247422) (owner: 10Arturo Borrero Gonzalez) [17:33:17] 10Operations, 10observability: Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (10RobH) [17:33:24] 10Operations, 10observability: Icinga refresh hardware selection (2020) - https://phabricator.wikimedia.org/T251644 (10RobH) [17:34:10] 10Operations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10RLazarus) p:05Triage→03Medium [17:35:39] (03CR) 10Bstorm: [C: 03+2] paws: Introduce the role skeleton for the paws servers [puppet] - 10https://gerrit.wikimedia.org/r/596478 (https://phabricator.wikimedia.org/T188912) (owner: 10Bstorm) [17:38:16] 10Operations, 10Analytics, 10observability: Move kafkamon hosts to Debian Buster - https://phabricator.wikimedia.org/T252773 (10RLazarus) p:05Triage→03Medium a:03herron [17:41:49] (03PS2) 10Cwhite: upgrade golang seed image to buster and upgrade golang to 1.14 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597299 [17:44:32] PROBLEM - Check systemd state on ms-be1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:54] (03PS1) 10Cwhite: add loki 1.4.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) [17:50:22] (03PS1) 10CDanis: dbctl: make 'yes' equivalent to 'y' in confirmations [software/conftool] - 10https://gerrit.wikimedia.org/r/597318 [17:54:02] (03CR) 10Volans: [C: 03+1] "LGTM, but is missing tests ;)" [software/conftool] - 10https://gerrit.wikimedia.org/r/597318 (owner: 10CDanis) [17:56:13] (03PS2) 10Cwhite: add loki 1.4.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/597317 (https://phabricator.wikimedia.org/T222826) [17:57:36] jouncebot next [17:57:36] In 0 hour(s) and 2 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T1800) [18:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T1800) [18:00:05] DannyS712: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:01:32] https://www.irccloud.com/pastebin/ntEfuyQw/ [18:01:51] ^ ignore that, sorry [18:02:00] Ready for SWAT: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/597174/ [18:04:36] I will do the SWAT [18:05:00] And I forgot to list my own patch lol [18:08:01] (03PS2) 10Catrope: Enable GrowthExperiments features on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596313 (https://phabricator.wikimedia.org/T252420) [18:08:10] (03CR) 10Catrope: [C: 03+2] Enable GrowthExperiments features on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596313 (https://phabricator.wikimedia.org/T252420) (owner: 10Catrope) [18:08:52] (03Merged) 10jenkins-bot: Enable GrowthExperiments features on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596313 (https://phabricator.wikimedia.org/T252420) (owner: 10Catrope) [18:09:16] (03PS3) 10DannyS712: Enwiki config: Grant template editors `editcontentmodel` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 (https://phabricator.wikimedia.org/T253081) [18:12:33] RECOVERY - Check systemd state on ms-be1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:03] (03PS1) 10Catrope: GrowthExperiments: Update mentor list for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597321 (https://phabricator.wikimedia.org/T252420) [18:13:43] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: Update mentor list for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597321 (https://phabricator.wikimedia.org/T252420) (owner: 10Catrope) [18:14:50] (03Merged) 10jenkins-bot: GrowthExperiments: Update mentor list for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597321 (https://phabricator.wikimedia.org/T252420) (owner: 10Catrope) [18:14:56] (03PS4) 10DannyS712: Enwiki config: Grant template editors `editcontentmodel` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 
(https://phabricator.wikimedia.org/T253081) [18:15:25] (03PS1) 10Cwhite: attempt to detect ca bundle at runtime [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 [18:16:46] (03CR) 10jerkins-bot: [V: 04-1] attempt to detect ca bundle at runtime [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 (owner: 10Cwhite) [18:21:46] (03PS2) 10Cwhite: attempt to detect ca bundle at runtime fix flake8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 [18:22:59] (03CR) 10jerkins-bot: [V: 04-1] attempt to detect ca bundle at runtime fix flake8 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 (owner: 10Cwhite) [18:25:09] @Catrope any updates? [18:27:50] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: ASAP) install additional SSDs into prometheus100[34] - https://phabricator.wikimedia.org/T251621 (10wiki_willy) Looks like it arrived on Friday, but could be that Equinix hasn't moved them over to the storage room yet: https://www.dell.com/support/orders/us/en/... [18:32:20] (03PS2) 10BBlack: check_dns_query: add localonly option [puppet] - 10https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) [18:32:22] (03PS4) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [18:32:24] (03PS1) 10BBlack: Bugfix for check_dns_query error reporting [puppet] - 10https://gerrit.wikimedia.org/r/597326 (https://phabricator.wikimedia.org/T98006) [18:33:24] (03CR) 10jerkins-bot: [V: 04-1] Bugfix for check_dns_query error reporting [puppet] - 10https://gerrit.wikimedia.org/r/597326 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [18:34:18] PROBLEM - Check systemd state on ms-be1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:39] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments features on frwiki (T252420) (duration: 01m 08s) [18:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:43] T252420: Scale: enable Growth team features on French Wikipedia - https://phabricator.wikimedia.org/T252420 [18:35:45] (03PS2) 10BBlack: Fix bug in check_dns_query error reporting [puppet] - 10https://gerrit.wikimedia.org/r/597326 (https://phabricator.wikimedia.org/T98006) [18:35:47] (03PS3) 10BBlack: check_dns_query: add localonly option [puppet] - 10https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) [18:35:49] (03PS5) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [18:36:04] DannyS712: Sorry, my change took a while to test [18:36:31] Doing yours now [18:37:07] (03CR) 10Catrope: [C: 03+2] Enwiki config: Grant template editors `editcontentmodel` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 (https://phabricator.wikimedia.org/T253081) (owner: 10DannyS712) [18:37:33] (03CR) 10BBlack: [C: 03+2] Fix bug in check_dns_query error reporting [puppet] - 10https://gerrit.wikimedia.org/r/597326 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [18:37:54] (03Merged) 10jenkins-bot: Enwiki config: Grant template editors `editcontentmodel` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597174 (https://phabricator.wikimedia.org/T253081) (owner: 10DannyS712) [18:38:26] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.207e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:40:18] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: 
(C)5000 gt (W)1000 gt 61 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:41:06] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Grant template editors editcontentmodel on enwiki (T253081) (duration: 01m 06s) [18:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:09] T253081: Add editcontentmodel right to the templateeditor user group on the English Wikipedia - https://phabricator.wikimedia.org/T253081 [18:41:22] DannyS712: All done, sorry for the wait [18:41:46] RECOVERY - Check systemd state on ms-be1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:54] testing [18:42:34] works [18:42:36] thanks [18:46:19] !log performing rolling restarts of codfw/eqiad ELK clusters for java updates [18:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:45] (03CR) 10Krinkle: [C: 03+1] Set "coalesceKeys" to "non-global" for testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575098 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [19:02:07] (03CR) 10Ayounsi: [C: 03+1] "Great name." [dns] - 10https://gerrit.wikimedia.org/r/597310 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [19:07:16] 10Operations, 10conftool: dbctl gives user-hostile diffs - https://phabricator.wikimedia.org/T253025 (10CDanis) a:03CDanis [19:08:45] 10Operations, 10Thumbor, 10serviceops, 10Sustainability (Incident Prevention): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (10RLazarus) p:05Triage→03Medium Agree we ought to do this, and I think it's something Envoy can do... 
[19:11:37] (03PS4) 10BBlack: check_dns_query: add localonly option [puppet] - 10https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) [19:11:39] (03PS6) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [19:11:41] (03PS1) 10BBlack: Split nagios_common::check_dns_query from commands [puppet] - 10https://gerrit.wikimedia.org/r/597329 (https://phabricator.wikimedia.org/T98006) [19:11:43] (03PS1) 10BBlack: Add check_dns_query to dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) [19:11:45] (03PS1) 10BBlack: Use check_dns_query for anycast recdns [puppet] - 10https://gerrit.wikimedia.org/r/597331 (https://phabricator.wikimedia.org/T241965) [19:12:28] (03CR) 10jerkins-bot: [V: 04-1] Split nagios_common::check_dns_query from commands [puppet] - 10https://gerrit.wikimedia.org/r/597329 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [19:13:17] (03CR) 10jerkins-bot: [V: 04-1] Add check_dns_query to dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [19:16:16] (03PS2) 10BBlack: Split nagios_common::check_dns_query from commands [puppet] - 10https://gerrit.wikimedia.org/r/597329 (https://phabricator.wikimedia.org/T98006) [19:16:18] (03PS2) 10BBlack: Add check_dns_query to dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) [19:16:20] (03PS5) 10BBlack: check_dns_query: add localonly option [puppet] - 10https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) [19:16:22] (03PS2) 10BBlack: Use check_dns_query for anycast recdns [puppet] - 10https://gerrit.wikimedia.org/r/597331 (https://phabricator.wikimedia.org/T241965) [19:16:24] (03PS7) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 
10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [19:17:30] (03CR) 10jerkins-bot: [V: 04-1] Add check_dns_query to dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [19:17:46] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [19:20:57] (03PS3) 10BBlack: Add check_dns_query to dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) [19:20:59] (03PS6) 10BBlack: check_dns_query: add localonly option [puppet] - 10https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) [19:21:01] (03PS3) 10BBlack: Use check_dns_query for anycast recdns [puppet] - 10https://gerrit.wikimedia.org/r/597331 (https://phabricator.wikimedia.org/T241965) [19:21:03] (03PS8) 10BBlack: Add testable anycast authdns address and bird stuff [puppet] - 10https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006) [19:25:12] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [19:28:46] 10Operations, 10netops: intermittent brief data dropouts for esams netflow data - https://phabricator.wikimedia.org/T253128 (10ayounsi) Relevant [[ https://turnilo.wikimedia.org/#wmf_netflow/4/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ASn... 
[19:30:17] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [19:31:47] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (10Marostegui) Downtime expired - I have acked the alerts in Icinga [19:35:49] (03PS3) 10Cwhite: attempt to detect ca bundle at runtime [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/597322 [19:41:42] (03PS4) 10BBlack: Add check_dns_query to dns profiles [puppet] - 10https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) [19:42:33] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10diego) Thanks @Marostegui For accessing the Jupyter Hub notebook @Rvvalentim needs a ldap user I think. Should we open another ticket for this? Thanks again. [19:43:50] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10Marostegui) Yeah,let's track that in a different one I would suggest. 
[19:50:02] Operations, ops-codfw, netops: (Need by: End of July-2020 ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (Papaul) p:Triage→Medium a:Papaul
[19:51:44] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Krinkle)
[19:51:55] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Krinkle)
[19:52:00] (PS7) BBlack: check_dns_query: add localonly option [puppet] - https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006)
[19:52:02] (PS4) BBlack: Use check_dns_query for anycast recdns [puppet] - https://gerrit.wikimedia.org/r/597331 (https://phabricator.wikimedia.org/T241965)
[19:52:03] (PS9) BBlack: Add testable anycast authdns address and bird stuff [puppet] - https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006)
[19:55:58] (CR) Cwhite: puppetmaster::gitclone: add pre-commit to private repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: Jbond)
[19:56:39] Operations, Jupyter-Hub, SRE-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (diego)
[19:57:25] Operations, SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (diego)
[19:57:27] Operations, Jupyter-Hub, SRE-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (diego)
[20:08:41] (PS10) BBlack: Add testable anycast authdns address and bird stuff [puppet] - https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006)
[20:09:57] jouncebot: next
[20:09:57] In 1 hour(s) and 50 minute(s): Namespace&Sitename Localisation Deployments (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T2200)
[20:11:39] (PS11) BBlack: Add testable anycast authdns address and bird stuff [puppet] - https://gerrit.wikimedia.org/r/597311 (https://phabricator.wikimedia.org/T98006)
[20:18:36] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (Peachey88) Per https://wikitech.wikimedia.org/wiki/SWAP#Access_and_infrastructure which leads to https://wikitech.wikimedia.org/wiki/Analytics/...
[20:26:02] (CR) BBlack: [C: +2] Split nagios_common::check_dns_query from commands [puppet] - https://gerrit.wikimedia.org/r/597329 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[20:31:48] (CR) BBlack: [C: +2] Add check_dns_query to dns profiles [puppet] - https://gerrit.wikimedia.org/r/597330 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[20:31:50] (CR) BBlack: [C: +2] check_dns_query: add localonly option [puppet] - https://gerrit.wikimedia.org/r/597315 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[20:31:56] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.707e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:33:46] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 717 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:34:10] Operations, conftool: dbctl gives user-hostile diffs - https://phabricator.wikimedia.org/T253025 (CDanis) Apparently diffs like the one you generated tickle some paradoxical behavior in Python difflib. I tried `icdiff` (which uses the same engine) and got similarly-asinine output: {F31833204} (see P1123...
[20:39:31] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. - https://phabricator.wikimedia.org/T252185 (Papaul)
[20:39:34] Operations, Traffic: Whitelist x-wikimedia-debug header field (currently not allowed by Access-Control-Allow-Headers in preflight response) - https://phabricator.wikimedia.org/T252826 (BPirkle)
[20:41:06] (CR) BBlack: [C: +2] Use check_dns_query for anycast recdns [puppet] - https://gerrit.wikimedia.org/r/597331 (https://phabricator.wikimedia.org/T241965) (owner: BBlack)
[20:45:10] Operations, observability, Patch-For-Review: Use check_dns_query for anycast DNS checks - https://phabricator.wikimedia.org/T241965 (BBlack) Open→Resolved a:BBlack
[20:46:46] Operations, LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (RLazarus) a:dcipoletti @dcipoletti Can you please ask your manager to comment on this task, approving your access? At the same time, hopefully to speed it up a little: @...
[20:47:49] Operations, conftool: dbctl gives user-hostile diffs - https://phabricator.wikimedia.org/T253025 (CDanis) If we restrict difflib to just the `s4` section of `groupLoadsBySection` in this same diff, we somehow get nearly-reasonable output: {F31833251}
[20:48:00] Operations, LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (RLazarus) p:Triage→Medium
[20:51:04] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Krinkle)
[20:51:57] (CR) BBlack: [C: +2] Add anycast authdns IP address [dns] - https://gerrit.wikimedia.org/r/597310 (https://phabricator.wikimedia.org/T98006) (owner: BBlack)
[20:56:26] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (RLazarus) p:Triage→Medium a:KFrancis I see @Nuria's authorization on the parent task, so going ahead with this. @KFrancis Hi, can...
[20:58:31] Operations, conftool: dbctl gives user-hostile diffs - https://phabricator.wikimedia.org/T253025 (CDanis) Filed upstream with Python as https://bugs.python.org/issue40691
[21:08:38] RoanKattouw: just logging in
[21:09:19] RhinosF1: It's still an hour away, right?
[21:09:28] Operations, Traffic, netops, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack) Status Update: Worked through a bunch of the software-level complexities today with getting bird::anycast to advertise an authdns IP from all the auth...
[21:09:34] RoanKattouw: yeah
[21:10:39] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (Bstorm) Stalled→Open Upgrading labstore1005 on Thursday this week.
[21:10:42] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Bstorm)
[21:10:50] Operations, DC-Ops, cloud-services-team (Kanban): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (Bstorm)
[21:14:39] OK just checking :)
[21:14:56] RoanKattouw: I like to be early
[21:15:02] * RhinosF1 waits for jouncebot
[21:15:16] jouncebot: next
[21:15:17] In 0 hour(s) and 44 minute(s): Namespace&Sitename Localisation Deployments (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T2200)
[21:15:22] :)
[21:15:57] (CR) Jbond: "thanks cole please add to the comment if i missed anything" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: Jbond)
[21:16:12] Operations, LDAP-Access-Requests: Add Daniel Cipoletti to analytics-privatedata-users - https://phabricator.wikimedia.org/T253086 (dr0ptp4kt) Approved.
[21:21:53] (CR) Cwhite: [C: +1] puppetmaster::gitclone: add pre-commit to private repo [puppet] - https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: Jbond)
[21:22:16] Operations, ops-eqiad, DBA, DC-Ops: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (Jclark-ctr) Main board replaced today entered password & management address into server
[21:37:40] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/597346
[21:45:58] (PS1) Clarakosi: Echostore: Fix labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898)
[21:46:08] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1001/22607/icinga1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/597346 (owner: Herron)
[21:49:29] (PS2) Herron: lvs::monitor: expand icinga service descriptions [puppet] - https://gerrit.wikimedia.org/r/597346 (https://phabricator.wikimedia.org/T211692)
[22:00:04] RoanKattouw and RhinosF1: Your horoscope predicts another unfortunate Namespace&Sitename Localisation Deployments deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T2200).
[22:00:04] RhinosF1: A patch you scheduled for Namespace&Sitename Localisation Deployments is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[22:00:11] o/
[22:00:40] Alright let's do this
[22:01:13] RoanKattouw: .31 & .32 branches active so backports ready and listed on https://phabricator.wikimedia.org/T253070 for all
[22:01:32] OK I've +2ed the jv ones, let's do those first
[22:02:13] RoanKattouw: you going for debug1002 as always?
[22:02:34] Yeah I will
[22:02:46] I won't be able to run namespaceDupes until it's everywhere though
[22:02:52] * RhinosF1 fired up, got it!
[22:02:54] But let's first test if it works without a full scap
[22:03:16] * RhinosF1 crosses everything
[22:12:11] (CR) Krinkle: "> It doesn't have deploymentwiki in the output, but I assume that's because beta wikis aren't automated into CI." [mediawiki-config] - https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[22:13:34] (CR) Krinkle: "Captured at https://phabricator.wikimedia.org/P11240 as it will expire soon." [mediawiki-config] - https://gerrit.wikimedia.org/r/595634 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[22:18:34] RoanKattouw: merged!
[22:18:47] (CR) Catrope: [C: +2] Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) (owner: RhinosF1)
[22:20:38] RhinosF1: OK it's on mwdebug1002, let's see if it worked
[22:20:44] RoanKattouw: let me know when to check jvwiki on mwdebug1002 - why removed cr on that ^
[22:20:50] just saw ping
[22:20:53] * RhinosF1 looks
[22:21:25] OK that seems to have worked?
[22:22:29] RoanKattouw: no? That should be Naraguna. Special:AllPages gives wrong name as well. https://usercontent.irccloud-cdn.com/file/Ffnx2CRr/Screenshot%202020-05-19%20at%2023.21.54.png
[22:22:45] Oh right
[22:23:54] OK well, I guess we'll need a full scap
[22:23:56] RoanKattouw: same on jvwiktionary as well
[22:23:59] First merging all the other ones
[22:24:01] looks like it
[22:24:06] k, do config as well
[22:26:26] RoanKattouw: still https://gerrit.wikimedia.org/r/597235, https://gerrit.wikimedia.org/r/597236 and https://gerrit.wikimedia.org/r/595883 to CR+2
[22:26:29] Holding that one for last because it'll be quick
[22:26:33] the config one that is
[22:26:52] RoanKattouw: got it
[22:27:32] RoanKattouw: now we wait for zuul, looks busy.
[22:27:35] Operations, docker-pkg, serviceops: docker-pkg update cli renders unclear guidance - https://phabricator.wikimedia.org/T253131 (RLazarus) p:Triage→Medium Looks like this comes from [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/docker-pkg/+/1023a62053f730a349db99dd...
[22:36:59] RoanKattouw: just core to go!
[22:41:33] (CR) Jforrester: "Shall I deploy this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898) (owner: Clarakosi)
[22:45:39] RoanKattouw: All merged!
[22:50:48] (CR) Cicalese: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898) (owner: Clarakosi)
[22:51:22] OK, let's run this scap, fingers crossed
[22:51:40] RoanKattouw: yep
[22:51:43] (CR) Jforrester: "OK, I'll land this after the current full scap is done." [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898) (owner: Clarakosi)
[22:53:11] !log catrope@deploy1001 Started scap: i18n scap for namespace localizations (T251287, T252754)
[22:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:16] T251287: Localised sitenames/namespaces for ti.wikipedia and ti.wiktionary - https://phabricator.wikimedia.org/T251287
[22:53:16] T252754: Rename user namespace in Javanese wikis: 'Panganggo' to 'Naraguna' - https://phabricator.wikimedia.org/T252754
[22:53:35] This will take a while, thankfully the SWAT window is empty
[22:54:12] RoanKattouw: I doubt much is swatting. I think James_F has a beta only batch above.
[22:54:38] That's just my RelEng fix-things-now prerogative. No window involved; no rush.
[22:54:50] James_F: Np
[22:57:06] (PS1) Andrew Bogott: openstack::util::admin_scripts: Include a mariadb client package [puppet] - https://gerrit.wikimedia.org/r/597358
[22:58:22] RoanKattouw: Shall I update the calendar to show this as taking 2 hours and no swat?
[22:58:31] Leave it be.
[22:58:37] Yeah no need
[22:58:39] k
[22:59:27] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (KFrancis) @RLazarus Rodolfo doesn't have an NDA on file. To begin the process, I'll need his: -Full legal name -Mailing address -Email addres...
[22:59:28] (CR) Andrew Bogott: [C: +2] openstack::util::admin_scripts: Include a mariadb client package [puppet] - https://gerrit.wikimedia.org/r/597358 (owner: Andrew Bogott)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200519T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:10:50] RoanKattouw: how we looking?
[23:11:03] scaps are slow, it's still going
[23:11:15] RoanKattouw: Out of the i18n build step yet?
[23:11:21] Yes
[23:11:28] https://usercontent.irccloud-cdn.com/file/BiIUzQwj/Screenshot%20from%202020-05-19%2016-11-10.png
[23:11:51] OK, so it's going slightly faster than normal, as you'd expect at this time of day.
[23:12:23] cool
[23:12:36] James_F: what's normal?
[23:13:13] RhinosF1: 13–20m maybe for the l10n-update step.
[23:13:31] i18n changes not running on the train are very rare for this reason.
[23:13:37] ah
[23:14:38] Sorry, I should have been clearer: the entire scap process can take 40-50 minutes sometimes
[23:14:52] cool
[23:14:52] It can take > 2 hours sometimes.
[23:15:03] But we think that's a bug in rsync.
[23:15:47] James_F: don't say that. It's 00:15 already.
[23:16:12] Some colleagues have found that hitting "Enter" repeatedly can speed that step up.
[23:16:17] Which isn't re-assuring.
[23:16:34] Hey James_F I'm around if you need someone to test https://phabricator.wikimedia.org/T252846 after SWAT
[23:16:58] Jdlrobson: Thanks! I can handle it if you're busy though; thank you for the fix.
[23:17:20] this is important and it's a bit of an obscure feature
[23:18:05] Jdlrobson: Working well on Beta Cluster: https://en.wikivoyage.beta.wmflabs.org/wiki/Tokushima
[23:18:22] sweet
[23:19:06] Jdlrobson: As Roan's doing a full scap, it may take a while. :-(
[23:19:16] Jdlrobson: I saw a ticket to undeploy Extension:Insider fyi https://phabricator.wikimedia.org/T253096 as not working & unused
[23:19:19] want me to create the patch for the branch and put on wikitech:deployments ?
[23:19:27] RhinosF1: yeh i was also wondering about that
[23:19:33] RhinosF1: i couldnt find a usage in production
[23:19:34] :)
[23:19:34] Jdlrobson: Nah, I'm merging the cherry-pick already.
[23:19:39] James_F: ok great!
[23:19:52] Jdlrobson: thought I'd tell you
[23:20:00] RhinosF1: Yeah, my thought was to ask them to confirm they don't want it now that it's working again.
[23:20:02] On wikivoyage they use Template:hasDocent
[23:20:06] but it doesn't seem to have any effect
[23:20:08] enwikivoyage that is
[23:20:11] Right.
[23:20:21] i'm all for removing stuff from production :)
[23:20:23] James_F: good idea. Fixing the bug seems better
[23:20:38] Undeploying is also fine.
[23:20:47] thanks for taking care of this James_F
[23:20:49] But if we're dropping it, we should drop it everywhere and be done with it.
[23:21:28] James_F: and sorry i cant give you a more exact answer about whether https://phabricator.wikimedia.org/T252800 is a train blocker :(
[23:21:46] Yeah, it's a mess, worry not.
[23:26:45] (CR) BPirkle: [C: +1] "Doesn't sound like a separate review is required, but just in case: LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898) (owner: Clarakosi)
[23:29:11] Starting to get somewhere now https://usercontent.irccloud-cdn.com/file/XpXlfJg4/Screenshot%20from%202020-05-19%2016-28-48.png
[23:29:29] Finally.
[23:30:44] RoanKattouw: how long left as a guess?
[23:31:02] Probably 15 minutes?
[23:31:09] About 10 mins?
[23:31:15] cool
[23:31:17] Yeah something like that
[23:31:25] There's the opcache reset check as well, but that's only a minute or so.
[23:39:20] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (diego) @KFrancis , @Rvvalentim is research intern - hired through upwork - in the same situation than the one described on this task: T252129...
[23:39:40] Jdlrobson: Sneaked the patch onto mwdebug1001; it works great. Thank you.
[23:40:18] sweet thanks James_F
[23:40:23] less errors \o/
[23:40:44] Not deployed just yet though. Waiting for the full re-build.
[23:41:13] James_F: on plus side once i've ironed out all these kinks in the hook/skin system it should be much more difficult for skins to cause exceptions in this way
[23:41:19] skins are going to be dumb dumb dumb
[23:41:21] Excellent.
[23:41:36] Maybe we can slim them down all the way to a LESS file. ;-)
[23:46:35] RoanKattouw: How's it going?
[23:46:36] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (KFrancis) @diego Thank you for the clarification. As @Rvvalentim was hired as an intern through Upwork, their NDA would be covered by the full...
[23:46:45] 99%, 1 server left
[23:46:51] Joy.
[23:46:53] :)
[23:47:05] Has it been stuck on that server for a while?
[23:50:28] Moved on to scap-cdb-rebuild now
[23:50:37] Cool.
[23:50:59] * RhinosF1 googles scap-cdb-rebuild
[23:54:22] Operations, Jupyter-Hub, LDAP-Access-Requests: Give access to the JupyterHub (SWAP) notebooks to (@Rvvalentim) - https://phabricator.wikimedia.org/T253155 (diego) Great! Thanks @KFrancis. @RLazarus do we need any further information/documentation?
[23:55:38] !log catrope@deploy1001 Finished scap: i18n scap for namespace localizations (T251287, T252754) (duration: 62m 26s)
[23:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:55:42] T251287: Localised sitenames/namespaces for ti.wikipedia and ti.wiktionary - https://phabricator.wikimedia.org/T251287
[23:55:42] T252754: Rename user namespace in Javanese wikis: 'Panganggo' to 'Naraguna' - https://phabricator.wikimedia.org/T252754
[23:55:51] Aha, finally.
[23:55:54] RoanKattouw: All done?
[23:56:01] Operations, Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (MBeat33) Thank you @Dzahn and @JGulingan @Dzahn the origin of this Task is from a conversation between Donor Services and OIT. We have a persistent issue that I'd appreciate your input on, as well a ques...
[23:56:04] Not quite
[23:56:12] Now I have to test if it works, and do one more config patch
[23:56:14] Oh, you need to run the maintenance script.
[23:56:26] Can I sync the one change I've already prepped?
[23:56:28] Oh yes that too
[23:56:32] Yes go for it
[23:56:36] Ta.
[23:56:56] (CR) Jforrester: [C: +2] Echostore: Fix labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898) (owner: Clarakosi)
[23:56:58] RoanKattouw: once config patch is done, LGTM
[23:56:59] OK things seem to work on jvwiki and tiwiki
[23:57:22] Except for the project namespace on tiwiki which will be addressed by that config patch
[23:57:30] I'll run the script on jvwiki in the meantime
[23:57:31] RoanKattouw: I'll update phab. Yeah, correct.
[23:57:38] And other jv wikis
[23:57:41] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.32/extensions/Insider/includes/InsiderHooks.php: T252846 Use SidebarBeforeOutput hook with correct format (duration: 01m 06s)
[23:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:44] T252846: Wikivoyage deprecation warnings relating to unsupported SidebarBeforeOutput usage - https://phabricator.wikimedia.org/T252846
[23:57:45] (Merged) jenkins-bot: Echostore: Fix labs configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/597348 (https://phabricator.wikimedia.org/T252898) (owner: Clarakosi)
[23:57:55] RoanKattouw: jvwiki and jvwiktionary exist
[23:58:46] OK ran both. jvwiktionary had nothing to fix, jvwiki just had some link table rows, no actual pages
[23:58:58] !log Ran namespaceDupes.php on jvwiki and jvwiktionary for T252754
[23:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:15] (PS8) Catrope: Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) (owner: RhinosF1)