[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T0000). Please do the needful. [00:12:42] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [00:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:10] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:52] RECOVERY - Check systemd state on ms-be1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:03:34] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [02:45:58] PROBLEM - Hadoop NodeManager on analytics1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [02:55:35] !log reverting group1 to 1.35.0-wmf.30 for T252079 [02:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:39] T252079: mw.wikibase.getLabelByLang('Q1','en') returning nil today - https://phabricator.wikimedia.org/T252079 [02:56:41] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group1 wikis to 1.35.0-wmf.30 for T252079 [02:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:58:17] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.35.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594822 (https://phabricator.wikimedia.org/T252079) [02:58:19] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.35.0-wmf.31" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/594822 (https://phabricator.wikimedia.org/T252079) (owner: 10Brennen Bearnes) [02:58:58] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.35.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594822 (https://phabricator.wikimedia.org/T252079) (owner: 10Brennen Bearnes) [02:59:08] mdholloway: see above ^ [04:19:34] (03PS1) 10Marostegui: install_server: Allow reimage of db2078 [puppet] - 10https://gerrit.wikimedia.org/r/594827 (https://phabricator.wikimedia.org/T250666) [04:20:12] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage of db2078 [puppet] - 10https://gerrit.wikimedia.org/r/594827 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [04:47:59] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (10Ovedc) Hello! can you please allow the register users dismiss the notice, once they read it? Thanks! [04:49:28] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (10Marostegui) >>! In T251158#6115189, @Ovedc wrote: > Hello! can you please allow the register users dismiss the notice, once they read it? Thanks! Thanks for the message. However we... [05:00:04] marostegui: Dear deployers, time to do the s3 and s7 primary database master restart deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T0500). 
[05:00:13] \o/ [05:00:19] <_joe_> lol [05:00:22] <_joe_> very appropriate [05:00:32] Going to start [05:00:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s3 and s7 as read-only for maintenance T251158', diff saved to https://phabricator.wikimedia.org/P11166 and previous config saved to /var/cache/conftool/dbconfig/20200507-050046-marostegui.json [05:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:50] T251158: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 [05:01:17] RO confirmed [05:01:18] restarting [05:01:26] At https://en.wikinews.org/wiki/Main_Page the notice still says "will be performed soon" [05:01:36] Is this something a centralnotice admin can fix? [05:01:51] DannyS712: it says the hour too [05:02:07] yes, its just confusing if you don't have a UTC clock [05:02:38] DannyS712: probably comment on the RO task, this is probably the worst time to get that sort of discussion [05:02:48] ok [05:02:55] <_joe_> yeah indeed [05:03:26] s3 finished, waiting for s7 [05:03:53] marostegui: DannyS712 _joe_ hi! I have CentralNotice admin rights, if that's useful... 
[05:04:12] <_joe_> please let this channel be used for the actual SRE operation [05:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s3 and s7 as read-only=off for maintenance T251158', diff saved to https://phabricator.wikimedia.org/P11167 and previous config saved to /var/cache/conftool/dbconfig/20200507-050419-marostegui.json [05:04:20] <_joe_> we're in the middle of a delicate operation [05:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:26] maintenance finished - checking [05:04:45] _joe_ ok good luck, I was just responding to a comment here [05:05:07] I can edit on s3 and s7 [05:05:12] checking some more wikis [05:05:34] <_joe_> errors are back to normal rates marostegui [05:05:46] <_joe_> I don't see edit conflicts or errors AFAICT [05:07:03] Yeah, I have edited a few wikis on s3 and s7 and they work fine [05:07:07] <_joe_> AndyRussG: sorry if you felt offended - but marostegui had just asked to avoid discussing this while we're restarting two db masters. We were both looking at 10 graphs and logs at the same time so any other discussion was distracting in the moment :) [05:07:23] Also I can see stuff happening on recentchanges [05:07:51] @_joe_ my apologies for the poor timing with my comment [05:08:08] DannyS712: ^ and now that we have a bit more time, you probably want to let them know on T251157 as that is something we cannot fix at the moment, especially for all the 900 wikis involved in this operation [05:08:08] T251157: Read only time window needed for wikis on s3 and s7 - https://phabricator.wikimedia.org/T251157 [05:10:45] <_joe_> DannyS712: no problems, we were just very focused in the moment [05:13:21] AndyRussG: Thanks for the offer, but at the moment we were so focused making sure nothing was being done wrongly...especially with primary database masters, which are a big SPOF :) [05:14:35] marostegui _joe_ no offence taken of course! and thanks for explaining... 
My bad, I just got pinged by the term "centralnotice" and didn't get the full context of the discussion. Many apologies for the distraction :) [05:14:47] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (10Marostegui) 05Open→03Resolved This is done. RO started: 05:00:47 RO finished: 05:04:19 [05:14:50] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [05:15:12] <_joe_> I don't see that CN though 🤔 [05:15:23] AndyRussG: no worries at all - much appreciated you were so responsive. Thankfully all went fine, I will follow up the discussion on the read-only task [05:15:56] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [05:16:07] marostegui: _joe_ thanks for working on difficult stuff!!! 
:) [05:17:25] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (10Marostegui) [05:22:57] !log Reimage db2078 [05:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=mysql-misc site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:26:56] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [05:27:06] ^ expected [05:27:13] and the next ones too [05:27:42] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [05:28:34] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [05:33:25] !log restart hadoop yarn nodemanager on analytics1071 [05:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:30] RECOVERY - Hadoop NodeManager on analytics1071 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [05:45:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [05:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:31] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [05:54:41] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover 
servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [05:55:11] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [06:00:05] PROBLEM - MediaWiki centralauth errors on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [06:07:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:08:54] (03PS1) 10Marostegui: Revert "install_server: Allow reimage of db2078" [puppet] - 10https://gerrit.wikimedia.org/r/594833 [06:09:40] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage of db2078" [puppet] - 10https://gerrit.wikimedia.org/r/594833 (owner: 10Marostegui) [06:19:45] 10Operations, 10serviceops: Chaos Engineering - Stop for x hours one or more mc10xx memcached shards - https://phabricator.wikimedia.org/T251378 (10elukey) >>! In T251378#6111482, @Joe wrote: > @elukey let's schedule this test for 6:00Z on monday, May 11th? 
+1 [06:21:26] (03PS1) 10Marostegui: install_server: Reimage dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/594834 [06:22:20] RECOVERY - MediaWiki centralauth errors on graphite1004 is OK: OK: Less than 30.00% above the threshold [0.5] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=3&fullscreen&orgId=1 [06:23:00] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1020 [puppet] - 10https://gerrit.wikimedia.org/r/594834 (owner: 10Marostegui) [06:39:11] (03PS2) 10Jcrespo: mariadb: enable read_only monitoring in parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) [06:45:45] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/22376/" [puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [06:46:53] (03PS1) 10JJMC89: Add the investiagte right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 [06:47:58] (03CR) 10Marostegui: [C: 03+1] mariadb: enable read_only monitoring in parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [06:51:23] 10Operations, 10ops-eqiad: Degraded RAID on kafka-jumbo1001 - https://phabricator.wikimedia.org/T251586 (10elukey) >>! In T251586#6114018, @Cmjohnson wrote: > @elukey I need to power this server off, I am not able to reach the mgmt/idrac for the server and need access to pull the required Dell report. We ca... 
[06:57:33] (03PS3) 10Elukey: profile::kerberos::client: set KRB5CCNAME with default_ccache_name [puppet] - 10https://gerrit.wikimedia.org/r/594726 [06:58:22] (03CR) 10Jcrespo: [C: 03+2] mariadb: enable read_only monitoring in parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/593527 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [06:59:27] (03CR) 10Dzahn: [C: 03+1] "[ldap-corp1001:~] $ /usr/bin/ldapsearch -x "mail=jgiann*"" [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) (owner: 10Cwhite) [07:00:05] (03PS2) 10JJMC89: Add the investiagte right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) [07:00:41] (03CR) 10Dzahn: [C: 03+1] "If he is a contractor we would need an expiration date and contact. Corp LDAP tells me he is a full-time contractor. But sometimes people " [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) (owner: 10Cwhite) [07:03:19] (03PS1) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) [07:03:54] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928 (10jcrespo) [07:04:56] (03PS2) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) [07:05:36] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for apache2 on xhgui [puppet] - 10https://gerrit.wikimedia.org/r/594685 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:13:42] (03PS3) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 
(https://phabricator.wikimedia.org/T172489) [07:17:24] (03PS4) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) [07:19:30] (03CR) 10Marostegui: mariadb: Enable read_only check to page on primary masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [07:22:47] (03PS5) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) [07:23:11] (03CR) 10Jcrespo: "Solved" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [07:24:43] (03PS6) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) [07:29:12] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 143.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [07:30:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [07:33:20] (03PS1) 10Dzahn: phabricator: no dumps when running in cloud [puppet] - 10https://gerrit.wikimedia.org/r/594892 [07:34:10] (03CR) 10Muehlenhoff: [C: 03+1] "The "full time contractor" in corp LDAP in most cases still mean regular staff status, only with Safeguard. 
https://office.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) (owner: 10Cwhite) [07:36:41] PROBLEM - MariaDB read only pc2 on pc1008 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:38:48] ^working on that [07:42:56] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) (owner: 10Cwhite) [07:43:40] (03CR) 10Dzahn: [C: 03+2] "This is what caused those emails about the failed dump." [puppet] - 10https://gerrit.wikimedia.org/r/594892 (owner: 10Dzahn) [07:44:06] !log further decrease weight for ms-be10[678] - T252008 [07:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:09] T252008: Decom ms-be101[678] - https://phabricator.wikimedia.org/T252008 [07:44:27] PROBLEM - MariaDB read only pc1 on pc1010 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:44:41] RECOVERY - MariaDB read only pc2 on pc1008 is OK: Version 10.4.12-MariaDB-log, Uptime 2337586s, read_only: False, 4144.55 QPS, connection latency: 0.003184s, query latency: 0.000744s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:45:43] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594726 (owner: 10Elukey) [07:48:30] RECOVERY - MariaDB read only pc1 on pc1010 is OK: Version 10.1.43-MariaDB, Uptime 12699699s, read_only: False, 828.23 QPS, connection latency: 0.004686s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:50:18] (03PS4) 10Elukey: profile::kerberos::client: set KRB5CCNAME with default_ccache_name [puppet] - 10https://gerrit.wikimedia.org/r/594726 [07:51:10] (03PS1) 10Elukey: Use new kerberos credential cache on stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/594893 [07:51:42] 10Operations, 
10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) @hashar So.. when do we schedule the ma... [07:55:15] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22378/stat1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594893 (owner: 10Elukey) [07:55:48] PROBLEM - MariaDB read only pc3 on pc2009 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 13304017s, 637.73 QPS, connection latency: 0.002251s, query latency: 0.000414s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:55:48] PROBLEM - MariaDB read only pc1 on pc2007 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.1.43-MariaDB, Uptime 13304419s, 710.75 QPS, connection latency: 0.004397s, query latency: 0.000563s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:56:08] ^those are real problems [07:56:15] but not user impacting, fixing now [07:59:05] (03PS9) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [08:00:08] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [08:02:33] (03PS10) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [08:03:33] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [08:05:15] (03PS11) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [08:06:17] (03CR) 
10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [08:06:41] !log setting pc2007, pc2009 as read-write [08:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:26] (03CR) 10Muehlenhoff: [C: 03+1] "> ACK, it seems we have no real way to tell who is actually a contractor then?" [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) (owner: 10Cwhite) [08:07:28] RECOVERY - MariaDB read only pc1 on pc2007 is OK: Version 10.1.43-MariaDB, Uptime 13305119s, read_only: False, 1037.76 QPS, connection latency: 0.002482s, query latency: 0.000508s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:07:28] RECOVERY - MariaDB read only pc3 on pc2009 is OK: Version 10.1.43-MariaDB, Uptime 13304718s, read_only: False, 746.78 QPS, connection latency: 0.002460s, query latency: 0.000464s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:07:43] the deployement_server role is broken in cloud because of changes related to the mcrouter gutter pool.. hmmm [08:07:56] (03PS12) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [08:07:59] trying to figure out what i need for: []' is not applicable to an Undef Value. mediawiki/mcrouter_wancache.pp, line: 19 [08:08:19] ^we recovered after fix [08:08:21] kind of surprised it affects deployment_server though [08:14:56] (03CR) 10Gehel: [C: 04-1] "Minor exception management issue, see comment inline." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [08:15:36] <_joe_> jouncebot: next? 
[08:15:39] <_joe_> jouncebot: next [08:15:39] In 2 hour(s) and 44 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1100) [08:24:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great, let's give it a shot!" [puppet] - 10https://gerrit.wikimedia.org/r/594689 (https://phabricator.wikimedia.org/T233950) (owner: 10Jbond) [08:32:50] !log upgrading restbase-dev to latest OpenJDK security update [08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:10] (03PS1) 10Ema: vcl: set large_objects_cutoff in hiera [puppet] - 10https://gerrit.wikimedia.org/r/594896 (https://phabricator.wikimedia.org/T249809) [08:37:01] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Nikolaos Gkountas - https://phabricator.wikimedia.org/T252100 (10ngkountas) [08:37:50] (03CR) 10Ema: "vcl noop: https://puppet-compiler.wmflabs.org/compiler1003/22379/" [puppet] - 10https://gerrit.wikimedia.org/r/594896 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [08:46:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:47:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:51:20] (03PS1) 10Dzahn: cloud/devtools: fix puppet run by adding mcrouter gutter hash in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/594900 [08:52:07] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: fix puppet run by adding mcrouter gutter hash in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/594900 (owner: 10Dzahn) [09:01:40] (03PS1) 10Dzahn: deployment_server: in cloud we need more space in /srv [puppet] - 10https://gerrit.wikimedia.org/r/594903 [09:05:21] (03PS1) 10Tchanders: Group CheckUser rights together in CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594904 [09:07:14] (03PS7) 10Jcrespo: mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) [09:07:16] (03PS1) 10Jcrespo: mariadb: Enable read_only monitoring on misc hosts [puppet] - 10https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489) [09:11:49] !log roll restart cassandra on aqs1005 to pick up new openjdk upgrades (canary) [09:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:13] (03CR) 10Tchanders: [C: 03+1] Add the investiagte right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [09:19:59] (03CR) 10Elukey: [C: 03+2] profile::kerberos::client: set KRB5CCNAME with default_ccache_name [puppet] - 10https://gerrit.wikimedia.org/r/594726 (owner: 10Elukey) [09:20:09] (03CR) 10Elukey: [C: 03+2] Use new kerberos credential cache on stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/594893 (owner: 10Elukey) [09:20:22] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:23:00] 10Operations, 10ops-requests: GeoIP Puppet Module Fails in Labs - https://phabricator.wikimedia.org/T83447 (10Dzahn) [09:24:54] (03PS3) 10Tchanders: Add the investigate right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [09:25:48] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22727 bytes in 7.414 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:27:20] (03PS1) 10Ayounsi: Remove Juniper report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/594908 [09:29:00] (03PS2) 10Ayounsi: Remove Juniper report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/594908 [09:34:12] (03PS1) 10Elukey: profile::kerberos::client: fix credential cache location and env variable [puppet] - 10https://gerrit.wikimedia.org/r/594909 [09:36:41] (03CR) 10Volans: [C: 03+1] ":/" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/594908 (owner: 10Ayounsi) [09:37:05] (03PS6) 10Ayounsi: Juniper to Netbox import script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/566812 [09:38:01] (03CR) 10Ayounsi: [C: 03+2] Remove Juniper report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/594908 (owner: 10Ayounsi) [09:38:38] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:39:39] (03CR) 10Elukey: [C: 03+2] profile::kerberos::client: fix credential cache location and env variable [puppet] - 10https://gerrit.wikimedia.org/r/594909 (owner: 10Elukey) [09:40:38] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22735 bytes in 0.275 
second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:43:22] (03PS1) 10Dzahn: cloud/devtools: add Hiera key/values for puppet master [puppet] - 10https://gerrit.wikimedia.org/r/594912 [09:47:20] (03PS1) 10Dzahn: cloud/devtools: add missing vcs::addresses for phab instances [puppet] - 10https://gerrit.wikimedia.org/r/594913 [09:50:12] (03PS1) 10Filippo Giunchedi: role: rename thanos::query in thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/594914 (https://phabricator.wikimedia.org/T233956) [09:51:29] (03CR) 10Filippo Giunchedi: [C: 03+2] role: rename thanos::query in thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/594914 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:52:51] (03PS2) 10Dzahn: deployment_server: in cloud we need more space in /srv [puppet] - 10https://gerrit.wikimedia.org/r/594903 [09:53:00] (03PS1) 10Dzahn: cloud/devtools: fix phabricator domain name and cluster search [puppet] - 10https://gerrit.wikimedia.org/r/594915 [10:03:39] (03PS1) 10Dzahn: cloud/devtools: fix Hiera values for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/594916 [10:05:22] (03CR) 10Dzahn: "Once this is merged we can remove the second role from the instance and it should just work with the single role like in prod." [puppet] - 10https://gerrit.wikimedia.org/r/594903 (owner: 10Dzahn) [10:06:01] (03CR) 10Dzahn: "adjusted to existing values in Horizon Hiera. once this is merged, please remove from there." 
[puppet] - 10https://gerrit.wikimedia.org/r/594915 (owner: 10Dzahn) [10:06:38] (03CR) 10Dzahn: "once merged let's remove from Horizon Hiera so it all just works by applying the role and that's it" [puppet] - 10https://gerrit.wikimedia.org/r/594913 (owner: 10Dzahn) [10:07:08] (03CR) 10Dzahn: "once merged let's remove all this stuff from Horizon" [puppet] - 10https://gerrit.wikimedia.org/r/594912 (owner: 10Dzahn) [10:07:30] !log installing Java security updates on restbase/sessionstore [10:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:27] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Dzahn) >>! In T251726#6109023, @Vgutierrez wrote: > the icinga check on cp hosts currently warns 30 days before and goes critical 15 days before cert... [10:13:15] (03CR) 10Dzahn: "once this is deployed i recommend running an actual command (for example "schedule downtime" for an hour on something random) from the web" [puppet] - 10https://gerrit.wikimedia.org/r/594771 (https://phabricator.wikimedia.org/T251572) (owner: 10Ryan Kemper) [10:14:52] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:15:06] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:17:21] XioNoX: it's Telia again, not scheduled (CenturyLink is in scheduled maint) [10:17:30] will make a ticket, right [10:17:38] thx! 
[10:18:10] * mutante recycles ticket for the same circuit [10:18:28] (03PS1) 10Filippo Giunchedi: prometheus: add thanos sidecar to k8s instance [puppet] - 10https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) [10:18:51] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) 05Resolved→03Open IC-313592 is down again Transport: cr2-eqord:xe-0/1/1 (Telia, IC-313592, 51ms 10Gbps wave) {#1062}; Transport: cr3-ulsfo:xe-0/1/1 (Telia, IC-313592, 51ms 10Gbps wave)... [10:19:26] (03PS1) 10Hnowlan: changeprop: fix scoping issues in templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/594921 [10:20:07] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259#6115578 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:07] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259#6115578 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:20:24] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 7 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10hnowlan) a:05holger.knust→03hnowlan [10:21:17] (03CR) 10Filippo Giunchedi: "This is effectively a noop now, in the sense that the query component isn't active yet so the sidecar won't receive queries" [puppet] - 10https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:22:02] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:23:18] 10Operations, 10SRE-Access-Requests: Requesting Access to sites from Google Search Console - https://phabricator.wikimedia.org/T251128 (10ahemmer) No probs @colewhite ! @DZierten Would you kindly approve this request please? [10:23:36] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10Kormat) It is (probably) possible to work around this by not setting `partman-auto/method` and using `partman/early_command` to prepopulate the metadata that partman uses. It's... [10:24:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:24:21] 10Operations, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10Dzahn) sent mail to Telia [10:24:30] mutante: dunno if you did it here, but I usually put an expiration time to the ACK. That way if the circuit doesn't come back for any reason, we will not forget [10:25:07] (03PS1) 10JMeybohm: _tls_helpers: Use defaults provided in docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/594922 [10:25:50] XioNoX: an ACK should be removed on "next state change" even without doing that. that's why i like them so much better than disabled notifications. no need to remember it when it changes [10:26:07] "This is to inform you that we observed some flaps in our WDM equipment and we suspect a card problem in Palo Alto, USA. Investigation is now ongoing with our vendor and we will keep you posted. 
" [10:26:21] mutante: but if they don't come back up then we might forget about them [10:26:29] it happened at least one on circuits [10:26:31] (03PS1) 10Ema: Revert "vcl: test 'exp' admission policy on two nodes" [puppet] - 10https://gerrit.wikimedia.org/r/594923 (https://phabricator.wikimedia.org/T249809) [10:26:32] once* [10:28:45] ok, i can add an expiration but doesn't this negate the whole "we make tickets because those are not forgotten" thing? [10:29:17] well, it already came back anyways because it's flapping :p [10:29:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:29:36] (03CR) 10Ema: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22381/" [puppet] - 10https://gerrit.wikimedia.org/r/594923 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [10:30:22] ACKNOWLEDGEMENT - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 - The acknowledgement expires at: 2020-05-11 10:29:46. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:30:22] ACKNOWLEDGEMENT - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn https://phabricator.wikimedia.org/T221259 - The acknowledgement expires at: 2020-05-11 10:29:46. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:30:49] aha, it even says the date here on IRC, picked Monday [10:33:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:33:12] (03CR) 10JMeybohm: "The downside of this is that looking at the deployment does not give a direct idea of where the admin interface is bound to. Plus, telemet" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594922 (owner: 10JMeybohm) [10:36:11] (03PS1) 10JMeybohm: wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) [10:36:55] (03PS3) 10Dzahn: remove IPs of recently decom'ed appservers in eqiad D5 [dns] - 10https://gerrit.wikimedia.org/r/583377 (https://phabricator.wikimedia.org/T247780) [10:38:20] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:38:32] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:38:56] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) a:05Joe→03JMeybohm [10:40:52] (03CR) 10Muehlenhoff: Initial debian commit (032 comments) [debs/anaconda] (debian) - 10https://gerrit.wikimedia.org/r/594204 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [10:41:54] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 76, down: 0, dormant: 0, excluded: 0, 
unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:42:06] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:44:43] (03PS5) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [10:49:32] (03PS6) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [10:50:42] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:51:07] (03PS7) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [10:56:19] (03PS8) 10Jbond: pcc: add ability to parse commit messages for `Hosts:` lines [puppet] - 10https://gerrit.wikimedia.org/r/594708 [10:57:42] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 22737 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [10:59:18] (03CR) 10Jbond: "updated thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594708 (owner: 10Jbond) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1100). [11:00:04] matthiasmullie: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] o/ [11:00:24] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10Marostegui) I think it would merit a PoC at least, so we can evaluate if it is 100% possible, so we can decide how to proceed further. We need to involve the Infra Foundations... [11:01:13] I'll go ahead and deploy my patch [11:02:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 123 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:03:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594797 (owner: 10RLazarus) [11:07:16] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:07:33] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/WikibaseMediaInfo/: [MediaInfo] Add dummy concept chips without thumbnail (duration: 01m 09s) [11:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:00] (03CR) 10Marostegui: [C: 03+1] "Let's deploy with puppet stopped?" [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:08:08] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 36 probes of 554 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:08:47] (03PS2) 10Jcrespo: mariadb: Enable read_only monitoring on misc hosts [puppet] - 10https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489) [11:09:26] nothing else to deploy this eu swat, iirc? 
[11:10:26] !log EU swat done [11:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:21] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: remove package_from_component define [puppet] - 10https://gerrit.wikimedia.org/r/594925 (https://phabricator.wikimedia.org/T251297) [11:11:27] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce hiera support for selecting repo component [puppet] - 10https://gerrit.wikimedia.org/r/594926 (https://phabricator.wikimedia.org/T250866) [11:14:26] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22740 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:14:53] (03PS1) 10Cmjohnson: Adding kubernetes1007=1014 to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/594927 (https://phabricator.wikimedia.org/T241850) [11:15:56] (03CR) 10Cmjohnson: [C: 03+2] Adding kubernetes1007=1014 to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/594927 (https://phabricator.wikimedia.org/T241850) (owner: 10Cmjohnson) [11:16:12] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/22387/" [puppet] - 10https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:20:01] (03CR) 10Marostegui: [C: 03+1] "I have verified that nagios can connect thru socket too" [puppet] - 10https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:22:43] (03CR) 10Marostegui: [C: 03+1] "All masters have nagios connecting via socket" [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [11:23:38] (03PS1) 10Cmjohnson: Adding new kubernetes nodes to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594928 (https://phabricator.wikimedia.org/T241850) [11:24:05] 10Operations, 
10SRE-Access-Requests: Requesting access to LogStash for Nikolaos Gkountas - https://phabricator.wikimedia.org/T252100 (10Arrbee) Hello, this is an approved request as Nik needs to work on Content Translation issues. Thanks. [11:25:24] (03PS2) 10Cmjohnson: Adding new kubernetes nodes to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594928 (https://phabricator.wikimedia.org/T241850) [11:26:30] (03PS2) 10Arturo Borrero Gonzalez: kubeadm: remove package_from_component define [puppet] - 10https://gerrit.wikimedia.org/r/594925 (https://phabricator.wikimedia.org/T251297) [11:26:32] (03PS2) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce hiera support for selecting repo component [puppet] - 10https://gerrit.wikimedia.org/r/594926 (https://phabricator.wikimedia.org/T250866) [11:26:37] (03CR) 10Cmjohnson: [C: 03+2] Adding new kubernetes nodes to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594928 (https://phabricator.wikimedia.org/T241850) (owner: 10Cmjohnson) [11:27:16] (03PS5) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [11:28:56] (03CR) 10jerkins-bot: [V: 04-1] sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [11:29:26] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) [11:30:28] (03PS6) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [11:30:51] (03CR) 10Jbond: "updated" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [11:31:21] !log enable 
ferm-status script https://gerrit.wikimedia.org/r/c/operations/puppet/+/576102 [11:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:25] (03CR) 10Jbond: [C: 03+2] ferm: enable ferm status script [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [11:31:38] (03PS4) 10Jbond: ferm: enable ferm status script [puppet] - 10https://gerrit.wikimedia.org/r/576102 (https://phabricator.wikimedia.org/T206951) [11:33:30] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10Cmjohnson) [11:33:32] !log imported component/puppet5 for jessie-wikimedia into "main" [11:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:39] (03CR) 10Jbond: [C: 03+2] apereo_cas: add support for external tomcat instance [puppet] - 10https://gerrit.wikimedia.org/r/594689 (https://phabricator.wikimedia.org/T233950) (owner: 10Jbond) [11:42:28] (03PS1) 10Jbond: tomcat: fix closing xml tags [puppet] - 10https://gerrit.wikimedia.org/r/594930 [11:43:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/594930 (owner: 10Jbond) [11:44:06] (03PS2) 10Jbond: tomcat: fix closing xml tags [puppet] - 10https://gerrit.wikimedia.org/r/594930 [11:44:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] tomcat: fix closing xml tags [puppet] - 10https://gerrit.wikimedia.org/r/594930 (owner: 10Jbond) [11:49:40] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:51] ^^ this is me [11:50:16] PROBLEM - Check systemd state on idp-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:07] ACKNOWLEDGEMENT - Check systemd state on idp-test1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. John Bond jbond testing tomcat https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:07] ACKNOWLEDGEMENT - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. John Bond jbond testing tomcat https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:51:41] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1001.eqiad.wmnet'... [11:51:53] (03CR) 1020after4: [C: 03+1] phabricator: Drop phd.pid-directory as it's now uneeded [puppet] - 10https://gerrit.wikimedia.org/r/594162 (owner: 10Paladox) [11:52:21] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1002.eqiad.wmnet ` The log... [11:53:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [11:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:56] (03PS1) 10Jbond: apere_cas: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/594931 [11:54:40] (03CR) 10jerkins-bot: [V: 04-1] apere_cas: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/594931 (owner: 10Jbond) [11:55:21] (03CR) 1020after4: "Do we even use excel export? I don't remember this for some reason." 
[puppet] - 10https://gerrit.wikimedia.org/r/594157 (owner: 10Paladox) [11:55:42] (03CR) 10Muehlenhoff: apere_cas: fix dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594931 (owner: 10Jbond) [11:55:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:42] jouncebot: now [11:56:42] For the next 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1100) [11:56:56] jouncebot: next [11:56:56] In 1 hour(s) and 3 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1300) [11:57:21] I'm going to add the fix for the train blocker to the end of swat [11:58:43] (03PS2) 10Jbond: apere_cas: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/594931 [11:58:45] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594931 (owner: 10Jbond) [11:59:56] (03CR) 10Jbond: [C: 03+2] apere_cas: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/594931 (owner: 10Jbond) [12:01:09] (03PS1) 10Dzahn: beta: fix deployment_server puppet by adding empty mcrouter gutter shard [puppet] - 10https://gerrit.wikimedia.org/r/594932 [12:02:09] (03CR) 10Dzahn: [C: 03+2] beta: fix deployment_server puppet by adding empty mcrouter gutter shard [puppet] - 10https://gerrit.wikimedia.org/r/594932 (owner: 10Dzahn) [12:02:18] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:56] RECOVERY - Check systemd state on idp-test1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:15] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: 
TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1001.eqiad.wmnet ` The log... [12:08:40] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1002.eqiad.wmnet'... [12:10:01] (03CR) 10Volans: sre.wdqs.data-transfer: manage ferm rules required for transfer (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [12:10:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [12:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] (03CR) 10Dzahn: [C: 03+2] deployment_server: in cloud we need more space in /srv [puppet] - 10https://gerrit.wikimedia.org/r/594903 (owner: 10Dzahn) [12:12:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:19] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/Wikibase: [[gerrit:594920]] T252079 Revert "Move prefetching-term-lookup-callback service wiring" (duration: 01m 12s) [12:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:22] T252079: mw.wikibase.getLabelByLang('Q1','en') returning nil today - https://phabricator.wikimedia.org/T252079 [12:14:46] (03PS1) 10Joal: Add analytics pageview-actors data purge [puppet] - 10https://gerrit.wikimedia.org/r/594933 (https://phabricator.wikimedia.org/T247344) [12:15:25] elukey: --^ for when you have time [12:15:33] cmjohnson1: you're reimaging hosts without a role, that doesn't work [12:15:36] see 
https://puppetboard.wikimedia.org/report/an-druid1001.eqiad.wmnet/dc92b35a26f491f799e748811fafaa8e08ae3f43 [12:15:51] (03PS1) 10Marostegui: install_server: Reimage pc1010|pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/594934 [12:16:06] thanks volans! forgot to do that [12:16:23] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage pc1010|pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/594934 (owner: 10Marostegui) [12:17:25] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-druid1001.eqiad.wmnet'... [12:18:19] (03PS1) 10Marostegui: install_server: Reimage pc1010 and pc2010 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/594935 [12:21:24] (03PS1) 10Jbond: apereo_cas: alow use to override daemon user [puppet] - 10https://gerrit.wikimedia.org/r/594936 [12:22:24] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: alow use to override daemon user [puppet] - 10https://gerrit.wikimedia.org/r/594936 (owner: 10Jbond) [12:23:50] (03PS2) 10Jbond: apereo_cas: alow use to override daemon user [puppet] - 10https://gerrit.wikimedia.org/r/594936 [12:25:04] (03CR) 10Jbond: [C: 03+2] apereo_cas: alow use to override daemon user [puppet] - 10https://gerrit.wikimedia.org/r/594936 (owner: 10Jbond) [12:25:46] (03PS1) 10Elukey: Revert "Use new kerberos credential cache on stat1005" [puppet] - 10https://gerrit.wikimedia.org/r/594937 [12:26:14] (03PS1) 10Cmjohnson: Adding an-druid100[12] and druid100[78] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/594938 (https://phabricator.wikimedia.org/T245569) [12:26:36] (03CR) 10Elukey: [C: 03+2] Revert "Use new kerberos credential cache on stat1005" [puppet] - 10https://gerrit.wikimedia.org/r/594937 (owner: 10Elukey) [12:26:57] (03CR) 10Dzahn: [C: 03+1] Adding an-druid100[12] and 
druid100[78] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/594938 (https://phabricator.wikimedia.org/T245569) (owner: 10Cmjohnson) [12:27:11] !log zpapierski@deploy1001 Started deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI [12:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:16] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage pc1010 and pc2010 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/594935 (owner: 10Marostegui) [12:29:12] (03CR) 10Cmjohnson: [C: 03+2] Adding an-druid100[12] and druid100[78] to site.pp insetup role [puppet] - 10https://gerrit.wikimedia.org/r/594938 (https://phabricator.wikimedia.org/T245569) (owner: 10Cmjohnson) [12:31:56] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10Cmjohnson) [12:32:13] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) [12:32:54] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) [12:33:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:35:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:37:46] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: fix Hiera values for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/594916 (owner: 10Dzahn) [12:38:22] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add Hiera key/values for puppet master [puppet] - 10https://gerrit.wikimedia.org/r/594912 (owner: 10Dzahn) [12:38:25] (03PS1) 10Elukey: Assign role::statistics::explorer to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/594941 (https://phabricator.wikimedia.org/T249754) [12:41:29] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: add missing vcs::addresses for phab instances [puppet] - 10https://gerrit.wikimedia.org/r/594913 (owner: 10Dzahn) [12:42:11] (03CR) 10Vgutierrez: [C: 03+1] vcl: set large_objects_cutoff in hiera [puppet] - 10https://gerrit.wikimedia.org/r/594896 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [12:43:20] (03PS1) 10Cmjohnson: Adding new restbase1028-1030 servers to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594942 (https://phabricator.wikimedia.org/T241784) [12:43:31] !log zpapierski@deploy1001 Finished deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI (duration: 16m 20s) [12:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:01] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1001.... 
[12:44:11] (03CR) 10Dzahn: [C: 03+1] Adding new restbase1028-1030 servers to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594942 (https://phabricator.wikimedia.org/T241784) (owner: 10Cmjohnson) [12:44:22] (03PS2) 10Cmjohnson: Adding new restbase1028-1030 servers to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594942 (https://phabricator.wikimedia.org/T241784) [12:44:56] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: fix phabricator domain name and cluster search [puppet] - 10https://gerrit.wikimedia.org/r/594915 (owner: 10Dzahn) [12:45:03] (03CR) 10Cmjohnson: [C: 03+2] Adding new restbase1028-1030 servers to site.pp role insetup [puppet] - 10https://gerrit.wikimedia.org/r/594942 (https://phabricator.wikimedia.org/T241784) (owner: 10Cmjohnson) [12:45:07] (03PS2) 10Dzahn: cloud/devtools: fix phabricator domain name and cluster search [puppet] - 10https://gerrit.wikimedia.org/r/594915 [12:47:16] (03PS1) 10Gilles: Optimise all static PNGs losslessly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) [12:48:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [12:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:30] (03PS2) 10Dzahn: cloud/devtools: fix Hiera values for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/594916 [12:50:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:15] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) [12:54:28] (03PS1) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce support for selecting repository component [puppet] - 10https://gerrit.wikimedia.org/r/594945 
(https://phabricator.wikimedia.org/T250866) [12:54:42] (03PS2) 10Elukey: Assign role::statistics::explorer to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/594941 (https://phabricator.wikimedia.org/T249754) [12:57:45] PROBLEM - Check that envoy is running on idp-test2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [12:58:22] (03PS1) 10Dzahn: create profile to install a generic mariadb server [puppet] - 10https://gerrit.wikimedia.org/r/594946 [12:59:42] !log zpapierski@deploy1001 Started deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI - new servers [12:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:05] brennen and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1300). [13:01:21] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-druid1002.... 
[13:01:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:25] !log zpapierski@deploy1001 Finished deploy [wdqs/wdqs@94906d0]: Deploy WDQS 0.3.28 + GUI - new servers (duration: 02m 43s) [13:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:35] (03PS3) 10Elukey: Assign role::statistics::explorer to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/594941 (https://phabricator.wikimedia.org/T249754) [13:02:49] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:05] !log disabling puppet on all db hosts to control deployment of new paging alert T172489 [13:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:08] T172489: Monitor read_only on all databases, make it page on masters - https://phabricator.wikimedia.org/T172489 [13:04:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:19] PROBLEM - Check no envoy runtime configuration is left persistent on idp-test2001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:07:10] (03CR) 10Jcrespo: [C: 03+2] mariadb: Enable read_only check to page on primary masters [puppet] - 10https://gerrit.wikimedia.org/r/594885 (https://phabricator.wikimedia.org/T172489) (owner: 10Jcrespo) [13:08:09] train time [13:08:12] (03PS1) 10Hashar: group1 wikis to 1.35.0-wmf.31 (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594947 (https://phabricator.wikimedia.org/T249963) [13:08:15] 10Operations, 10ops-eqiad: Degraded RAID on kafka-jumbo1001 - 
https://phabricator.wikimedia.org/T251586 (10Cmjohnson) @elukey I am not sure what was wrong yesterday but I didn't have any issues accessing the idrac today and downloaded the report. I opened the ticket with Dell. Ticket #SR1024511445 [13:09:16] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1001.eqiad.wmnet'] ` and were **ALL** successful. [13:09:40] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.35.0-wmf.31 (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594947 (https://phabricator.wikimedia.org/T249963) (owner: 10Hashar) [13:09:56] (03CR) 10Elukey: [C: 03+2] Add analytics pageview-actors data purge [puppet] - 10https://gerrit.wikimedia.org/r/594933 (https://phabricator.wikimedia.org/T247344) (owner: 10Joal) [13:10:15] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10Patch-For-Review: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` druid1007.eqi... [13:10:57] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.31 (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594947 (https://phabricator.wikimedia.org/T249963) (owner: 10Hashar) [13:12:14] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` druid1008.eqiad.wmnet ` The log can... 
[13:12:51] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [13:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:42] (03CR) 10Paladox: cloud/devtools: fix Hiera values for Gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594916 (owner: 10Dzahn) [13:19:02] 10Operations, 10SRE-Access-Requests: Requesting Access to sites from Google Search Console - https://phabricator.wikimedia.org/T251128 (10DZierten) Absolutely - approved. [13:19:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:57] (03CR) 10Dzahn: cloud/devtools: fix Hiera values for Gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594916 (owner: 10Dzahn) [13:22:13] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10Jclark-ctr) @JHedden finished replacement of backplane, raid card, and drive 9 [13:26:05] (03PS1) 10Dzahn: cloud/devtools: change gerrit host name to include project [puppet] - 10https://gerrit.wikimedia.org/r/594949 [13:27:35] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-druid1002.eqiad.wmnet'] ` and were **ALL** successful. 
[13:28:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:17] (03CR) 10Paladox: [C: 03+1] cloud/devtools: change gerrit host name to include project [puppet] - 10https://gerrit.wikimedia.org/r/594949 (owner: 10Dzahn) [13:29:23] (03CR) 10Dzahn: [C: 03+2] cloud/devtools: change gerrit host name to include project [puppet] - 10https://gerrit.wikimedia.org/r/594949 (owner: 10Dzahn) [13:30:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [13:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:09] (03CR) 10Ppchelko: [C: 03+1] changeprop: fix scoping issues in templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/594921 (owner: 10Hnowlan) [13:33:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:08] (03PS4) 10Giuseppe Lavagetto: Add the ability to consume from kafka [software/purged] - 10https://gerrit.wikimedia.org/r/594147 (https://phabricator.wikimedia.org/T133821) [13:34:10] (03PS6) 10Giuseppe Lavagetto: Add integration tests using docker-compose [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) [13:34:12] (03PS1) 10Giuseppe Lavagetto: Improve the kafka consumer interface [software/purged] - 10https://gerrit.wikimedia.org/r/594953 [13:34:56] (03CR) 10Dzahn: cloud/devtools: fix Hiera values for Gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594916 (owner: 10Dzahn) [13:36:13] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['druid1007.eqiad.wmnet'] ` and were **ALL** successful. [13:36:25] RECOVERY - Check no envoy runtime configuration is left persistent on idp-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:37:06] (03CR) 10Ottomata: [C: 03+1] Assign role::statistics::explorer to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/594941 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [13:37:51] (03PS1) 10Dzahn: phabricator: in cloud, use a local database server [puppet] - 10https://gerrit.wikimedia.org/r/594954 [13:38:23] RECOVERY - Check systemd state on ms-be1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:52] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['druid1008.eqiad.wmnet'] ` and were **ALL** successful. 
[13:40:50] (03PS1) 10Jbond: tomcat: add RemoteIPValve [puppet] - 10https://gerrit.wikimedia.org/r/594956 [13:42:12] (03CR) 10Jbond: [C: 03+2] tomcat: add RemoteIPValve [puppet] - 10https://gerrit.wikimedia.org/r/594956 (owner: 10Jbond) [13:42:50] (03PS1) 10Elukey: Add fake keytabs for analytics-search on stat100x [labs/private] - 10https://gerrit.wikimedia.org/r/594957 [13:43:11] (03PS3) 10Ottomata: Add camus job event_dynamic_stream_configs [puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) [13:43:14] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add fake keytabs for analytics-search on stat100x [labs/private] - 10https://gerrit.wikimedia.org/r/594957 (owner: 10Elukey) [13:43:25] (03PS1) 10Papaul: Add thanos-fe200[1-3] MAC address, partman recipe and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/594958 (https://phabricator.wikimedia.org/T251635) [13:44:55] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:28] (03PS4) 10Elukey: Assign role::statistics::explorer to stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/594941 (https://phabricator.wikimedia.org/T249754) [13:45:34] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/22394/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594941 (https://phabricator.wikimedia.org/T249754) (owner: 10Elukey) [13:45:46] (03PS1) 10Jbond: ferm: change ferm-status owner/group to root:root [puppet] - 10https://gerrit.wikimedia.org/r/594959 [13:46:56] (03CR) 10Ottomata: [C: 03+2] Add camus job event_dynamic_stream_configs [puppet] - 10https://gerrit.wikimedia.org/r/594565 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:48:44] elukey: i see your keytab change to puppet-merge... ok to do so? [13:48:51] labs private [13:49:02] i think so, merging! 
[13:49:03] (03PS2) 10Papaul: Add thanos-fe200[1-3] MAC address, and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/594958 (https://phabricator.wikimedia.org/T251635) [13:49:46] ottomata: thanks! [13:51:38] 10Operations, 10ops-requests: GeoIP Puppet Module Fails in Labs - https://phabricator.wikimedia.org/T83447 (10Krenair) I wonder if `modules/puppetmaster/manifests/geoip.pp`'s `file { $geoip_destdir:` should set owner/groups. Right now: ` root@cloud-puppetmaster-03:~# ls -lh /var/lib/puppet/volatile total 8.0K... [13:52:22] (03CR) 10Papaul: [C: 03+2] Add thanos-fe200[1-3] MAC address, and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/594958 (https://phabricator.wikimedia.org/T251635) (owner: 10Papaul) [13:52:33] (03PS3) 10Papaul: Add thanos-fe200[1-3] MAC address, and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/594958 (https://phabricator.wikimedia.org/T251635) [13:52:36] (03CR) 10Papaul: [V: 03+2 C: 03+2] Add thanos-fe200[1-3] MAC address, and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/594958 (https://phabricator.wikimedia.org/T251635) (owner: 10Papaul) [13:52:47] (03CR) 10Dzahn: [C: 03+2] "This is just like the existing class made for quarry but under a generic name and added the 2 parameters." 
[puppet] - 10https://gerrit.wikimedia.org/r/594946 (owner: 10Dzahn) [13:53:31] (03PS13) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [13:54:33] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [13:55:13] papaul: i typed "multiple" and merged both [13:55:38] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Certificate *.wikipedia.org valid until 2020-06-20 - https://phabricator.wikimedia.org/T251726 (10Vgutierrez) if every LE certificate checked by that icinga check it's issued by acme-chief then yes, it's good [13:55:50] mutante: thanks [13:57:31] 10Operations, 10SRE-swift-storage, 10observability: swift backend decomms / rebalances are noisy - https://phabricator.wikimedia.org/T221904 (10CDanis) 05Open→03Resolved Optimistically resolving this ticket, I think both https://gerrit.wikimedia.org/r/c/operations/puppet/+/530080 and T222366 have fixed it. 
[13:57:59] (03PS2) 10Dzahn: phabricator: in cloud, use a local database server [puppet] - 10https://gerrit.wikimedia.org/r/594954 [13:58:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] ferm: change ferm-status owner/group to root:root [puppet] - 10https://gerrit.wikimedia.org/r/594959 (owner: 10Jbond) [13:58:39] (03PS7) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [13:59:25] (03PS1) 10Andrew Bogott: Make cloudcontrol2004-dev the one openstack and glance controller [puppet] - 10https://gerrit.wikimedia.org/r/594960 (https://phabricator.wikimedia.org/T252121) [13:59:48] (03PS14) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [13:59:53] (03CR) 10Jbond: "updated thanks" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:00:06] (03CR) 10Jbond: [C: 03+2] ferm: change ferm-status owner/group to root:root [puppet] - 10https://gerrit.wikimedia.org/r/594959 (owner: 10Jbond) [14:00:21] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudcontrol2004-dev the one openstack and glance controller [puppet] - 10https://gerrit.wikimedia.org/r/594960 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [14:00:45] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [14:00:50] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) 
[14:01:37] (03CR) 10Dzahn: [C: 03+2] phabricator: in cloud, use a local database server [puppet] - 10https://gerrit.wikimedia.org/r/594954 (owner: 10Dzahn) [14:02:00] (03PS1) 10ArielGlenn: Add link to Article Feedback dumps for downloaders [puppet] - 10https://gerrit.wikimedia.org/r/594961 (https://phabricator.wikimedia.org/T250715) [14:04:53] 10Operations, 10Wikidata, 10Patch-For-Review, 10User-notice, 10Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928 (10jcrespo) [14:05:30] (03PS15) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [14:06:12] (03CR) 10Alexandros Kosiaris: Explicitly install both myspell-pt-pt and myspell-pt-br (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/586456 (https://phabricator.wikimedia.org/T249559) (owner: 10Halfak) [14:06:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Explicitly install both myspell-pt-pt and myspell-pt-br [puppet] - 10https://gerrit.wikimedia.org/r/586456 (https://phabricator.wikimedia.org/T249559) (owner: 10Halfak) [14:06:30] (03CR) 10Hnowlan: [C: 03+2] changeprop: fix scoping issues in templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/594921 (owner: 10Hnowlan) [14:06:34] !log imported component/facter3 for jessie-wikimedia into "main" [14:06:35] (03CR) 10jerkins-bot: [V: 04-1] ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) (owner: 10Vgutierrez) [14:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:52] (03Merged) 10jenkins-bot: changeprop: fix scoping issues in templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/594921 (owner: 10Hnowlan) [14:06:55] FFS [14:07:24] 10Operations: debian-installer: partman doesn't allow lvm LVs to be reused when reimaging - https://phabricator.wikimedia.org/T252027 (10Kormat) After a lot more 
investigation, this is bordering on infeasible. It would require re-implementing a Lot of partman in order to make it work. Some notes: - partman is a... [14:07:26] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:28] (03PS2) 10JMeybohm: wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) [14:07:41] (03PS1) 10Elukey: role::statistics::explorer: use kerberos for Search Timers [puppet] - 10https://gerrit.wikimedia.org/r/594963 [14:08:09] (03PS16) 10Vgutierrez: ATS: Add atstls.mtail program [puppet] - 10https://gerrit.wikimedia.org/r/594677 (https://phabricator.wikimedia.org/T244538) [14:09:24] (03CR) 10Elukey: [C: 03+2] role::statistics::explorer: use kerberos for Search Timers [puppet] - 10https://gerrit.wikimedia.org/r/594963 (owner: 10Elukey) [14:10:03] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:33] (03PS1) 10Hnowlan: changeprop: package new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/594964 [14:11:21] (03PS1) 10Jbond: ferm-status: use ClassName for staticmethods [puppet] - 10https://gerrit.wikimedia.org/r/594965 [14:11:39] RECOVERY - Check that envoy is running on idp-test2001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [14:16:01] (03CR) 10Hnowlan: [C: 03+2] changeprop: package new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/594964 (owner: 10Hnowlan) [14:16:19] (03Merged) 10jenkins-bot: changeprop: package new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/594964 (owner: 10Hnowlan) [14:17:31] (03CR) 10Jbond: [C: 03+2] ferm-status: use ClassName for staticmethods 
[puppet] - 10https://gerrit.wikimedia.org/r/594965 (owner: 10Jbond) [14:17:32] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:20:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Feel free to merge, this should be a noop" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:21:39] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10Cmjohnson) [14:21:54] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops: (Need by: TBD) rack/setup/install an-druid100[12] and druid100[78] - https://phabricator.wikimedia.org/T245569 (10Cmjohnson) 05Open→03Resolved [14:22:42] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Better handling for one-hit-wonder objects - https://phabricator.wikimedia.org/T144187 (10ema) I have tested the exp admission policy based on probability exponentially decreasing with object size on cp2027 (cache_text) and cp2028 (ca... [14:24:43] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1007.eqiad.wmnet ` The log can be fou... 
[14:25:28] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1008.eqiad.wmnet ` The log can be fou... [14:25:30] (03PS2) 10Ema: vcl: set large_objects_cutoff in hiera [puppet] - 10https://gerrit.wikimedia.org/r/594896 (https://phabricator.wikimedia.org/T249809) [14:26:14] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1009.eqiad.wmnet ` The log can be fou... [14:27:02] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10Gehel) [14:28:12] 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10Gehel) [14:28:45] (03Abandoned) 10Dzahn: icinga/mediawiki: update jobrunner monitoring, add command to use a POST request [puppet] - 10https://gerrit.wikimedia.org/r/566374 (https://phabricator.wikimedia.org/T243096) (owner: 10Dzahn) [14:29:46] (03CR) 10Ema: [C: 03+2] vcl: set large_objects_cutoff in hiera [puppet] - 10https://gerrit.wikimedia.org/r/594896 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [14:30:02] (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:30:09] (03CR) 10jerkins-bot: [V: 04-1] wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 
(https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:30:40] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [14:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:54] (03PS1) 10Jbond: profile::idp: add devices dir to tomcat readwrite paths [puppet] - 10https://gerrit.wikimedia.org/r/594966 [14:33:19] (03CR) 10Jbond: [C: 03+2] profile::idp: add devices dir to tomcat readwrite paths [puppet] - 10https://gerrit.wikimedia.org/r/594966 (owner: 10Jbond) [14:33:25] (03PS3) 10JMeybohm: wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) [14:35:15] (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Add TLS termination support [deployment-charts] - 10https://gerrit.wikimedia.org/r/594924 (https://phabricator.wikimedia.org/T235411) (owner: 10JMeybohm) [14:35:48] (03PS1) 10Jbond: idp: remove trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/594968 [14:35:56] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1010.eqiad.wmnet ` The log can be fou... 
[14:36:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: remove trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/594968 (owner: 10Jbond) [14:36:36] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:21] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:51] (03PS1) 10Ema: vcl: move exp admission policy settings to hiera [puppet] - 10https://gerrit.wikimedia.org/r/594969 (https://phabricator.wikimedia.org/T144187) [14:39:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:38] cmjohnson1: FYI some of the downtimes might fail as they are multiple reimages at the same time and the time to run puppet on the icinga host is very long. Keep an eye on the END (PASS|FAIL) of the downtime cookbook here [14:40:02] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1011.eqiad.wmnet ` The log can be fou... [14:40:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:40:31] ^ Telia is still working on it [14:40:34] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
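The advice above to keep an eye on the END (PASS|FAIL) lines of the downtime cookbook can be sketched as a small log filter. This is only an illustration: it assumes the message layout seen in this channel log (`[HH:MM:SS] !log user@host END (STATUS) - Cookbook name (exit_code=N)`), and the names `END_RE`, `cookbook_results` and `failures` are hypothetical helpers, not part of any Wikimedia tooling.

```python
import re

# Assumed layout, copied from the messages in this log:
# [14:42:21] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
END_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log (?P<user>\S+)@(?P<host>\S+) "
    r"END \((?P<status>PASS|FAIL)\) - Cookbook (?P<cookbook>\S+) "
    r"\(exit_code=(?P<code>\d+)\)"
)


def cookbook_results(log_text):
    """Return one dict per 'END (PASS|FAIL)' cookbook message found in log_text."""
    return [m.groupdict() for m in END_RE.finditer(log_text)]


def failures(log_text, cookbook="sre.hosts.downtime"):
    """Return only the failed runs of the given cookbook."""
    return [
        r for r in cookbook_results(log_text)
        if r["cookbook"] == cookbook and r["status"] == "FAIL"
    ]


# Sample text taken from messages that appear in this log.
sample = (
    "[14:41:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) "
    "[14:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log "
    "[14:42:21] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)"
)

for r in failures(sample):
    print(r["time"], r["host"], r["cookbook"], "exit_code=" + r["code"])
```

Because the pattern is applied with `finditer` over the whole text, it does not matter whether the log is split one message per line or run together as it is here.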
[14:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:52] "Please note that the fault location is actually at Auburn and sorry for the confusion. " [14:41:08] ETA at 15:45 UTC today for the tech to be at the site. [14:41:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:47] PROBLEM - IPMI Sensor Status on dumpsdata1001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:42:09] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1009.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1009.eqiad.wmnet'] ` [14:42:21] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:22] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1009.eqiad.wmnet ` The log can be fou... [14:44:11] (03PS1) 10Ema: cp3051: set large_objects_cutoff to 384K [puppet] - 10https://gerrit.wikimedia.org/r/594971 (https://phabricator.wikimedia.org/T249809) [14:44:25] (03CR) 10Elukey: [C: 04-1] "> > 1) What use cases do you have in mind? 
Is it only for oozie jobs" [puppet] - 10https://gerrit.wikimedia.org/r/589320 (https://phabricator.wikimedia.org/T230743) (owner: 10Bearloga) [14:44:34] (03PS6) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) [14:45:05] PROBLEM - Host kubernetes1008 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:55] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:45:57] (03CR) 10Ema: "pcc lgtm: https://puppet-compiler.wmflabs.org/compiler1003/22396/" [puppet] - 10https://gerrit.wikimedia.org/r/594971 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [14:46:21] anybody working on kubernetes1008? [14:46:48] elukey: AFAICT is a reimage by cmjohnson1, see my last message above [14:47:01] ah okok nice [14:47:16] (03PS8) 10Gehel: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:47:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:47:51] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1008.eqiad.wmnet'] ` and were **ALL** successful. 
[14:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:09] (03CR) 10Gehel: sre.wdqs.data-transfer: manage ferm rules required for transfer (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:48:14] RECOVERY - Host kubernetes1008 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:48:16] (03CR) 10Cwhite: [C: 03+2] smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/588769 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [14:48:33] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master: Thu 7th May - https://phabricator.wikimedia.org/T251158 (10Agusbou2015) [14:49:05] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [14:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] (03CR) 10jerkins-bot: [V: 04-1] sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:50:18] !log imported component/puppet5 for stretch-wikimedia into "main" [14:50:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:08] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Tobias Andersson to the ldap/wmde group - https://phabricator.wikimedia.org/T251997 (10RStallman-legalteam) The NDA is signed and on file. Feel free to proceed with access. Thanks! 
[14:51:55] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:09] (03CR) 10Jbond: sre.wdqs.data-transfer: manage ferm rules required for transfer (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:54:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:10] (03PS1) 10Hnowlan: changeprop: add cpjobqueue configuration switching [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) [14:55:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [14:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Nikolaos Gkountas - https://phabricator.wikimedia.org/T252100 (10colewhite) p:05Triage→03Medium a:03colewhite [14:56:05] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1010.eqiad.wmnet'] ` and were **ALL** successful. 
[14:57:41] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:36] (03PS1) 10Hnowlan: changeprop: enable kafka replay of purge messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/594975 (https://phabricator.wikimedia.org/T248677) [14:58:38] (03PS1) 10Cwhite: admin: add Nikolaos Gkountas to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594974 (https://phabricator.wikimedia.org/T252100) [14:58:49] (03CR) 10Jbond: [C: 03+1] "lgtm, on commet but it really is just a gut feeling so feel free to ignore" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [14:59:31] (03CR) 10Gehel: sre.wdqs.data-transfer: manage ferm rules required for transfer (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:59:48] !log imported component/facter3 for stretch-wikimedia into "main" [14:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:13] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1011.eqiad.wmnet'] ` and were **ALL** successful. [15:00:54] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) Thanks! I've imported the RAID config, restored the boot order settings and will verify it's fixed. 
[15:01:33] (03CR) 10Ppchelko: [C: 04-1] changeprop: enable kafka replay of purge messages (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594975 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:02:25] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1009.eqiad.wmnet'] ` and were **ALL** successful. [15:02:45] (03PS5) 10Mvolz: Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) [15:02:51] (03PS3) 10Muehlenhoff: Remove component integration for Puppet 5 / Facter 3 on jessie/stretch [puppet] - 10https://gerrit.wikimedia.org/r/583028 [15:02:53] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1012.eqiad.wmnet ` The log can be fou... [15:03:11] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1013.eqiad.wmnet ` The log can be fou... 
[15:03:17] (03PS9) 10Gehel: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:03:24] (03PS2) 10Hnowlan: changeprop: enable kafka replay of purge messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/594975 (https://phabricator.wikimedia.org/T248677) [15:03:25] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [15:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:35] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1014.eqiad.wmnet ` The log can be fou... [15:03:55] (03CR) 10Ppchelko: [C: 03+2] changeprop: enable kafka replay of purge messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/594975 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:04:13] (03Merged) 10jenkins-bot: changeprop: enable kafka replay of purge messages [deployment-charts] - 10https://gerrit.wikimedia.org/r/594975 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [15:04:53] (03CR) 10jerkins-bot: [V: 04-1] sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:05:20] (03CR) 10Mvolz: [C: 03+2] Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) (owner: 10Mvolz) [15:06:25] (03PS4) 10Muehlenhoff: Remove component integration for Puppet 5 / Facter 3 on jessie/stretch [puppet] - 10https://gerrit.wikimedia.org/r/583028 [15:07:58] (03PS10) 10Jbond: sre.wdqs.data-transfer: manage ferm rules 
required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) [15:09:39] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:42] (03PS1) 10Muehlenhoff: Remove late-install d-i hack for Puppet 5 / Facter 3 [puppet] - 10https://gerrit.wikimedia.org/r/594977 [15:12:22] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:20] PROBLEM - Check the last execution of git_pull_charts on deploy1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:13:30] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:34] (03CR) 10Jbond: [C: 03+1] sre.wdqs.data-transfer: manage ferm rules required for transfer (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [15:13:39] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) [15:14:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [15:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:06] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [15:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:09] (03PS1) 10Cwhite: Revert "smart: add multiple hpsa controller support" [puppet] - 10https://gerrit.wikimedia.org/r/594979 [15:15:47] (03CR) 10Ema: [C: 03+2] cp3051: set large_objects_cutoff to 384K [puppet] - 10https://gerrit.wikimedia.org/r/594971 (https://phabricator.wikimedia.org/T249809) (owner: 10Ema) [15:16:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Looks ok to me and makes it a tad easier to deploy a chart so that's nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/594922 (owner: 10JMeybohm) [15:16:22] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/594771 (https://phabricator.wikimedia.org/T251572) (owner: 10Ryan Kemper) [15:17:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Probably the same for k8s-staging?" 
[puppet] - 10https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [15:19:14] (03PS5) 10Muehlenhoff: Remove component integration for Puppet 5 / Facter 3 on jessie/stretch [puppet] - 10https://gerrit.wikimedia.org/r/583028 [15:19:16] (03CR) 10Cwhite: [C: 03+2] Revert "smart: add multiple hpsa controller support" [puppet] - 10https://gerrit.wikimedia.org/r/594979 (owner: 10Cwhite) [15:19:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:59] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1012.eqiad.wmnet'] ` and were **ALL** successful. [15:26:08] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1013.eqiad.wmnet'] ` and were **ALL** successful. [15:26:39] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:57] !log rolling restart of ats-tls on text@esams - T249335 [15:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:59] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [15:28:41] !log mvolz@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . [15:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:11] akosiaris, Pchelolo: okay to deploy now? 
[15:29:30] mvolz: fine with me [15:34:12] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1014.eqiad.wmnet ` The log can be fou... [15:36:15] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/Collection/includes/Specials/SpecialCollection.php: T251460 Set skin on BaseTemplates if you are using getSkin (duration: 01m 08s) [15:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:18] T251460: QuickTemplate: PHP Notice: Undefined index: skin - https://phabricator.wikimedia.org/T251460 [15:39:38] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [15:48:25] (03PS1) 10Jbond: interactive: add get_secret function [software/spicerack] - 10https://gerrit.wikimedia.org/r/594988 [15:50:07] RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [15:50:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/594977 (owner: 10Muehlenhoff) [15:51:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/583028 (owner: 10Muehlenhoff) [15:51:48] !log mvolz@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[15:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:57] (03PS1) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/594989 (https://phabricator.wikimedia.org/T199236) [15:52:20] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Remove North Korea from data quality traffic entropy reports - https://phabricator.wikimedia.org/T251546 (10Milimetric) p:05Triage→03High [15:53:33] (03PS2) 10Cwhite: smart: add multiple hpsa controller support [puppet] - 10https://gerrit.wikimedia.org/r/594989 (https://phabricator.wikimedia.org/T199236) [15:57:21] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [15:59:31] !log mvolz@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' . [15:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1600). Please do the needful. [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:01:51] (03PS3) 10Jcrespo: mariadb: Enable read_only monitoring on misc hosts [puppet] - 10https://gerrit.wikimedia.org/r/594905 (https://phabricator.wikimedia.org/T172489) [16:02:48] (03Abandoned) 10Thcipriani: Gerrit 2.16.16 [software/gerrit] (deploy/wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/574092 (https://phabricator.wikimedia.org/T200739) (owner: 10Thcipriani) [16:02:53] (03PS2) 10Filippo Giunchedi: prometheus: add thanos sidecar to k8s instances [puppet] - 10https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) [16:03:00] (03CR) 10Filippo Giunchedi: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [16:03:49] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1014.eqiad.wmnet'] ` [16:04:06] (03PS1) 10Elukey: Increase JVM heap size for the Hadoop Yarn Nodemanagers [puppet] - 10https://gerrit.wikimedia.org/r/594992 [16:07:17] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10Cmjohnson) All but 1014 have been image, I think I have a bad network cable for 1014. I have scheduled a quick trip to the data center this afternoon to t... 
[16:07:50] (03PS6) 10Mvolz: Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) [16:08:37] (03CR) 10Mvolz: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) (owner: 10Mvolz) [16:09:01] (03Merged) 10jenkins-bot: Citoid: Update service-runner to 2.7.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/594513 (https://phabricator.wikimedia.org/T239459) (owner: 10Mvolz) [16:09:40] (03PS1) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [16:10:02] (03CR) 10jerkins-bot: [V: 04-1] WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [16:10:25] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... [16:10:39] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern - https://phabricator.wikimedia.org/T252129 (10Miriam) [16:11:16] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... 
[16:12:09] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: `... [16:20:59] 10Operations, 10cloud-services-team (Kanban): Migrate remaining self-hosted puppet masters to Puppet 5 / facter 3 - https://phabricator.wikimedia.org/T241719 (10Jdforrester-WMF) [16:22:56] (03PS2) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [16:23:16] akosiaris: this says this is merged, but the new version isn't showing up when I source .hfenv helmfile diff https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/594513/ [16:23:16] (03CR) 10jerkins-bot: [V: 04-1] WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [16:23:20] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@6c65779]: Enabling consumption of purges topic [16:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:32] (03PS2) 10Elukey: Increase JVM heap size for the Hadoop Yarn Nodemanagers [puppet] - 10https://gerrit.wikimedia.org/r/594992 [16:23:42] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1028.eqiad.wmnet'] ` Of which those **FAIL... 
[16:23:44] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@6c65779]: Enabling consumption of purges topic (duration: 00m 24s) [16:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:06] mvolz: /me looking [16:24:15] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@cd1386e]: Enabling consumption of purges topic [16:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:32] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1029.eqiad.wmnet'] ` Of which those **FAIL... [16:25:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:25] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1030.eqiad.wmnet'] ` Of which those **FAIL... [16:26:00] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@cd1386e]: Enabling consumption of purges topic (duration: 01m 45s) [16:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:08] hnowlan: I 'll have to undo your local staging/changeprop/values.yaml on deploy1001, it blocks git pull [16:26:12] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:54] akosiaris: agh, my bad. 
please do [16:27:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [16:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:48] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:42] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:59] mvolz: changes weren't pooled on the deployment server. That's fixed now [16:29:08] cool thanks [16:29:56] !log mvolz@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'citoid' for release 'staging' . [16:29:57] (03PS3) 10Elukey: Increase JVM heap size for the Hadoop Yarn Nodemanagers [puppet] - 10https://gerrit.wikimedia.org/r/594992 [16:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:14] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:42] (03PS2) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce support for selecting repository component [puppet] - 10https://gerrit.wikimedia.org/r/594945 (https://phabricator.wikimedia.org/T250866) [16:31:10] (03CR) 10Joal: [C: 03+1] "LGTM !" 
[puppet] - 10https://gerrit.wikimedia.org/r/594992 (owner: 10Elukey) [16:31:30] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) [16:32:06] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:25] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22398/an-worker1080.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/594992 (owner: 10Elukey) [16:32:27] (03CR) 10Elukey: [C: 03+2] Increase JVM heap size for the Hadoop Yarn Nodemanagers [puppet] - 10https://gerrit.wikimedia.org/r/594992 (owner: 10Elukey) [16:32:40] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) [16:36:10] 10Operations, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [16:36:46] !log mvolz@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'citoid' for release 'production' . [16:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:40] RECOVERY - Check the last execution of git_pull_charts on deploy1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:38:28] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10hashar) Must be a stale cache somewhere. It would help to have the headers. I haven't looked at t... 
[16:39:19] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident, 10cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10CDanis) Are these issues still ongoing? [16:41:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) ` HTTP/2 200 OK date: Thu, 07 May 2020 15:42:30 GMT server: Apache content-security-policy... [16:41:58] (03CR) 10Volans: "Thanks for the patch, really appreciated. Few general comments, we can chat about them offline:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/594988 (owner: 10Jbond) [16:42:59] !log mvolz@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'citoid' for release 'production' . [16:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:42] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10Papaul) [16:43:53] (03CR) 10Krinkle: "What matters to me is that the numbers add up correctly. Every req must be in one of the respective buckets. 
So it ends up in short, long " [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:45:48] !log hnowlan@deploy1001 Started deploy [changeprop/deploy@cd1386e]: Rollback varnish consumption [16:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:53] !log hnowlan@deploy1001 Finished deploy [changeprop/deploy@cd1386e]: Rollback varnish consumption (duration: 01m 05s) [16:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:56] (03CR) 10Krinkle: mtail: update varnishrls compatibility with rc35 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:48:02] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10Reedy) ` HTTP/2 200 OK date: Thu, 07 May 2020 16:46:02 GMT server: Apache content-security-policy... [16:48:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:49:29] (03CR) 10Krinkle: "@Filippo: I only see s-maxage and max-age, no maxage?" [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:49:58] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:51:26] (03CR) 10Cwhite: mtail: update varnishrls compatibility with rc35 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) (owner: 10Cwhite) [16:51:39] (03CR) 10Mvolz: "2.7.7 is now deployed for citoid." [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris) [17:00:04] halfak and accraze: How many deployers does it take to do Services – Graphoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1700). [17:03:31] (03PS1) 10BearND: wikifeeds: Update iOS survey announcement [deployment-charts] - 10https://gerrit.wikimedia.org/r/595002 (https://phabricator.wikimedia.org/T251839) [17:03:32] PROBLEM - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1007:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [17:03:43] (03CR) 10jerkins-bot: [V: 04-1] wikifeeds: Update iOS survey announcement [deployment-charts] - 10https://gerrit.wikimedia.org/r/595002 (https://phabricator.wikimedia.org/T251839) (owner: 10BearND) [17:04:09] (03CR) 10Nuria: [C: 03+1] Add analytics pageview-actors data purge [puppet] - 10https://gerrit.wikimedia.org/r/594933 (https://phabricator.wikimedia.org/T247344) (owner: 10Joal) [17:08:16] (03PS2) 10Cwhite: mtail: update varnishrls compatibility with rc35 [puppet] - 10https://gerrit.wikimedia.org/r/594316 (https://phabricator.wikimedia.org/T251466) [17:09:01] (03CR) 10Cwhite: [C: 03+2] smart: add multiple hpsa 
controller support [puppet] - 10https://gerrit.wikimedia.org/r/594989 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [17:09:31] 10Operations, 10Traffic: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) p:05Triage→03Medium [17:12:31] ACKNOWLEDGEMENT - Device not healthy -SMART- on labstore1007 is CRITICAL: cluster=wmcs device={cciss,14,cciss,15,cciss,16,cciss,17,cciss,18,cciss,19,cciss,20,cciss,21,cciss,22,cciss,23} instance=labstore1007:9100 job=node site=eqiad Bstorm Not sure why this alerted again. The smart monitor doesnt work with multiple shelves T199248 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-o [17:12:31] r=labstore1007&var-datasource=eqiad+prometheus/ops [17:18:23] (03PS3) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce support for selecting repository component [puppet] - 10https://gerrit.wikimedia.org/r/594945 (https://phabricator.wikimedia.org/T250866) [17:18:39] (03CR) 10BearND: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/595002 (https://phabricator.wikimedia.org/T251839) (owner: 10BearND) [17:19:07] no idea why CI for ^ failed earlier [17:20:37] ah, good. it worked this time [17:21:03] (03CR) 10BearND: [C: 03+2] wikifeeds: Update iOS survey announcement [deployment-charts] - 10https://gerrit.wikimedia.org/r/595002 (https://phabricator.wikimedia.org/T251839) (owner: 10BearND) [17:21:26] (03Merged) 10jenkins-bot: wikifeeds: Update iOS survey announcement [deployment-charts] - 10https://gerrit.wikimedia.org/r/595002 (https://phabricator.wikimedia.org/T251839) (owner: 10BearND) [17:22:21] (had some git issues earlier i guess. 
bad pack header) [17:23:04] SO suggests it ran out of memory on the server [17:25:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus: add thanos sidecar to k8s instances [puppet] - 10https://gerrit.wikimedia.org/r/594919 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [17:26:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: kubeadm: introduce support for selecting repository component [puppet] - 10https://gerrit.wikimedia.org/r/594945 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [17:27:30] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: kubeadm: introduce hiera support for selecting repo component [puppet] - 10https://gerrit.wikimedia.org/r/594926 (https://phabricator.wikimedia.org/T250866) (owner: 10Arturo Borrero Gonzalez) [17:27:47] (03Abandoned) 10Arturo Borrero Gonzalez: kubeadm: remove package_from_component define [puppet] - 10https://gerrit.wikimedia.org/r/594925 (https://phabricator.wikimedia.org/T251297) (owner: 10Arturo Borrero Gonzalez) [17:30:06] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident, 10cloud-services-team (Kanban): Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10DB111) In my case last time at 5 a.m. (UTC) of 04/12, so running fine for nearly one mon... 
[17:34:11] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [17:35:13] (03CR) 10Ryan Kemper: [C: 03+2] sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [17:37:05] (03PS4) 10Ryan Kemper: icinga: grant 'Ryan Kemper' access to web UI [puppet] - 10https://gerrit.wikimedia.org/r/594771 (https://phabricator.wikimedia.org/T251572) [17:38:45] !log otto@deploy1001 Started deploy [analytics/refinery@4a2c530]: (no justification provided) [17:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop: add cpjobqueue configuration switching (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:40:12] (03CR) 10Ryan Kemper: [C: 03+2] icinga: grant 'Ryan Kemper' access to web UI (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/594771 (https://phabricator.wikimedia.org/T251572) (owner: 10Ryan Kemper) [17:43:12] (03CR) 10Ppchelko: [C: 04-1] "This is so much better then all our prior attempts." (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:44:16] !log otto@deploy1001 Finished deploy [analytics/refinery@4a2c530]: (no justification provided) (duration: 05m 31s) [17:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:56] !log bsitzmann@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [17:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:22] (03CR) 10Ppchelko: [C: 04-1] "Also, maybe _config.yaml should be renamed into _changeprop.yaml?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/594973 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [17:46:56] !log bsitzmann@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [17:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Awesome, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris) [17:50:11] !log bsitzmann@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [17:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:52] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) [17:51:03] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) Icinga access is working for me [17:56:07] (03PS1) 10Jdlrobson: Update project icons to refreshed SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) [17:56:25] (03PS3) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [17:56:47] (03CR) 10jerkins-bot: [V: 04-1] WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [17:56:54] (03CR) 10jerkins-bot: [V: 04-1] Update project icons to refreshed SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [17:57:33] 
RECOVERY - Device not healthy -SMART- on labstore1007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1007&var-datasource=eqiad+prometheus/ops [18:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1800). [18:00:04] Tchanders: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:01:04] (03PS2) 10Jdlrobson: Update project icons to refreshed SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) [18:01:06] (03PS11) 10Ryan Kemper: sre.wdqs.data-transfer: manage ferm rules required for transfer [cookbooks] - 10https://gerrit.wikimedia.org/r/589289 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [18:01:36] o/ i'm also in SWAT. I guess I added too late to get a jouncebot call out [18:01:36] Tchanders: I'm here, but feel free to deploy that yourself :-) [18:02:13] 10Operations, 10observability, 10User-fgiunchedi: Handle SMART for multiple shelves and controllers - https://phabricator.wikimedia.org/T199236 (10colewhite) 05Open→03Resolved Deployed multiple hpsa controller support and things are looking good. Will continue to monitor over the next few days. [18:02:16] (03PS4) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [18:02:18] 10Operations, 10observability, 10Epic, 10User-fgiunchedi: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 (10colewhite) [18:03:38] Urbanecm: Would you mind doing it? I haven't actually done one yet! 
[18:03:48] Tchanders: ah, I see. Sure, I can do it for you! [18:03:54] Jdlrobson: noted, I can do yours as well [18:03:55] (03PS5) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [18:04:00] Urbanecm: Thanks, I appreciate it [18:04:12] (03PS4) 10Urbanecm: Add the investigate right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [18:04:18] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [18:04:20] thanks Urbanecm :) [18:05:01] (03CR) 10jerkins-bot: [V: 04-1] Add the investigate right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [18:05:55] (03CR) 10Urbanecm: [C: 03+2] "SWAT, re-+2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [18:06:03] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Datasets-General-or-Unknown, 10Patch-For-Review: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Bstorm) [18:06:07] (03CR) 10Urbanecm: [C: 03+2] "SWAT, re-+2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [18:06:30] (03Merged) 10jenkins-bot: Add the investigate right to the checkuser group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594881 (https://phabricator.wikimedia.org/T251932) (owner: 10JJMC89) [18:06:42] Jdlrobson: Thanks for the offer! 
[18:07:13] Tchanders: syncing [18:07:26] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [18:07:40] (03PS3) 10Urbanecm: Update project icons to refreshed SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [18:07:43] (03CR) 10Urbanecm: Update project icons to refreshed SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [18:07:47] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [18:08:19] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 54bd2f1: Add the investigate right to the checkuser group on testwiki (T251932) (duration: 01m 08s) [18:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:22] T251932: Lower access restriction for Special:Investigate on testwiki [Small] - https://phabricator.wikimedia.org/T251932 [18:08:29] (03Merged) 10jenkins-bot: Update project icons to refreshed SVGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595009 (https://phabricator.wikimedia.org/T249047) (owner: 10Jdlrobson) [18:08:57] Tchanders: done. I see that change at https://test.wikipedia.org/wiki/Special:ListGroupRights now. Please let me know if it doesn't work for some weird reason :-) [18:09:25] (03PS6) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [18:09:26] Jdlrobson: yours is on mwdebug1001, please let me know :). [18:09:36] Urbanecm: We can't test as we're not CheckUsers, but looks great - thanks [18:09:57] (We being AHT) [18:10:00] Urbanecm: on it [18:10:04] Tchanders: ah, gotcha :). 
[18:10:26] (03CR) 10jerkins-bot: [V: 04-1] WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [18:11:33] Urbanecm: LGTM! [18:11:38] thanks,. syncing! [18:11:58] (03PS7) 10Rush: WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) [18:12:58] (03CR) 10jerkins-bot: [V: 04-1] WIP: peek: security team PM tooling [puppet] - 10https://gerrit.wikimedia.org/r/594993 (https://phabricator.wikimedia.org/T251784) (owner: 10Rush) [18:13:25] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: SWAT: 899c175: Update project icons to refreshed SVGs (T249047; part 1/2) (duration: 01m 08s) [18:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:28] T249047: [Site Config] Make new logos available in production in preparation for T246170 - https://phabricator.wikimedia.org/T249047 [18:14:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 899c175: Update project icons to refreshed SVGs (T249047; part 2/2) (duration: 01m 06s) [18:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:00] Jdlrobson: done [18:15:27] thanks Urbanecm ! [18:15:32] happy to help! [18:15:35] !log Morning SWAT done [18:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:16] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10herron) [18:17:50] (03CR) 10Krinkle: "For future ref, which modes were enabled, and which strength/optim level? (esp for Zopfli)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [18:18:30] (03PS1) 10Jdlrobson: WIP: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) [18:19:18] (03CR) 10jerkins-bot: [V: 04-1] WIP: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) (owner: 10Jdlrobson) [18:23:46] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@383fba5]: Enable both purging types T252142 [18:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:49] T252142: [Bug] mobile-html and summary (possibly other) endpoints not getting updated after edits - https://phabricator.wikimedia.org/T252142 [18:25:03] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@383fba5]: Enable both purging types T252142 (duration: 01m 17s) [18:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:07] (03CR) 10Gilles: "This was done with every available compressor enabled, and the maximum optimisation setting ("insane"). 
It took a while to process it all " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/594943 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [18:31:42] (03PS1) 10Mstyles: Update ML models [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595019 [18:42:21] (03PS1) 10Gehel: wdqs: remove wdqs200[78] from "role(insetup)" [puppet] - 10https://gerrit.wikimedia.org/r/595022 [18:42:46] (03CR) 10jerkins-bot: [V: 04-1] wdqs: remove wdqs200[78] from "role(insetup)" [puppet] - 10https://gerrit.wikimedia.org/r/595022 (owner: 10Gehel) [18:43:56] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10Papaul) [18:46:00] (03PS2) 10Gehel: wdqs: remove wdqs200[78] from "role(insetup)" [puppet] - 10https://gerrit.wikimedia.org/r/595022 [18:47:41] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [18:48:19] (03CR) 10Ottomata: [C: 03+1] Increase JVM heap size for the Hadoop Yarn Nodemanagers [puppet] - 10https://gerrit.wikimedia.org/r/594992 (owner: 10Elukey) [18:48:46] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) @Marostegui thanks for updating the operations/puppet update portion. [18:52:06] (03PS1) 10Ottomata: Add eventlogging_Test to wgEventStreams config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595025 (https://phabricator.wikimedia.org/T238230) [19:00:05] brennen and hashar: Your horoscope predicts another unfortunate Mediawiki train - American+European Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T1900). 
[19:01:00] (03CR) 10Ottomata: [C: 03+2] Add eventlogging_Test to wgEventStreams config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595025 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:02:13] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595029 [19:02:15] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595029 (owner: 10Brennen Bearnes) [19:02:54] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595029 (owner: 10Brennen Bearnes) [19:04:03] (03PS2) 10Cwhite: admin: add jgiannelos to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) [19:05:22] (03CR) 10Cwhite: [C: 03+2] admin: add jgiannelos to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/594757 (https://phabricator.wikimedia.org/T251899) (owner: 10Cwhite) [19:05:23] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31 [19:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:55] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add `Jgiannelos` to `wmf` LDAP group - https://phabricator.wikimedia.org/T251899 (10colewhite) 05Open→03Resolved Group membership updated. Please feel free to reopen if you encounter any related issue. [19:08:22] i'm seeing a spike of "Main slot of revision ... not found in database!", rolling this back unless it subsides momentarily. 
[19:09:18] !log Upgrade Routinator 3000 to 0.7.0 on rpki1001 - T252010 [19:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:21] T252010: Upgrade Routinator 3000 to 0.7.0 - https://phabricator.wikimedia.org/T252010 [19:09:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:09:50] !log rolling 1.35.0-wmf.31 back to group1 [19:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:58] 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10colewhite) p:05Triage→03Medium [19:12:53] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.35.0-wmf.30 [19:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:16:03] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.35.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595030 (https://phabricator.wikimedia.org/T249963) [19:16:05] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.35.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595030 (https://phabricator.wikimedia.org/T249963) (owner: 10Brennen Bearnes) [19:16:43] (03Merged) 10jenkins-bot: Revert "all wikis to 1.35.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595030 (https://phabricator.wikimedia.org/T249963) (owner: 10Brennen Bearnes) [19:18:33] (03CR) 10Gehel: [C: 04-1] "don't merge yet, I'll do that together with Ryan" [puppet] - 10https://gerrit.wikimedia.org/r/595022 (owner: 10Gehel) [19:22:40] 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service: Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10thcipriani) The `scap::dsh::groups` hiera configuration variable is capable of populating dsh group files from conftoo... [19:26:10] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) Thank you @faidon for the additional context and background behind the technical pieces of this. This would exp... 
[19:37:00] (03PS1) 10Ottomata: Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) [19:38:05] (03CR) 10jerkins-bot: [V: 04-1] Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:39:24] (03CR) 10Ottomata: "This will not cause EventLogging to start producing these streams to EventGate, it will just allow it to do so. To switch over an eventlo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:40:25] (03PS2) 10Ottomata: Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) [19:43:27] (03CR) 10Ottomata: [C: 03+2] Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:43:47] 10Operations, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10jgleeson) [19:44:18] (03Merged) 10jenkins-bot: Set wgEventLoggingStreamNames with initial streams EventLogging is allowed to produce [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:44:20] (03CR) 10Ottomata: "Merging to get in beta. Nothing is using this in Prod yet, although it will cause EventLogging to start requesting stream config. 
Will te" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595032 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:44:24] (03PS1) 10Bstorm: toolforge: fix docker imagebuilder for the new puppet structure [puppet] - 10https://gerrit.wikimedia.org/r/595033 (https://phabricator.wikimedia.org/T251297) [19:49:37] (03CR) 10Bstorm: "Adding CC so it is on your radar Arturo! This profile uses the kubeadm repos as well." [puppet] - 10https://gerrit.wikimedia.org/r/595033 (https://phabricator.wikimedia.org/T251297) (owner: 10Bstorm) [19:49:50] (03CR) 10Bstorm: [C: 03+2] toolforge: fix docker imagebuilder for the new puppet structure [puppet] - 10https://gerrit.wikimedia.org/r/595033 (https://phabricator.wikimedia.org/T251297) (owner: 10Bstorm) [20:10:07] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingStreamNames: set initial stream names, as yet unused - T238230 (duration: 01m 07s) [20:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:11] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [20:13:01] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Caching of https://doc.wikimedia.org/cover/mediawiki-libs-IPUtils/IPUtils.php.html is inconsistent - https://phabricator.wikimedia.org/T252131 (10colewhite) p:05Triage→03Medium [20:13:11] thanks for quick check DannyS712. have CCed a few more folks on that ticket. 
[20:14:12] no problem [20:14:18] Sorry for so many breakages [20:15:25] 10Operations, 10SRE-Access-Requests: Access to analytics-privatedata-users for Research intern - https://phabricator.wikimedia.org/T252129 (10colewhite) p:05Triage→03Medium a:03colewhite [20:16:39] (03PS1) 10EBernhardson: [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 [20:17:40] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (owner: 10EBernhardson) [20:19:32] (03PS1) 10Bstorm: toolforge: another fixup for the kubeadm refactor [puppet] - 10https://gerrit.wikimedia.org/r/595043 (https://phabricator.wikimedia.org/T251297) [20:22:07] @shdubsh - I wonder if this could be related to this week's train or a recent SWAT deployment? https://phabricator.wikimedia.org/T252165 [20:24:26] (03PS1) 10Ottomata: Set wgEventLoggingServiceUri in beta and production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595047 (https://phabricator.wikimedia.org/T238230) [20:24:50] Although I haven't seen it happening on English Wikipedia yet. 
[20:25:12] kaldari: That's a dupe [20:25:17] (03CR) 10jerkins-bot: [V: 04-1] Set wgEventLoggingServiceUri in beta and production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595047 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:25:30] Oh good [20:25:48] I swear I searched first :P [20:26:39] Which is seemingly a dupe of another too [20:29:44] (03PS2) 10Ottomata: Set wgEventLoggingServiceUri in beta and production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595047 (https://phabricator.wikimedia.org/T238230) [20:29:49] (03CR) 10Bstorm: "Actually checking PCC this time looks good :) https://puppet-compiler.wmflabs.org/compiler1002/22400/" [puppet] - 10https://gerrit.wikimedia.org/r/595043 (https://phabricator.wikimedia.org/T251297) (owner: 10Bstorm) [20:30:32] (03CR) 10jerkins-bot: [V: 04-1] Set wgEventLoggingServiceUri in beta and production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595047 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:31:01] Reedy: the problem may be present in wmf.31? [20:31:20] (03PS3) 10Ottomata: Set wgEventLoggingServiceUri in beta and production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595047 (https://phabricator.wikimedia.org/T238230) [20:33:22] shdubsh: I'm still seeing the problem throughout group 1 wikis, even though they were rolled back (but not in group 2 wikis). [20:33:45] (03PS2) 10EBernhardson: [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 [20:33:57] but there may be caching issues. 
[20:34:33] kaldari: right, which this issue appears to be associated with wmf.32 which isn't scheduled to deploy until the week of the 11th [20:34:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Role for SDoC WDQS [puppet] - 10https://gerrit.wikimedia.org/r/595041 (owner: 10EBernhardson) [20:35:32] if I'm reading the schedule correctly, wmf.31 should on group0 and group1 [20:35:44] schedule != reality [20:35:45] https://versions.toolforge.org/ [20:35:46] :) [20:36:39] normally, but in this case schedule == reality? [20:37:02] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [20:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:05] (03CR) 10Ottomata: [C: 03+2] Set wgEventLoggingServiceUri in beta and production on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595047 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:37:20] sure, but I wouldn't usually go by what the schedule says [20:37:23] much easier to check the facts ;) [20:37:31] So are we currently in the process of rolling back group 1 wikis to wmf.30? https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/594822/ [20:37:56] kaldari: best ask someone from releng who is running the train ;) [20:37:56] * shdubsh bookmarks link [20:38:04] good point :) [20:38:25] hello. :) [20:39:02] Wouldn't this mean that T252079 should have parent of wmf.31? [20:39:02] T252079: mw.wikibase.getLabelByLang('Q1','en') returning nil today - https://phabricator.wikimedia.org/T252079 [20:39:34] kaldari, shdubsh: the wmf.31 train is currently stalled on group1 due to T252156 [20:39:35] T252156: Increase in "Main slot of revision [number] not found in database!" 
after deploy of 1.35.0-wmf.31 to all wikis - https://phabricator.wikimedia.org/T252156 [20:40:17] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:13] brennen: Correct me if I'm wrong, but I think what kaldari has found is that wmf.31 also presents the symptoms of T252079. [20:41:41] correct, and I think brennen was working on a rollback due to that, right? [20:42:08] please see history on train blocker task: https://phabricator.wikimedia.org/T249963 [20:42:51] group1 was rolled back to .30 overnight, waiting a fix for T252079, and rolled forward after that issue was (believed to be) resolved. [20:44:12] (or rather, believed to be resolved on .31, at least.) [20:44:14] brennen - cool, looks like you're on top of it. Just want to make sure this gets rolled back quickly, as it affects probably tens of thousands of pages at least, some of which are currently showing big red Lua errors. [20:45:07] kaldari: sorry, can i get some clarification that that issue is not in fact resolved on the .31 branch? [20:45:43] @brennen T252156 fix should be ready soon - https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Scribunto/+/595049/ is ready for review [20:45:44] T252156: Increase in "Main slot of revision [number] not found in database!" after deploy of 1.35.0-wmf.31 to all wikis - https://phabricator.wikimedia.org/T252156 [20:46:28] addshore should know [20:46:40] * addshore reads up [20:47:03] (03PS1) 10Cwhite: admin: add Tobias Andersson to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/595051 (https://phabricator.wikimedia.org/T251997) [20:47:15] addshore, do we need to roll back wmf.31 on all the group 1 and group 0 wikis due to the wikidata Lua bug? 
[20:47:28] https://phabricator.wikimedia.org/T252079 is fixed on the deployed branch [20:48:05] the train rolled back yesterday because of it, and earlier today we backported a fix [20:48:05] so the remaining instances are due to caching perhaps? [20:48:29] Yes things may be wrong in the parser cache, if a purge doesnt work though, then it will be something else perhaps [20:48:33] backport is only to wmf.31, right? group2 is still at wmf.30 [20:48:45] addshore thanks! [20:48:46] DannyS712: the bug does not exist on .30 [20:49:00] DannyS712: let me double check that though (but I guess we would have spotted it last week otherwise) [20:49:16] brennen nevermind, the bug is persisting due to parser caching, but the underlying issue is fixed in the deployed version of wmf.31. [20:49:27] kaldari: thanks for clarifying. [20:49:29] yes, first seen in .31 [20:51:38] Hey all - would like to deploy a quick security update to PS.php if nobody's in there rn... [20:51:56] sbassett: urgent? otherwise please hold. [20:52:11] brennan: it can wait [20:52:18] brennen, even :) [20:52:26] addshore FWIW, I'm still seeing the issue after purging affected pages on Commons and Wikisource, e.g.: https://commons.wikimedia.org/wiki/Category:Elvis_Presley [20:52:49] sbassett: thanks, juggling a bit of stuff right now, should be sorted shortly. [20:52:53] * addshore looks [20:53:04] no problem [20:53:05] but some other instances on meta seem to be working now. [20:53:11] kaldari: any idea which module that is coming from? [20:53:50] Module:Infobox, which is probably pulling some some other module [20:54:53] a simpler case is probably https://en.wikisource.org/wiki/Module:Edition on English Wikisource. [20:55:09] It tries to pull in the labels for badges [20:55:25] hmm, okay it is definitely some cache somewhere, as I can confirm =mw.wikibase.getLabelByLang('Q545360','en') works [20:55:44] cool, I'll take your word for it :) [20:56:17] Amir1: ^^ any thoughts on what cache? 
[20:56:22] and why purging doesnt fix it? [21:00:12] kaldari: I suspect it is also cached in memcached [21:00:27] just trying to confirm.... [21:02:05] Question: should zuul be running tests for https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Scribunto/+/595049/ for both PS3 and PS5 at the same time right now? [21:02:25] kaldari: which would be a 24 hour cache ttl [21:03:02] got it [21:03:17] sorry, this cache layer was only added in feb so it didnt spring to mind! [21:04:07] That would also explain why it takes so long for updates in Wikidata to show up on those Commons infoboxes. [21:04:40] well, updates should be fine and should appear, infact if you go and edit one of these wikidata entities, the cache key will change and the new correct value will be fetched [21:05:19] but before the train was rolled back, any lookups that results in nil values will have been cached with the 24 hour ttl. The revid of the wikidata entity is part of the cache key [21:06:00] looking at when the rollback happened, that should mean everything will cleanup in the coming 5 hours [21:06:11] if there is a list of the bad entities I can add a flood flag and make null edits [21:06:37] that list would be pretty impossible to come up with [21:06:46] nvm then [21:07:12] as it is every entity that has been used in a page render in group1 for a 5 hour period :P [21:07:17] that also was not already cached xD [21:07:22] testing this now... 
[21:09:20] yup, so i just fixed native language on https://commons.wikimedia.org/wiki/Category:Elvis_Presley [21:09:41] with a not so subtle edit and revert https://www.wikidata.org/w/index.php?title=Q1860&type=revision&diff=1176090388&oldid=1166356952 [21:09:54] addshore So I just made this change on wikidata, https://www.wikidata.org/w/index.php?title=Q26657&type=revision&diff=1176090113&oldid=1175242731, and refreshed the page cache on Commons, but the infobox didn't update: https://commons.wikimedia.org/wiki/Category:Regulus_regulus (See iNaturalist taxon ID). [21:10:11] Last time I tried this, I noticed it took over an hour to update. [21:11:17] I see the "iNaturalist taxon ID" thing [21:11:40] It should be 117099 now instead of 793469 on Commons [21:11:52] https://usercontent.irccloud-cdn.com/file/nAuF93O3/image.png [21:12:16] so, thats the change dispatching process, which ends up triggering the page purge [21:12:19] ha, well it works for you! [21:12:34] https://grafana.wikimedia.org/d/000000156/wikidata-dispatch?orgId=1 [21:12:44] and now it's changed for me as well [21:12:52] I just had to make you look at it! [21:12:58] generally the wiki gets told the page should be updated in 30-60seconds, but the job to do the rest can get a bit stuck in the job queue [21:13:08] makes sense [21:13:25] this process did used to have a lot more lag in it [21:14:37] I'll go notify some of the projects that it could persist for up to 24 hours and they can manually force an update if they absolutely have to. 
[21:14:52] We can probably come up with a mechanism to increment the cache keys for this term cache for situations like this, so that we can resolve it a bit quicker [21:15:02] should only be ~5 hours more now [21:15:07] oh good [21:15:12] for most things [21:15:27] i guess group0 didn't roll back and in theory it would have been polluting the same cache [21:15:38] but i bet that's not very many entries [21:16:13] I'll write something on the ticket too [21:16:18] For the revision errors, I've filed T252170 to mitigate future issues [21:16:23] T252170: Expand code coverage in the includes/Revision/ directory - https://phabricator.wikimedia.org/T252170 [21:21:54] kaldari: i wrote https://phabricator.wikimedia.org/T252079#6117724 [21:22:06] kaldari: thanks for noticing that it was persisting! [21:24:25] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) 05Open→03Resolved The virtual drive rebuild process was MUCH faster, the firmware upgrades completed successfully and all drives have r... [21:24:28] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10JHedden) [21:30:53] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:08] ^ This was caused by me, this is expected and I'm working on putting the maintenance window on the instance [21:34:29] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:34:29] is there an equivalent to hideDeprecated for wfWarn?
[21:35:44] Also, backport for the Revision UBN: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Scribunto/+/595054/ [21:42:11] (03CR) 10Bstorm: [C: 03+2] toolforge: another fixup for the kubeadm refactor [puppet] - 10https://gerrit.wikimedia.org/r/595043 (https://phabricator.wikimedia.org/T251297) (owner: 10Bstorm) [21:42:39] (03CR) 10Bstorm: "Going ahead and merging because I think the PCC actually said what I needed to know?" [puppet] - 10https://gerrit.wikimedia.org/r/595043 (https://phabricator.wikimedia.org/T251297) (owner: 10Bstorm) [21:59:16] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work): SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) [22:03:14] DannyS712: patch is on mwdebug1001, assuming testable... [22:04:41] looking [22:05:15] not sure how I should be testing - what is the user facing impact? [22:05:57] yeah, good question. i guess my understanding is that edits were failing, but i don't know how to know which ones. [22:06:11] okay, let me try something [22:07:15] tried to reproduce at ruwiki, which is the url you posted, but thats now at wmf.30 [22:07:23] hrm, yeah. [22:07:56] not sure what else to do [22:08:25] https://test2.wikipedia.org/wiki/Main_Page ? [22:08:50] (I can't edit the main page, its protected) what am I looking for? [22:08:52] It was a job\ [22:15:23] ah, right. deferred update -> lua... yeah, ok, i have no idea how to trigger. [22:16:53] Is this the wikidata one? Or the category change job? [22:19:45] Reedy: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/595054/ https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/595049/ [22:22:15] Yeah, CategoryMembershipChangeJob [22:22:35] I guess adding/removing categories? 
[22:22:50] (03PS1) 10Ryan Kemper: icinga: Add rkemper to wdqs-admins, sms [puppet] - 10https://gerrit.wikimedia.org/r/595059 (https://phabricator.wikimedia.org/T251572) [22:22:52] But I don't know if that'll be enough [22:23:00] I've been adding categories on wikinews (writing https://en.wikinews.org/wiki/United_States_Supreme_Court_overturns_Bridgegate_convictions_in_Kelly_v._United_States) - anything in the logs? [22:24:27] That's not going to work is it? [22:24:32] The code is failing on the job runners [22:24:43] Which aren't running the latest code, because it's not been deployed everywhere [22:25:07] so, what can I do to help test? [22:25:21] i think possibly we roll forward and see what happens. [22:25:31] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Discovery-Search (Current work), 10Patch-For-Review: SRE Onboarding - Ryan Kemper, Search Platform team - https://phabricator.wikimedia.org/T251572 (10RKemper) QUESTIONS to get answered tomorrow Initial context: These questions are around s... [22:25:37] I would suggest if it's code run on job runners, yeah as brennen days [22:25:43] Deploy it and watch the logs [22:25:46] s/days/says/ [22:25:48] yep, going ahead. [22:25:57] will be prepared to roll back quickly. [22:26:28] Is there anything specific I can do to generate more log entries? Delete a category? [22:27:00] Based on the logstash graph [22:27:03] Just wait a little bit [22:27:19] It's not clear from the stack what change was actually being made [22:27:40] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/595059 (https://phabricator.wikimedia.org/T251572) (owner: 10Ryan Kemper) [22:27:48] brennen: https://phabricator.wikimedia.org/T172480 [22:27:59] heh [22:31:19] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.31/extensions/Scribunto/includes/engines/LuaCommon/TitleLibrary.php: [[gerrit:595054|Handle RevisionAccessException with try-catch (T252156)]] (duration: 01m 08s) [22:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:24] T252156: Increase in "Main slot of revision [number] not found in database!" after deploy of 1.35.0-wmf.31 to all wikis - https://phabricator.wikimedia.org/T252156 [22:33:25] (03PS1) 10Brennen Bearnes: all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595060 [22:33:27] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595060 (owner: 10Brennen Bearnes) [22:34:05] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595060 (owner: 10Brennen Bearnes) [22:35:35] (03PS1) 10Ryan Kemper: sre.wdqs.data-transfer: fix missing commas [cookbooks] - 10https://gerrit.wikimedia.org/r/595061 [22:35:59] (03PS2) 10Ryan Kemper: sre.wdqs.data-transfer: fix missing commas [cookbooks] - 10https://gerrit.wikimedia.org/r/595061 (https://phabricator.wikimedia.org/T206951) [22:36:13] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.31 [22:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:20] DannyS712: so far so quiet. [22:37:49] glad to hear; I figured reviewers would be hesitant to +2 any more Revision patches until the latest UBN was closed :) [22:39:20] hrm [22:39:22] .31 e/F/i/C/AbstractCollection:147 Revisions for vlxs575yheads0h7 could not be found [22:39:35] ring any bells?
[22:39:45] extension/Flow/ something [22:39:46] looking [22:40:18] extensions/Flow/includes/Collection/AbstractCollection.php:147 [22:40:37] if this is new i'll roll back and call it on the train for today / this week. [22:40:52] don't think it was me? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Flow/ shows nothing since https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Flow/+/424e4e24d5ce7e5cb649a7bf15420c98d5f9de6d of mine [22:41:38] Plus that exception isn't the one we just added - getAllRevisions throws it itself [22:43:09] yeah, may be an existing thing. [22:43:45] (PS1) Bstorm: jessie: try to make jessie containers build at least one more time [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/595062 (https://phabricator.wikimedia.org/T197930) [22:48:07] i am seeing some "main slot of revision [n] not found in database" trickling in, but those may be the pre-existing bug. [22:49:25] indeed - there was no logging previously [22:50:14] Should we add logging for Revision::getSize and getContent now? (and getSha1, though that is already hard deprecated)? [22:54:48] Question: when is the planned cut of 1.35/1.36? [22:55:27] DannyS712: i think that is to-be-determined. [22:55:50] re: logging for getSize and getContent, i am probably not qualified to answer that. [22:56:04] okay; one of the reasons for the UBNs is probably trying to move quickly to fully hard deprecate the Revision class before the branch is cut, so it can be removed in 1.36 [23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200507T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:12:14] (PS1) Alexandros Kosiaris: redis::instance: Deduplicate base_settings [puppet] - https://gerrit.wikimedia.org/r/595064 [23:12:16] (PS1) Alexandros Kosiaris: redis: Allow override of instance config [puppet] - https://gerrit.wikimedia.org/r/595065 [23:12:18] (PS1) Alexandros Kosiaris: ores: Pass the current config to redis::misc [puppet] - https://gerrit.wikimedia.org/r/595066 [23:18:55] (PS2) Alexandros Kosiaris: ores: Pass the current config to redis::misc [puppet] - https://gerrit.wikimedia.org/r/595066 [23:54:03] @brennen please see T252179 for my latest mistake that went out with the branch [23:54:03] T252179: Edits saved via PageUpdater need autopatrol status set - https://phabricator.wikimedia.org/T252179 [23:54:09] patches coming now [23:55:19] DannyS712: looking [23:57:18] DannyS712: does this merit a rollback? if so i'm going to roll back now and call the train stopped here for the day. [23:58:34] no, I don't think so; it just means that edits made via translate, babel autocreate, wikimedia maintenance, migrateCampaigns, and template styles aren't autopatrolled if they should be