[00:08:56] Operations, Privacy Engineering, Wikimedia-Logstash, Privacy: Production logstash should be protected by two-factor auth, at the least - https://phabricator.wikimedia.org/T237630 (JFishback_WMF)
[00:20:10] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1205.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:27:43] (PS1) Krinkle: mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id [puppet] - https://gerrit.wikimedia.org/r/582948 (https://phabricator.wikimedia.org/T247786)
[01:47:40] RECOVERY - MariaDB Slave Lag: s3 on db1095 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[01:51:44] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1292.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:32:47] Operations, ops-codfw, Patch-For-Review: codfw pdu phase inbalances: audit and correct - https://phabricator.wikimedia.org/T163339 (RobH) Open→Resolved
[03:03:05] (CR) Jforrester: [C: +1] "Good to go whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/582939 (https://phabricator.wikimedia.org/T248311) (owner: RhinosF1)
[03:21:42] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:58:18] (PS1) KartikMistry: apertium-isl-eng: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-isl-eng] - https://gerrit.wikimedia.org/r/582956 (https://phabricator.wikimedia.org/T247585)
[06:01:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1087, vslow s8, with weight 1 as it originally had', diff saved to https://phabricator.wikimedia.org/P10753 and previous config saved to /var/cache/conftool/dbconfig/20200324-060133-marostegui.json
[06:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:22] (CR) Marostegui: [C: +1] "As agreed yesterday, let's deploy this today" [puppet] - https://gerrit.wikimedia.org/r/582643 (owner: Muehlenhoff)
[06:16:57] !log Create empty database testreduce on m5 master T245408
[06:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:03] T245408: testreduce_vd database in m5 still in use? - https://phabricator.wikimedia.org/T245408
[06:24:26] (PS1) Marostegui: Revert "install_server: Allow reimage db1077" [puppet] - https://gerrit.wikimedia.org/r/582958
[06:26:15] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage db1077" [puppet] - https://gerrit.wikimedia.org/r/582958 (owner: Marostegui)
[06:35:01] (PS3) Muehlenhoff: Switch dbproxy1019 to public Ferm profile [puppet] - https://gerrit.wikimedia.org/r/582643
[06:38:27] (CR) Muehlenhoff: [C: +2] Switch dbproxy1019 to public Ferm profile [puppet] - https://gerrit.wikimedia.org/r/582643 (owner: Muehlenhoff)
[06:42:53] !log Reboot dbproxy1019
[06:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:06] (PS1) Muehlenhoff: Switch dbproxy1019 to public Ferm profile [puppet] - https://gerrit.wikimedia.org/r/582960
[06:44:41] (PS2) Muehlenhoff: Switch dbproxy1018 to public Ferm profile [puppet] - https://gerrit.wikimedia.org/r/582960
[06:48:06] (CR) Marostegui: [C: +1] Switch dbproxy1018 to public Ferm profile [puppet] - https://gerrit.wikimedia.org/r/582960 (owner: Muehlenhoff)
[06:51:12] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/21548/dbproxy1018.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/582960 (owner: Muehlenhoff)
[06:52:41] (CR) Muehlenhoff: [C: +2] Switch dbproxy1018 to public Ferm profile [puppet] - https://gerrit.wikimedia.org/r/582960 (owner: Muehlenhoff)
[06:53:25] Operations, MediaWiki-General, observability, serviceops: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Joe) >>! In T240685#5993676, @colewhite wrote: > [[ https://docs.datadoghq.com/developers/dogstatsd/datagram_shell/?tab=metrics | DogStatsD ]] shows some promise her...
[06:55:13] !log Reboot dbproxy1018
[06:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:06] (PS1) Muehlenhoff: Make profile::mariadb::ferm_public actually public [puppet] - https://gerrit.wikimedia.org/r/582961
[07:10:44] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/21549/dbproxy1019.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[07:14:48] (CR) Marostegui: [C: +1] Make profile::mariadb::ferm_public actually public [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[07:17:47] (CR) Muehlenhoff: [C: +2] Make profile::mariadb::ferm_public actually public [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[07:33:41] !log restart update-openstack-mirror.service on sodium
[07:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:12] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:50] (PS1) Muehlenhoff: Remove component integration for Puppet 5 / Facter 3 on jessie/stretch [puppet] - https://gerrit.wikimedia.org/r/583028
[07:59:18] (CR) Elukey: [C: +2] profile::analytics::client::limits: add slice settings for cron [puppet] - https://gerrit.wikimedia.org/r/582840 (owner: Elukey)
[07:59:58] (PS1) Muehlenhoff: Switch Cloud VPS/Toolforge to Puppet 5 / Facter 3 [puppet] - https://gerrit.wikimedia.org/r/583030
[08:09:34] (CR) DCausse: [C: +1] wdqs: added monitoring to data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/582784 (owner: Gehel)
[08:18:49] (CR) Jcrespo: "This will break labsdbs- this change as it is makes no sense, as it is the same as profile/mariadb/ferm" [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[08:19:24] (CR) Marostegui: [C: +1] "Break labsdb?" [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[08:20:22] (CR) Jcrespo: "Sorry, I read the patch the other way- it was broken before, but it should work now." [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[08:22:12] (CR) Muehlenhoff: [C: +2] "Yeah, we noticed that when testing db1018/1019 after they were enabled earlier this morning." [puppet] - https://gerrit.wikimedia.org/r/582961 (owner: Muehlenhoff)
[08:25:33] !log Rename wikidatawiki.wb_terms on db1104 - T248086
[08:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:39] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086
[08:28:04] jouncebot: next
[08:28:04] In 2 hour(s) and 31 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1100)
[08:39:23] !log Rename nova database tables on db1133 (m5 master) - T248313
[08:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:28] T248313: Drop nova and nova_api databases from m5 - https://phabricator.wikimedia.org/T248313
[08:54:16] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:02:34] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22369 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:05:01] * RhinosF1 is here all day pretty much, i'll shout when swat starts
[09:26:15] * yuvipanda waves and sends lots of love to people
[09:26:32] * volans waves back to yuvipanda!
[09:27:31] * RhinosF1 waves at volans and yuvipanda
[09:34:51] Operations, Cloud-VPS, Shinken, Graphite, cloud-services-team (Kanban): Clean up labs graphite datapoints - https://phabricator.wikimedia.org/T111540 (hashar) Open→Resolved Supposedly that is fixed by //Shinken now allows undefined data points// https://gerrit.wikimedia.org/r/#/c/oper...
[09:39:05] Operations, Wikimedia-Mailing-lists: Please decom reading-wmf mailing list - https://phabricator.wikimedia.org/T248126 (Volans) Open→Resolved a:Volans Thanks all, resolving. Feel free to follow up for the off-boarding part separately.
[09:40:48] Operations, Commons, Wikimedia-Site-requests, Patch-For-Review: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177 (Volans) p:Triage→Medium
[09:40:50] Operations, Puppet, Patch-For-Review, User-Joe: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (Volans) p:Triage→Medium
[09:41:19] !log Deploy schema change on db2076 (s6)
[09:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:00] Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde) >>! In T247731#5991622, @Volans wrote: > Change merged, added users `theseer`, `spriebsch` and `sebastianbergmann` to the NDA LDAP group. > Ple...
[09:51:23] Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde) Open→Resolved a:darthmon_wmde
[09:59:46] PROBLEM - MariaDB Slave Lag: s6 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1244.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[10:08:04] RECOVERY - MariaDB Slave Lag: s6 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[10:08:40] !log restart blazegraph and updater on wdqs1004
[10:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:58] Operations, netops: Fix LibreNMS alert "CDR bills over 75% used" - https://phabricator.wikimedia.org/T247949 (ayounsi) Open→Resolved Done, with a threshold moved to 90%. And runbook updated. The only limitation is that it will alert for every devices that are part of that traffic bill, which is a...
[10:51:13] (PS1) Jcrespo: mariadb-backups: Stop replication for s3 when snapshotting [puppet] - https://gerrit.wikimedia.org/r/583049 (https://phabricator.wikimedia.org/T138562)
[10:53:01] Operations, ops-eqiad, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (ayounsi) Port down for 7 days `name=asw2-d-eqiad #1: ge-1/0/7; interface - ge-1/0/7; #2: ge-1/0/7; interface - ge-1/0/7; #3: ge-1/0/7; interface - ge-1/0/7; #4: ge-1/0/7; interface - ge-1/0/7; #5: ge-...
[10:55:45] jouncebot: next
[10:55:45] In 0 hour(s) and 4 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1100)
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1100).
[11:00:04] RhinosF1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:16] * RhinosF1 here
[11:02:24] Operations, Puppet, Patch-For-Review, User-Joe: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (jbond) >>! In T247956#5993291, @Volans wrote: > I've added the unique list of keys in the task description. I think those with the `role::` p...
[11:03:44] Operations, Puppet, Patch-For-Review, User-Joe: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (Volans) @jbond yes, AFAIK it's when two different roles include the same profile but want to pass different value to the profile parameter ba...
[11:06:19] Operations, Puppet, Patch-For-Review, User-Joe: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (jbond) >>! In T247956#5994487, @Volans wrote: > @jbond yes, AFAIK it's when two different roles include the same profile but want to pass dif...
[11:09:44] Operations, Puppet, Patch-For-Review, User-Joe: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 (Volans) Yeah, sorry I replied too fast, that's what I was thinking actually, and indeed, we should not have those with `role::` at all.
[11:11:36] RhinosF1: I can SWAT today!
[11:11:48] Urbanecm: ready when you are
[11:12:24] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/582939 (https://phabricator.wikimedia.org/T248311) (owner: RhinosF1)
[11:12:31] RhinosF1: I'll ping you when it's ready
[11:13:18] (Merged) jenkins-bot: Enable visualeditor on hewiktionary by default [mediawiki-config] - https://gerrit.wikimedia.org/r/582939 (https://phabricator.wikimedia.org/T248311) (owner: RhinosF1)
[11:14:45] (CR) Marostegui: [C: +1] mariadb-backups: Stop replication for s3 when snapshotting [puppet] - https://gerrit.wikimedia.org/r/583049 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[11:16:33] Urbanecm: thanks
[11:16:58] RhinosF1: mwdebug1001 has the patch, could you have a look?
[11:17:06] * RhinosF1 on it
[11:18:11] ack
[11:18:34] (PS1) Marostegui: production-m5.sql: Remove nova DB grants [puppet] - https://gerrit.wikimedia.org/r/583052 (https://phabricator.wikimedia.org/T248313)
[11:19:09] Operations: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (jcrespo) Icinga is now showing a favicon. Only tendril is left!
[11:19:26] Urbanecm: working logged out but not logged in, checking I don't have an override in place though.
[11:19:37] hmm interesting
[11:20:38] Urbanecm: ?uselang=en doesn't work so this could be fun
[11:20:50] it does for me?
[11:20:52] https://he.wiktionary.org/wiki/%D7%A9%D7%98%D7%A3_%D7%93%D7%9D?uselang=en
[11:21:22] and VE works as well logged in or not
[11:21:31] RhinosF1: can you still reproduce this issue?
[11:22:03] Urbanecm: I'm getting text editor logged in
[11:22:22] hmm, weird
[11:22:37] going to deploy anyway, it doesn't throw errors and works for at least one of us
[11:22:44] https://usercontent.irccloud-cdn.com/file/TqJmvddq/Screenshot%202020-03-24%20at%2011.22.32.png
[11:23:03] and is there an edit tab, apart from edit source?
[11:23:10] try logging in in incognito too
[11:23:17] (or purge cache of that page)
[11:23:21] just 'edit'
[11:24:12] try what i suggested, this will go live in the meanwhile
[11:25:03] !log urbanecm@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: SWAT: e28c819: Enable visualeditor on hewiktionary by default (T248311) (duration: 01m 03s)
[11:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:10] T248311: Enable Visual Editor by default in the Hebrew Wiktionary - https://phabricator.wikimedia.org/T248311
[11:25:18] * RhinosF1 working on it
[11:25:31] (CR) Jbond: [C: +1] Switch Cloud VPS/Toolforge to Puppet 5 / Facter 3 [puppet] - https://gerrit.wikimedia.org/r/583030 (owner: Muehlenhoff)
[11:26:31] (CR) Jcrespo: [V: +2 C: +2] mariadb-backups: Stop replication for s3 when snapshotting [puppet] - https://gerrit.wikimedia.org/r/583049 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[11:28:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e28c819: Enable visualeditor on hewiktionary by default (T248311) (duration: 00m 59s)
[11:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:21] Urbanecm: working on a new account so I'm guessing it's my preferences.
[11:28:25] probably
[11:28:35] Do you want to lock the test account I just created?
[11:28:53] I don't think it's necessary
[11:29:24] meh, it's got shit password
[11:29:55] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e28c819: Enable visualeditor on hewiktionary by default (T248311; take II) (duration: 00m 59s)
[11:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:09] RhinosF1: change it to a random set of chars and discard ;)
[11:30:16] Urbanecm: will do
[11:30:44] (PS13) Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617)
[11:31:13] RhinosF1: should work with prod now
[11:31:50] nice!
[11:32:39] LGTM
[11:34:20] !log EU SWAT done
[11:34:21] thanks!
[11:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:14] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: support canonical redirects in urlproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[11:42:03] Urbanecm: Thanks to you! Scrambled the password in the best way possible, hit the keyboard.
[11:42:10] nice!
[11:55:37] !log Deploy schema change on db2087 db2089 db2097
[11:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:34] (PS1) Jbond: openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054
[11:57:54] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Romy merdeka' 'Romy_Dwi_Laksono' (T248371)
[11:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:59] T248371: Unblock stuck global renames of Copperqueen, Romy Dwi Laksono, Toroidt, and 사형제 형님 - https://phabricator.wikimedia.org/T248371
[11:58:05] (PS2) Jbond: openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054
[11:58:44] (PS3) Jbond: openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054
[11:58:57] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:01:19] (CR) jerkins-bot: [V: -1] openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:02:50] Operations, Wikimedia-Mailing-lists: Delete email addresses with privileged @domain names from mailing lists at offboarding - https://phabricator.wikimedia.org/T248384 (dr0ptp4kt)
[12:06:21] (CR) WMDE-Fisch: [C: -1] "Since the setting in the one file depends on the setting in the other file this should not be part of one patch. I'll split it up so we ca" [mediawiki-config] - https://gerrit.wikimedia.org/r/581991 (https://phabricator.wikimedia.org/T244863) (owner: Andrew-WMDE)
[12:09:22] (PS1) Jbond: 0.7.1: Ensure we extract tags [software/puppet-compiler] - https://gerrit.wikimedia.org/r/583055
[12:10:30] (CR) Jbond: [C: +2] 0.7.1: Ensure we extract tags [software/puppet-compiler] - https://gerrit.wikimedia.org/r/583055 (owner: Jbond)
[12:10:55] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwiki --logwiki=metawiki 'Erika Greenberg' 'Copperqueen' (T248371)
[12:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:01] T248371: Unblock stuck global renames of Copperqueen, Romy Dwi Laksono, Toroidt, and 사형제 형님 - https://phabricator.wikimedia.org/T248371
[12:12:00] Operations, netops, Wikimedia-Incident: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (ayounsi) LGTM! I forgot one step: write doc :) - There should be a wiki page (eg on https://wikitech.wikimedia.org/wiki/Storm_control) that explains what it is, where/how it's de...
[12:12:44] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:15:09] (PS4) Jbond: openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054
[12:16:16] !log mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Toroid~huwiki' 'Toroidt' (T248371)
[12:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:22] T248371: Unblock stuck global renames of Copperqueen, Romy Dwi Laksono, Toroidt, and 사형제 형님 - https://phabricator.wikimedia.org/T248371
[12:16:59] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:18:09] (CR) jerkins-bot: [V: -1] openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:21:36] (PS5) Jbond: openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054
[12:24:25] (PS6) Jbond: openstack::nova::compute::service: allow managing service [puppet] - https://gerrit.wikimedia.org/r/583054
[12:25:13] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:30:19] (CR) Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/21550" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:30:33] (CR) Jbond: "ready for review" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[12:40:15] (PS1) Hnowlan: changeprop: enable nutcracker in production, reenable rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/583056 (https://phabricator.wikimedia.org/T213193)
[12:55:01] Operations, Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (jbond) @bd808 thanks for the detailed history this does help and i think the reasons given and how we arrived here all seems logical. It sounds like historicity team projects where...
[12:57:22] Operations, Analytics, Research, Traffic, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (Miriam) Thanks @elukey for this summary. There are two macro use-cases for the release/simplified access to article pageview...
[12:58:13] Operations, Analytics, Research, WMF-Legal, User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (elukey)
[13:04:07] !log installing mariadb-10.1 updates from Stretch point release (client/tools/libraries as packaged by Debian, different from wmf-mariadb)
[13:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:48] (CR) Holger Knust: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/582948 (https://phabricator.wikimedia.org/T247786) (owner: Krinkle)
[13:13:34] (PS1) KartikMistry: apertium-mk-bg: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-mk-bg] - https://gerrit.wikimedia.org/r/583060 (https://phabricator.wikimedia.org/T247585)
[13:14:10] (CR) jerkins-bot: [V: -1] apertium-mk-bg: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-mk-bg] - https://gerrit.wikimedia.org/r/583060 (https://phabricator.wikimedia.org/T247585) (owner: KartikMistry)
[13:17:46] Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (MoritzMuehlenhoff)
[13:22:51] !log installing glib2.0 updates from Stretch point release
[13:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:49] (CR) Jhedden: [C: +2] openstack: Configure WSGI to use horizon virtualenv [puppet] - https://gerrit.wikimedia.org/r/582907 (https://phabricator.wikimedia.org/T240852) (owner: Jhedden)
[13:33:51] (PS1) Elukey: Avoid overriding Hadoop's core files to allow IPv6 [puppet] - https://gerrit.wikimedia.org/r/583065 (https://phabricator.wikimedia.org/T244499)
[13:35:30] (PS2) Elukey: Avoid overriding Hadoop's core files to allow IPv6 [puppet] - https://gerrit.wikimedia.org/r/583065 (https://phabricator.wikimedia.org/T244499)
[13:35:43] (CR) CDanis: php-admin: remove dead code for partial opcache invalidation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/577652 (owner: Ori.livneh)
[13:36:20] (PS2) CDanis: dbctl: schema: allow es4 and es5 sections [software/conftool] - https://gerrit.wikimedia.org/r/574272 (https://phabricator.wikimedia.org/T245806)
[13:36:38] Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (MoritzMuehlenhoff)
[13:36:53] Operations: Integrate Stretch 9.12 point update - https://phabricator.wikimedia.org/T244695 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete
[13:38:44] (PS3) Elukey: Avoid overriding Hadoop's core files to allow IPv6 [puppet] - https://gerrit.wikimedia.org/r/583065 (https://phabricator.wikimedia.org/T244499)
[13:39:02] (CR) CDanis: [C: +2] dbctl: schema: allow es4 and es5 sections [software/conftool] - https://gerrit.wikimedia.org/r/574272 (https://phabricator.wikimedia.org/T245806) (owner: CDanis)
[13:40:25] (CR) Elukey: [C: +2] Avoid overriding Hadoop's core files to allow IPv6 [puppet] - https://gerrit.wikimedia.org/r/583065 (https://phabricator.wikimedia.org/T244499) (owner: Elukey)
[13:41:26] (Merged) jenkins-bot: dbctl: schema: allow es4 and es5 sections [software/conftool] - https://gerrit.wikimedia.org/r/574272 (https://phabricator.wikimedia.org/T245806) (owner: CDanis)
[13:48:48] (PS1) Elukey: Restore CDH settings for Hadoop Test [puppet] - https://gerrit.wikimedia.org/r/583069 (https://phabricator.wikimedia.org/T244499)
[13:50:44] (PS2) Alexandros Kosiaris: eventstreams: Remove all conftool data [puppet] - https://gerrit.wikimedia.org/r/566773 (https://phabricator.wikimedia.org/T238658)
[13:53:00] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:53:15] (CR) Jhedden: [C: +2] shinken: remove puppet agent checks [puppet] - https://gerrit.wikimedia.org/r/582612 (https://phabricator.wikimedia.org/T210993) (owner: Jhedden)
[13:56:05] Operations, CX-cxserver, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (WDoranWMF)
[13:59:34] (PS1) Alexandros Kosiaris: eventstreams: Switch scb LVS to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/583071 (https://phabricator.wikimedia.org/T238658)
[14:00:21] Operations, CX-cxserver, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (Mholloway) What's the expected behavior?
[14:01:03] (PS1) Jbond: profile::idp::client::httpd: add check for sso redirect [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743)
[14:01:24] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:03:06] (PS1) Alexandros Kosiaris: eventstreams: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/583072 (https://phabricator.wikimedia.org/T238658)
[14:03:08] (PS1) Alexandros Kosiaris: eventstreams: Switch to service_setup [puppet] - https://gerrit.wikimedia.org/r/583073 (https://phabricator.wikimedia.org/T238658)
[14:03:10] (PS1) Alexandros Kosiaris: eventstreams: Remove old lvs service [puppet] - https://gerrit.wikimedia.org/r/583074 (https://phabricator.wikimedia.org/T238658)
[14:03:12] (PS1) Alexandros Kosiaris: lvs: Rename eventstreams-tls to eventstreams [puppet] - https://gerrit.wikimedia.org/r/583075 (https://phabricator.wikimedia.org/T238658)
[14:03:14] (PS1) Alexandros Kosiaris: eventstreams: Remove from scb role [puppet] - https://gerrit.wikimedia.org/r/583076 (https://phabricator.wikimedia.org/T238658)
[14:03:16] (PS1) Alexandros Kosiaris: eventstreams: Remove profile [puppet] - https://gerrit.wikimedia.org/r/583077 (https://phabricator.wikimedia.org/T238658)
[14:03:47] (Abandoned) Alexandros Kosiaris: DNM: eventstreams: switch lvs to kubernetes [puppet] - https://gerrit.wikimedia.org/r/566772 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris)
[14:05:21] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 22331 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:06:36] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 114.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[14:08:00] Operations, SRE-Access-Requests, serviceops-radar, Core Platform Team (Icebox): Onboarding Hugh Nowlan - https://phabricator.wikimedia.org/T242309 (WDoranWMF)
[14:09:38] (PS2) Jbond: profile::idp::client::httpd: add check for sso redirect [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743)
[14:09:56] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:10:26] (CR) Alexandros Kosiaris: [C: +2] eventstreams: Switch scb LVS to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/583071 (https://phabricator.wikimedia.org/T238658) (owner: Alexandros Kosiaris)
[14:15:59] Operations, CX-cxserver, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (akosiaris) @Mholloway I'd say at least the sta...
[14:16:01] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:16:40] (CR) Alexandros Kosiaris: [C: +1] changeprop: enable nutcracker in production, reenable rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/583056 (https://phabricator.wikimedia.org/T213193) (owner: Hnowlan)
[14:21:35] (CR) Alexandros Kosiaris: [C: +2] mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id [puppet] - https://gerrit.wikimedia.org/r/582948 (https://phabricator.wikimedia.org/T247786) (owner: Krinkle)
[14:21:50] so hmm
[14:21:51] *cough*
[14:21:54] time to cut the branch
[14:22:05] cut away!
[14:22:58] (CR) Jbond: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/21551/console" [puppet] - https://gerrit.wikimedia.org/r/583078 (https://phabricator.wikimedia.org/T245743) (owner: Jbond)
[14:24:22] Operations, MediaWiki-Debug-Logger, observability, Core Platform Team Workboards (Clinic Duty Team), and 2 others: MWExceptionHandler reqId sometimes differs from php-wmerrors reqId - https://phabricator.wikimedia.org/T247786 (Anomie) Open→Resolved a:holger.knust→Krinkle
[14:26:22] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:26:54] !log Branching wmf/1.35.0-wmf.25 # T233873
[14:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:00] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873
[14:27:52] Operations, Puppet, Release-Engineering-Team-TODO, puppet-compiler, Release-Engineering-Team (CI & Testing services): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (jbond) >>! In T166066#5039654, @hashar wrote: > We have a Jenkins job T975...
[14:27:54] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37
[14:27:54] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 100.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[14:28:11] !log Deploy schema change on db2117 (s6)
[14:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:18] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:32:15] ah! the cloudelastic icinga link works
[14:32:18] finally
[14:33:56] :)
[14:36:45] (PS2) WMDE-Fisch: TwoColConflict: Limited default deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/581991 (https://phabricator.wikimedia.org/T244863) (owner: Andrew-WMDE)
[14:36:47] (PS1) WMDE-Fisch: TwoColConflict: Limited default deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863)
[14:36:48] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 108.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37
[14:45:47] (CR) Arturo Borrero Gonzalez: "I wonder why the service fails to start in the first place. It should be running!" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[14:47:24] (CR) Jbond: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/583054 (owner: Jbond)
[14:47:33] Operations, CX-cxserver, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (Mholloway) Thanks @akosiaris. I did some diggin...
[14:48:00] Operations, Puppet, Release-Engineering-Team-TODO, puppet-compiler, Release-Engineering-Team (CI & Testing services): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (jbond) >>! In T166066#5994914, @jbond wrote: >>>! In T166066#5039654, @has...
[14:50:19] (CR) Giuseppe Lavagetto: [C: +1] Bump tls.image_version for envoy update to 1.13.1.
[deployment-charts] - 10https://gerrit.wikimedia.org/r/582880 (https://phabricator.wikimedia.org/T246868) (owner: 10RLazarus) [14:52:10] (03PS2) 10RLazarus: Bump tls.image_version for envoy update to 1.13.1. [deployment-charts] - 10https://gerrit.wikimedia.org/r/582880 (https://phabricator.wikimedia.org/T246868) [14:54:24] (03CR) 10RLazarus: [C: 03+2] Bump tls.image_version for envoy update to 1.13.1. [deployment-charts] - 10https://gerrit.wikimedia.org/r/582880 (https://phabricator.wikimedia.org/T246868) (owner: 10RLazarus) [14:54:41] (03Merged) 10jenkins-bot: Bump tls.image_version for envoy update to 1.13.1. [deployment-charts] - 10https://gerrit.wikimedia.org/r/582880 (https://phabricator.wikimedia.org/T246868) (owner: 10RLazarus) [14:55:07] (03PS3) 10WMDE-Fisch: TwoColConflict: Limited default deployment InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/581991 (https://phabricator.wikimedia.org/T244863) (owner: 10Andrew-WMDE) [14:55:29] !log depooling wdqs1006 to catch up on lag [14:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:42] (03PS2) 10WMDE-Fisch: TwoColConflict: Limited default deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863) [14:55:48] (03PS3) 10WMDE-Fisch: TwoColConflict: Limited default deployment CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583084 (https://phabricator.wikimedia.org/T244863) [14:56:56] 10Operations, 10Security-Team, 10User-jbond: Banning IPs / subnets from accessing login/validation endpoint - https://phabricator.wikimedia.org/T233945 (10jbond) >>! In T233945#5772273, @chasemp wrote: > I wonder if this work supercedes {T224887} as the sole purpose of those rules is to ban IPs / subnets fro... 
[14:59:43] !log scap prep 1.35.0-wmf.25 # T233873 [14:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:49] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [14:59:54] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [15:02:56] (03PS6) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [15:02:58] (03PS1) 10Giuseppe Lavagetto: profile::tlsproxy::envoy: allow defining timeouts, disable retries [puppet] - 10https://gerrit.wikimedia.org/r/583086 [15:03:12] !log Applied patches to 1.35.0-wmf.25 # T233873 [15:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:35] (03PS1) 10Hashar: Group 0 to 1.35.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583087 (https://phabricator.wikimedia.org/T233873) [15:05:43] jouncebot: now [15:05:43] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [15:05:49] jouncebot: next [15:05:49] In 0 hour(s) and 54 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1600) [15:06:06] 10Operations, 10Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (10bd808) >>! In T247517#5994682, @jbond wrote: > > The project would be solely used for prototyping and trialling new software/services once the prototyping phase is over i would env... 
[15:07:14] (03PS7) 10Giuseppe Lavagetto: mw1261: switch to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/580343 [15:08:15] 10Operations, 10Projects-Cleanup, 10Release-Engineering-Team-TODO, 10Traffic, and 2 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10WDoranWMF) [15:17:36] 10Operations, 10Privacy Engineering, 10Traffic, 10Privacy: Disable WMF-Last-Access cookies for wmfusercontent.org - https://phabricator.wikimedia.org/T210167 (10JFishback_WMF) [15:17:37] !log Cleaning old MediaWiki deployments # T233873 [15:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:43] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [15:17:49] (03CR) 10Elukey: [C: 03+2] Restore CDH settings for Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/583069 (https://phabricator.wikimedia.org/T244499) (owner: 10Elukey) [15:21:17] (03PS1) 10Jbond: base::firewall: add a new global abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/583090 (https://phabricator.wikimedia.org/T233945) [15:29:29] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.21 (duration: 24m 00s) [15:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:38] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 73.63 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [15:30:40] 10Operations, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10JFishback_WMF) [15:31:42] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.22 (duration: 02m 02s) [15:31:45] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:13] (03CR) 10Hnowlan: [C: 03+2] changeprop: enable nutcracker in production, reenable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/583056 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [15:33:22] (03PS2) 10Hnowlan: changeprop: enable nutcracker in production, reenable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/583056 (https://phabricator.wikimedia.org/T213193) [15:33:28] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop: enable nutcracker in production, reenable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/583056 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [15:33:34] 10Operations, 10Cloud-VPS (Project-requests): Request creation of SRE VPS project - https://phabricator.wikimedia.org/T247517 (10jbond) >! In T247517#5995043, @bd808 wrote: > > I am pretty sure I understand your use case and desire, but I currently don't see how this will end differently than other loosely sc... [15:33:54] (03CR) 10Andrew Bogott: [C: 03+1] Switch Cloud VPS/Toolforge to Puppet 5 / Facter 3 [puppet] - 10https://gerrit.wikimedia.org/r/583030 (owner: 10Muehlenhoff) [15:34:02] (03Merged) 10jenkins-bot: changeprop: enable nutcracker in production, reenable rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/583056 (https://phabricator.wikimedia.org/T213193) (owner: 10Hnowlan) [15:34:35] (03CR) 10Andrew Bogott: [C: 03+1] production-m5.sql: Remove nova DB grants [puppet] - 10https://gerrit.wikimedia.org/r/583052 (https://phabricator.wikimedia.org/T248313) (owner: 10Marostegui) [15:34:55] 10Operations, 10netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) [15:35:42] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:08] (03CR) 10Andrew Bogott: [C: 04-1] "I created https://phabricator.wikimedia.org/T248395 and will look at it soon." [puppet] - 10https://gerrit.wikimedia.org/r/583054 (owner: 10Jbond) [15:39:32] (03PS1) 10Andrew Bogott: Revert "Neutron l3_agent: refresh on config changes" [puppet] - 10https://gerrit.wikimedia.org/r/583093 [15:40:25] (03CR) 10jerkins-bot: [V: 04-1] Revert "Neutron l3_agent: refresh on config changes" [puppet] - 10https://gerrit.wikimedia.org/r/583093 (owner: 10Andrew Bogott) [15:41:49] (03PS1) 10Hashar: testwiki to 1.35.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583094 (https://phabricator.wikimedia.org/T233873) [15:43:15] Why are TestWiki and group0 done in separate patches? [15:46:40] (03Abandoned) 10Dzahn: bastionhost: replace auth1001 with auth1002 in pam-sshd config [puppet] - 10https://gerrit.wikimedia.org/r/582147 (https://phabricator.wikimedia.org/T234909) (owner: 10Dzahn) [15:47:38] (03CR) 10Hashar: [C: 03+2] testwiki to 1.35.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583094 (https://phabricator.wikimedia.org/T233873) (owner: 10Hashar) [15:48:50] (03Merged) 10jenkins-bot: testwiki to 1.35.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583094 (https://phabricator.wikimedia.org/T233873) (owner: 10Hashar) [15:49:36] (03PS1) 10Andrew Bogott: nova-compute: require python3-libvirt for Queens [puppet] - 10https://gerrit.wikimedia.org/r/583095 (https://phabricator.wikimedia.org/T248395) [15:51:11] (03CR) 10Andrew Bogott: [C: 03+2] nova-compute: require python3-libvirt for Queens [puppet] - 10https://gerrit.wikimedia.org/r/583095 (https://phabricator.wikimedia.org/T248395) (owner: 10Andrew Bogott) [15:52:32] (03PS2) 10Andrew Bogott: Revert "Neutron l3_agent: refresh on config changes" [puppet] - 10https://gerrit.wikimedia.org/r/583093 [15:52:56] (03CR) 10Andrew Bogott: [C: 04-1] "that service is fixed 
on cloudvirt2003-dev. That maybe means this patch can be abandoned" [puppet] - 10https://gerrit.wikimedia.org/r/583054 (owner: 10Jbond) [15:53:26] (03PS2) 10Dzahn: microsites::httpd: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/572352 [15:53:28] (03PS1) 10Dzahn: DHCP/netboot: add otrs1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/583096 (https://phabricator.wikimedia.org/T248028) [15:53:51] (03PS2) 10Dzahn: DHCP/netboot: add otrs1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/583096 (https://phabricator.wikimedia.org/T248028) [15:54:09] (03CR) 10Jbond: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/583054 (owner: 10Jbond) [15:54:21] (03CR) 10Dzahn: [C: 03+2] DHCP/netboot: add otrs1001.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/583096 (https://phabricator.wikimedia.org/T248028) (owner: 10Dzahn) [15:54:25] (03Abandoned) 10Jbond: openstack::nova::compute::service: allow managing service [puppet] - 10https://gerrit.wikimedia.org/r/583054 (owner: 10Jbond) [15:54:59] !log hashar@deploy1001 Started scap: testwiki to 1.35.0-wmf.25 and rebuild l10n cache # T233873 [15:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:04] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [15:56:17] (03PS1) 10RobH: updating mainboard sku new r440 sku for mainboard v2 [software] - 10https://gerrit.wikimedia.org/r/583097 [15:56:56] (03CR) 10RobH: [C: 03+2] updating mainboard sku new r440 sku for mainboard v2 [software] - 10https://gerrit.wikimedia.org/r/583097 (owner: 10RobH) [15:57:24] (03Merged) 10jenkins-bot: updating mainboard sku new r440 sku for mainboard v2 [software] - 10https://gerrit.wikimedia.org/r/583097 (owner: 10RobH) [15:58:32] RECOVERY - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes 
[15:58:35] !log installing OS on otrs1001.eqiad.wmnet (T248028) [15:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:41] T248028: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 [15:59:43] (03PS3) 10Dzahn: microsites::httpd: remove port 80 firewall hole [puppet] - 10https://gerrit.wikimedia.org/r/572352 [16:00:04] godog and _joe_: It is that lovely time of the day again! You are hereby commanded to deploy Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:30] looks at the calendar for that .. it's empty [16:00:44] (03PS1) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [16:00:45] nothing to swat, jouncebot [16:01:25] yeah, the "No GERRIT patches" message is the sign of an empty wiki section for that. [16:01:40] PROBLEM - puppet last run on people1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:01:55] we decided at some point that the announce was good even when the queue is empty, but that could be revisited [16:02:18] ^^ fixing people [16:03:24] RECOVERY - puppet last run on people1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:07:10] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, 10SRE-swift-storage, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10JTannerWMF) There doesn't seem to be an action for t... 
[16:09:17] 10Operations, 10netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10ayounsi) Discussed it with Chris on IRC, LGTM. [16:09:58] 10Operations, 10Deployments, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Deploy rights for Daniel Kinzler - https://phabricator.wikimedia.org/T248324 (10greg) Hello #sre-access-requests : Daniel needs to be added to t... [16:10:04] 10Operations, 10Deployments, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Deploy rights for Daniel Kinzler - https://phabricator.wikimedia.org/T248324 (10greg) p:05Triage→03Medium [16:10:10] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Editing-team, 10SRE-swift-storage, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) >>! In T230245#5995376, @JTannerWMF wrote: >... [16:10:24] wikibugs is lagging [16:17:28] (03PS1) 10Muehlenhoff: Add deneb.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583100 (https://phabricator.wikimedia.org/T248165) [16:25:23] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, 10Traffic: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) a:05Ottomata→03ema [16:26:22] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10fgiunchedi) Myself, @akosiaris @colewhite and @Ottomata met today to bikesh^W understand better what `service` means an... [16:27:03] Reedy: looks like it is doing better now? 
got maybe 5s lag for that latest update ^ [16:27:16] maybe :) [16:31:14] !log installing linux-perf-4.19 updates on buster [16:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/583100 (https://phabricator.wikimedia.org/T248165) (owner: 10Muehlenhoff) [16:40:39] (03PS2) 10Arturo Borrero Gonzalez: dynamicproxy: add support for dynamic XFF per FQDN [puppet] - 10https://gerrit.wikimedia.org/r/583098 (https://phabricator.wikimedia.org/T135046) [16:41:32] !log installing linux-perf updates on stretch [16:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:21] (03PS1) 10Filippo Giunchedi: icinga: update contactgroups and cgi permissions [puppet] - 10https://gerrit.wikimedia.org/r/583102 [16:52:30] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:53:20] (03CR) 10Muehlenhoff: [C: 03+2] Add deneb.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/583100 (https://phabricator.wikimedia.org/T248165) (owner: 10Muehlenhoff) [16:54:15] 10Operations, 10observability, 10Performance-Team (Radar): Decide on `service-runner` aggregated prometheus metrics and use of `service` label - https://phabricator.wikimedia.org/T247820 (10Ottomata) Another example: `lang=yaml chart: eventgate app: eventgate-wikimedia # (or maybe just eventgate?) service:... [16:54:49] (03PS1) 10Dzahn: site: add insetup role to otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/583103 (https://phabricator.wikimedia.org/T248028) [16:58:33] (03PS2) 10Filippo Giunchedi: icinga: remove zhousquared from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/583102 [17:00:04] halfak and accraze: Dear deployers, time to do the Services – Graphoid / Citoid / ORES deploy. Don't look at me like that.
You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1700). [17:00:25] 10Operations: Add favicon to icinga and tendril - https://phabricator.wikimedia.org/T204110 (10Privacybatm) a:03Privacybatm [17:00:33] (03CR) 10Dzahn: [C: 03+2] site: add insetup role to otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/583103 (https://phabricator.wikimedia.org/T248028) (owner: 10Dzahn) [17:00:41] (03PS2) 10Dzahn: site: add insetup role to otrs1001 [puppet] - 10https://gerrit.wikimedia.org/r/583103 (https://phabricator.wikimedia.org/T248028) [17:01:02] (03PS1) 10Muehlenhoff: Add deneb to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/583104 [17:02:47] (03CR) 10Dzahn: [C: 03+1] "https://phabricator.wikimedia.org/T248165" [puppet] - 10https://gerrit.wikimedia.org/r/583104 (owner: 10Muehlenhoff) [17:04:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/583072 (https://phabricator.wikimedia.org/T238658) (owner: 10Alexandros Kosiaris) [17:04:52] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is OK: HTTP OK: HTTP/1.0 200 OK - 22337 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:06:50] (03PS1) 10DannyS712: Enable Special:Investigate on testwiki, and add `investigate` right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) [17:08:35] (03PS1) 10DCausse: [cirrus] force cloudelastic replica count to 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583106 (https://phabricator.wikimedia.org/T231517) [17:08:52] (03PS2) 10DannyS712: Enable Special:Investigate on testwiki, and add `investigate` right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) [17:10:38] !log update cloudelastic-chi replica counts from 2 to 1 T231517 [17:10:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:46] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [17:10:49] (03PS1) 10Muehlenhoff: Remove puppet-common [puppet] - 10https://gerrit.wikimedia.org/r/583108 [17:11:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/583102 (owner: 10Filippo Giunchedi) [17:11:34] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove zhousquared from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/583102 (owner: 10Filippo Giunchedi) [17:12:03] 10Operations, 10vm-requests: eqiad: 1 VM request for OTRS - https://phabricator.wikimedia.org/T248028 (10Dzahn) 05Open→03Resolved a:03Dzahn VM has been created and added to puppet with insetup role. [17:12:07] 10Operations, 10OTRS: Migrate mendelevium/OTRS host to Stretch/Buster - https://phabricator.wikimedia.org/T224590 (10Dzahn) [17:12:51] !log hashar@deploy1001 Finished scap: testwiki to 1.35.0-wmf.25 and rebuild l10n cache # T233873 (duration: 77m 52s) [17:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:59] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [17:13:43] (03PS1) 10Volans: sre.dns.netbox: deploy the changes to gdnsd [cookbooks] - 10https://gerrit.wikimedia.org/r/583109 (https://phabricator.wikimedia.org/T233183) [17:14:52] (03CR) 10Volans: sre.dns.netbox: deploy the changes to gdnsd (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/583109 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:15:32] 10Operations, 10Deployments, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Deploy rights for Daniel Kinzler - https://phabricator.wikimedia.org/T248324 (10Dzahn) Checked the Gerrit side and Daniel is already member of [... 
[17:17:39] (03PS1) 10Dzahn: admins: add Daniel Kinzler to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/583110 (https://phabricator.wikimedia.org/T248324) [17:18:52] (03CR) 10Dzahn: [C: 03+1] "has approval from Greg on ticket, already has shell access, already in wmf-deployment group in Gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/583110 (https://phabricator.wikimedia.org/T248324) (owner: 10Dzahn) [17:18:56] (03PS1) 10Hnowlan: changeprop: Move prometheus query to use regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/583111 (https://phabricator.wikimedia.org/T213193) [17:19:49] (03PS2) 10Hashar: Group 0 to 1.35.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583087 (https://phabricator.wikimedia.org/T233873) [17:20:22] (03PS2) 10Hnowlan: changeprop: Move prometheus query to use regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/583111 (https://phabricator.wikimedia.org/T213193) [17:21:37] 10Operations, 10Wikimedia-Logstash: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (10herron) [17:26:30] (03PS1) 10Herron: elk7: remove index_options:docs from logstash v7 template [puppet] - 10https://gerrit.wikimedia.org/r/583112 (https://phabricator.wikimedia.org/T248400) [17:28:15] 10Operations, 10Wikimedia-Logstash: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (10herron) [17:28:37] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 (10herron) [17:29:47] (03PS1) 10Dzahn: site: decom mw125[0-3] and mw123[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/583114 (https://phabricator.wikimedia.org/T247780) [17:31:45] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: elk7: fields indexed without position data; cannot run PhraseQuery - https://phabricator.wikimedia.org/T248400 
(10herron) In my testing simply removing instances of `"index_options":"docs"` from the logstash template addresses the issue, please see http... [17:33:41] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 77.51 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [17:43:03] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Hardware): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (10JHedden) Any updates on this? [17:48:42] (03PS2) 10KartikMistry: apertium-mk-bg: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-mk-bg] - 10https://gerrit.wikimedia.org/r/583060 (https://phabricator.wikimedia.org/T247585) [17:49:20] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (10Volans) p:05Triage→03Medium [17:57:49] (03CR) 10Volans: [C: 03+2] admins: add Daniel Kinzler to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/583110 (https://phabricator.wikimedia.org/T248324) (owner: 10Dzahn) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1800) [18:00:39] 10Operations, 10Deployments, 10SRE-Access-Requests, 10Patch-For-Review, and 2 others: Deploy rights for Daniel Kinzler - https://phabricator.wikimedia.org/T248324 (10Volans) 05Open→03Resolved a:03Volans The change was merged. Let at least 30 minutes pass to make sure that it has been applied everywhe... 
[18:01:11] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 76.27 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [18:19:55] (03PS1) 10Jforrester: Set wgTmhUseBetaFeatures in IS and vary it for test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583119 [18:19:57] (03PS1) 10Jforrester: [Beta Cluster] Force the videojs player for all users and disable the beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583120 (https://phabricator.wikimedia.org/T100106) [18:21:16] jouncebot: now [18:21:16] For the next 0 hour(s) and 38 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1800) [18:21:21] Boo. [18:21:49] (03CR) 10jerkins-bot: [V: 04-1] Set wgTmhUseBetaFeatures in IS and vary it for test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583119 (owner: 10Jforrester) [18:32:22] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583119 (owner: 10Jforrester) [18:44:21] !lo repooling wdqs1006, caught up on lag [18:49:05] !log repooling wdqs1006, caught up on lag [18:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:44] James_F: hop [18:49:48] Hoo* [18:50:27] Hey. [18:51:24] James_F: Thanks for last night btw, it got deployed and works! I’m going to copy what I actually ended up doing to a guide somewhere for next time. [18:51:38] Cool. [18:55:17] (03CR) 10Andrew Bogott: [C: 03+1] Remove puppet-common [puppet] - 10https://gerrit.wikimedia.org/r/583108 (owner: 10Muehlenhoff) [19:00:04] twentyafterfound and dduvall: May I have your attention please! Mediawiki train - American Version.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T1900) [19:00:45] (03CR) 10BBlack: [C: 03+1] sre.dns.netbox: deploy the changes to gdnsd [cookbooks] - 10https://gerrit.wikimedia.org/r/583109 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:10:59] 10Operations, 10ops-codfw, 10Traffic: (Need by: TBD) rack/setup/install cp202[7-9], cp203[0-9], cp204[0-2] - https://phabricator.wikimedia.org/T247340 (10Papaul) [19:21:26] (03CR) 10Dbarratt: Enable Special:Investigate on testwiki, and add `investigate` right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) (owner: 10DannyS712) [19:21:39] (03PS1) 1020after4: group0 wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583126 [19:21:41] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583126 (owner: 1020after4) [19:22:50] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.25 refs T233873 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583126 (owner: 1020after4) [19:23:03] (03CR) 10DannyS712: Enable Special:Investigate on testwiki, and add `investigate` right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) (owner: 10DannyS712) [19:23:44] (03PS1) 10Jhedden: Revert "openstack: switch cloudvirt101[56] to ceph storage" [puppet] - 10https://gerrit.wikimedia.org/r/583127 [19:28:21] !log twentyafterfour@deploy1001 scap failed: average error rate on 7/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [19:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:22] !log rolling back to wmf.24 due to high error rate refs T233873 [19:29:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:27] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [19:29:39] !log twentyafterfour@deploy1001 scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [19:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:02] (03CR) 10Jhedden: [C: 03+2] Revert "openstack: switch cloudvirt101[56] to ceph storage" [puppet] - 10https://gerrit.wikimedia.org/r/583127 (owner: 10Jhedden) [19:30:11] Uncaught ExtensionDependencyError: EventLogging requires EventStreamConfig to be installed. [19:30:19] (03PS2) 10Jhedden: Revert "openstack: switch cloudvirt101[56] to ceph storage" [puppet] - 10https://gerrit.wikimedia.org/r/583127 [19:35:16] twentyafterfour: Oops. it isn't? [19:35:20] * James_F looks. [19:35:46] James_F: apparently not [19:35:51] Eurgh, yeah. [19:35:57] maybe something went wrong with branching? [19:35:58] * James_F blames himself, but also ottomata. [19:36:03] hahah [19:36:09] ottomata: Is it OK to enable EventStreamConfig on all wikis? [19:36:16] It just provides the config, right? 
[19:36:20] (03CR) 10Dbarratt: Enable Special:Investigate on testwiki, and add `investigate` right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) (owner: 10DannyS712) [19:36:43] (03CR) 10Herron: [C: 03+2] "moving forward with this to put a fix in place before indices roll over at midnight" [puppet] - 10https://gerrit.wikimedia.org/r/583112 (https://phabricator.wikimedia.org/T248400) (owner: 10Herron) [19:38:41] (03PS1) 10Jforrester: Enable EventStreamConfig everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583131 (https://phabricator.wikimedia.org/T248409) [19:38:41] twentyafterfour: ^^ that'll fix it, but… [19:39:04] there's always a but... [19:40:31] James_F is it better to revert the dependency or expand deployment of EventStreamConfig? [19:41:25] (03CR) 10DannyS712: Enable Special:Investigate on testwiki, and add `investigate` right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583105 (https://phabricator.wikimedia.org/T247645) (owner: 10DannyS712) [19:45:00] 10Operations, 10Deployments, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Deploy rights for Daniel Kinzler - https://phabricator.wikimedia.org/T248324 (10daniel) Thank you! [20:07:39] DannyS712: Probably expand the deployment. [20:07:58] okay. I figured reverting would be safer [20:09:57] (03CR) 10CDanis: [C: 03+1] icinga: remove zhousquared from contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/583102 (owner: 10Filippo Giunchedi) [20:12:12] (03PS1) 10CDanis: depool eqsin for router maintenance [dns] - 10https://gerrit.wikimedia.org/r/583134 (https://phabricator.wikimedia.org/T248394) [20:12:19] James_F: you never finished your "but..." [20:12:27] is there something I should be aware of? [20:12:45] twentyafterfour: No, just that I'm worried that we didn't do this at first. 
[20:12:56] I was hoping that ottomata would show up and give a go/no-go. [20:13:07] In his absence, I'm reading the code to try to work out if it'll be bad. [20:14:01] FWICT, it's safe to deploy. [20:14:05] Let's do it? [20:14:13] (03CR) 10CDanis: [C: 03+2] depool eqsin for router maintenance [dns] - 10https://gerrit.wikimedia.org/r/583134 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [20:15:12] twentyafterfour: Want me to deploy? [20:15:28] James_F: I can deploy it [20:15:42] Cool. [20:15:54] Note that you'll want to sync CS before IS. [20:17:31] (03PS2) 10Herron: ELK7: require disktype "hdd" for new indices [puppet] - 10https://gerrit.wikimedia.org/r/579338 (https://phabricator.wikimedia.org/T247376) [20:19:11] (03CR) 1020after4: [C: 03+2] Enable EventStreamConfig everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583131 (https://phabricator.wikimedia.org/T248409) (owner: 10Jforrester) [20:19:24] !log eqsin depooled for router maintenance at 16:15 [20:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:09] (03Merged) 10jenkins-bot: Enable EventStreamConfig everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583131 (https://phabricator.wikimedia.org/T248409) (owner: 10Jforrester) [20:20:35] PROBLEM - Host analytics1044 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:06] note: I didn't deploy anything, ^ is unrelated to train [20:21:31] * volans looking [20:21:43] could be related to router maintenance? [20:23:16] so far no ping no ssh and black console [20:23:18] no [20:23:20] I'd say it just failed [20:23:50] elukey, ottomata: analytics1044 is unresponding (ping, ssh, console). Ok to powercycle? [20:25:32] seems a broken disk that tripped the controller [20:25:45] Disk 2 in Backplane 1 of Integrated RAID Controller 1 was reset. [20:25:57] (03CR) 10Ottomata: "<3 thank you! Sorry about that!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/583131 (https://phabricator.wikimedia.org/T248409) (owner: 10Jforrester) [20:26:23] mmmh it's loging that message since a month [20:26:34] volans: yes please thank you! [20:26:37] !log commit flow-table-size on cr2-eqsin T248394 [20:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:43] T248394: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 [20:27:33] !log force rebooting analytics1044 from console, host down and unreachable (ping, ssh, console) [20:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:14] !log twentyafterfour@deploy1001 Synchronized wmf-config/CommonSettings.php: sync CommonSettings before InitialiseSettings refs T248409 (duration: 00m 58s) [20:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:19] T248409: Uncaught ExtensionDependencyError: EventLogging requires EventStreamConfig to be installed. in /srv/mediawiki/php-1.35.0-wmf.25/includes/registration/ExtensionRegistry.php:398 - https://phabricator.wikimedia.org/T248409 [20:28:56] (03CR) 10Ori.livneh: php-admin: remove dead code for partial opcache invalidation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/577652 (owner: 10Ori.livneh) [20:30:27] !log twentyafterfour@deploy1001 Synchronized wmf-config: Now sync InitializeSettings* refs T248409 (duration: 00m 59s) [20:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:34] mmmh ottomata no good news, VD 03 is mising at reboot [20:31:09] (03CR) 10Nuria: [C: 03+1] "Nice, we need this for our next step. Super thanks." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/583131 (https://phabricator.wikimedia.org/T248409) (owner: 10Jforrester) [20:31:16] !log rebooting cr2-eqsin T248394 [20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:32] James_F: twentyafterfour hi sorry, was doing an interview [20:31:41] yes, enabling everywhere should be fine [20:32:03] ottomata: good because that's what we have done ;) [20:32:06] we just haven't gone farther because we hadn't needed to yet, we should have done that before I merged that EL change [20:32:10] great, thank ypou ery much [20:32:13] !log twentyafterfour@deploy1001 Synchronized wmf-config: Now touch and sync again because of settings cache rache condition. refs T248409 (duration: 00m 59s) [20:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:24] volans: ahhhh well. :/ its a hadoop worker so it'll just be missing for a whhlie. [20:32:33] that is also a node we are going to replace next quarter [20:32:46] volans: no task yet right? shall I create one? [20:33:00] 10Operations, 10ops-eqiad, 10Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (10Volans) p:05Triage→03Medium [20:33:01] ottomata: ^^^ [20:33:09] that's what the console is telling me [20:33:28] we can proceed without the VD or go into the config utility [20:35:57] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: Attempt #2: group0 wikis to 1.35.0-wmf.25 refs T233873 [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:02] T233873: 1.35.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T233873 [20:36:29] oh, volans i betcha w can proceed...the vd is likely one of the data disks, which we can tolerate a failure or two of [20:36:48] worth a try at least [20:37:01] the disks are in JBOD too [20:37:02] mind to comment on task? 
ok worst case we lose the node :) [20:37:04] yeah [20:37:05] k [20:37:15] 10Operations, 10Cassandra, 10Core Platform Team (Icebox), 10Patch-For-Review, 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10CCicalese_WMF) [20:37:16] commenting [20:38:25] (03CR) 10Hashar: "recheck" [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553741 (owner: 10Hashar) [20:38:34] (03CR) 10Hashar: "recheck" [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735 (owner: 10Hashar) [20:39:05] 10Operations, 10ops-eqiad, 10Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (10Ottomata) I say proceed! This should be one of the Hadoop data disks, which just uses JBOD. Hadoop is configured to tolerate 1 (or some?) disk failures (the data on the failed disk should b... [20:39:53] (03CR) 10Hashar: "recheck" [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar) [20:39:55] ottomata: lol, the press any key led me to the config anyway :D [20:41:32] (03PS1) 10CDanis: Revert "depool eqsin for router maintenance" [dns] - 10https://gerrit.wikimedia.org/r/583140 (https://phabricator.wikimedia.org/T248394) [20:42:03] twentyafterfour: Prod clear? [20:42:23] James_F: seems like it [20:43:07] Cool. [20:43:13] (03CR) 10Hashar: "That is due to docker-pkg 2.0.0 :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580124 (owner: 10Hashar) [20:43:15] (03PS2) 10Jforrester: Set wgTmhUseBetaFeatures in IS and vary it for test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583119 [20:43:39] (03CR) 10Hashar: "I have added as reviewers a few people that have touched images/python* files."
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [20:44:13] ottomata: ok, re-rebooting now, in theory without that VD, and confirmed that the RAID1 of the OS is in optimal state [20:44:34] (03Abandoned) 10Hashar: Group 0 to 1.35.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583087 (https://phabricator.wikimedia.org/T233873) (owner: 10Hashar) [20:45:30] 10Operations, 10ops-eqiad, 10Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (10Volans) Ack, the host led me into the config anyway, even pressing any key. The configuration was already skipping VD3, and confirmed that the RAID1 where the OS resides is in optimal state.... [20:46:00] (03CR) 10Jforrester: [C: 03+2] Set wgTmhUseBetaFeatures in IS and vary it for test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583119 (owner: 10Jforrester) [20:46:10] k, once it is up i'll log in and check a couple of things [20:46:17] (03PS2) 10Jforrester: [Beta Cluster] Force the videojs player for all users and disable the beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583120 (https://phabricator.wikimedia.org/T100106) [20:47:03] (03Merged) 10jenkins-bot: Set wgTmhUseBetaFeatures in IS and vary it for test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583119 (owner: 10Jforrester) [20:48:20] (03CR) 10jerkins-bot: [V: 04-1] [Beta Cluster] Force the videojs player for all users and disable the beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583120 (https://phabricator.wikimedia.org/T100106) (owner: 10Jforrester) [20:49:08] 10Operations, 10ops-eqiad, 10Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (10Volans) Ok, now the message is more readable: ` There are offline or missing virtual drives with preserved cache. Please check the cables and ensure that all drives are present. Press any key... 
[20:49:11] ottomata: new issue ^^^ [20:49:18] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wgTmhUseBetaFeatures to vary by wiki (duration: 01m 06s) [20:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:34] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s) [20:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:37] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Force the videojs player for all users and disable the beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583120 (https://phabricator.wikimedia.org/T100106) (owner: 10Jforrester) [20:52:01] (03CR) 10CDanis: [C: 03+2] Revert "depool eqsin for router maintenance" [dns] - 10https://gerrit.wikimedia.org/r/583140 (https://phabricator.wikimedia.org/T248394) (owner: 10CDanis) [20:52:16] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for cp2027 to cp2042 [dns] - 10https://gerrit.wikimedia.org/r/583144 [20:52:30] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Don't hard-set wgTmhUseBetaFeatures to true, let it vary by wiki (duration: 01m 07s) [20:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:00] volans: hm ok...? 
[20:53:04] log repool eqsin [20:53:06] !log repool eqsin [20:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:10] (03Merged) 10jenkins-bot: [Beta Cluster] Force the videojs player for all users and disable the beta feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583120 (https://phabricator.wikimedia.org/T100106) (owner: 10Jforrester) [20:59:40] ottomata: I'm checking with dcops, as the disk is seen as foreign and there is an option to clear that foreign status, but was to check that's ok first [21:00:30] ok [21:00:48] that usually happens if you replace a disk with another disk that was in another raid config [21:00:49] volans: yeah i dunno much about this, but since this is a hadoop worker node you can't mess too much up :p [21:00:52] 10Operations, 10netops, 10Patch-For-Review: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) [21:00:54] ah interesting [21:00:56] ahahah ok [21:01:27] there are 53 of these nodes, losing one for a bit will be ok :) [21:02:18] * volans feels 1,88% of the pressure :-P [21:02:26] hahahha [21:02:31] hahah [21:02:32] lol [21:14:48] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10WDoranWMF) 05Stalled→03Declined [21:14:50] 10Operations, 10Release Pipeline, 10serviceops, 10Goal, 10Release-Engineering-Team (Pipeline): Self-service Deployment Pipeline - https://phabricator.wikimedia.org/T228676 (10WDoranWMF) [21:14:56] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10WDoranWMF) [21:15:02] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: 
RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10WDoranWMF) 05Open→03Declined [21:15:06] 10Operations, 10Release Pipeline, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10WDoranWMF) [21:15:44] (03PS1) 10Krinkle: Convert test2wiki from EventLogging repo to client-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583149 (https://phabricator.wikimedia.org/T196309) [21:17:37] (03CR) 10jerkins-bot: [V: 04-1] Convert test2wiki from EventLogging repo to client-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583149 (https://phabricator.wikimedia.org/T196309) (owner: 10Krinkle) [21:18:46] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: unify horizon glance policy with actual glance policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582846 (https://phabricator.wikimedia.org/T247575) (owner: 10Andrew Bogott) [21:19:44] 10Operations, 10netops: roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - https://phabricator.wikimedia.org/T248394 (10CDanis) Deployed on cr2-eqsin: `cdanis@cr2-eqsin> show services accounting status inline-jflow fpc-slot 0 Status information FPC Slot: 0 IPV4 ex... 
[21:23:37] (03PS8) 10Andrew Bogott: Horizon: integrate Horizon policy with actual designate policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582847 (https://phabricator.wikimedia.org/T247575) [21:24:56] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: integrate Horizon policy with actual designate policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582847 (https://phabricator.wikimedia.org/T247575) (owner: 10Andrew Bogott) [21:25:33] (03PS8) 10Andrew Bogott: Horizon: integrate neutron policy with the actual Neutron policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582848 (https://phabricator.wikimedia.org/T247575) [21:30:15] (03PS2) 10Krinkle: Convert test2wiki from EventLogging repo to client-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/583149 (https://phabricator.wikimedia.org/T196309) [21:33:11] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: integrate neutron policy with the actual Neutron policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582848 (https://phabricator.wikimedia.org/T247575) (owner: 10Andrew Bogott) [21:33:47] (03PS3) 10Andrew Bogott: Keystone policy: merge keystone policy with horizon identity policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582899 (https://phabricator.wikimedia.org/T247575) [21:35:56] James_F: when is https://phabricator.wikimedia.org/T248418 going ahead? 
I’ll add it to Tech/News if you want [21:38:47] ottomata: it's finally be booting up right now, should be available very shortly for ssh [21:38:54] if you could have a look [21:39:14] at login now [21:39:16] (03CR) 10Andrew Bogott: [C: 03+2] Keystone policy: merge keystone policy with horizon identity policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/582899 (https://phabricator.wikimedia.org/T247575) (owner: 10Andrew Bogott) [21:39:37] RECOVERY - Host analytics1044 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [21:41:22] 10Operations, 10ops-eqiad, 10Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (10Volans) This is related to T245910, the same disk failed then and from that moment it started logging into racadm some failure that I think led the controller fail today. I've cleared the for... [21:41:45] ottomata: more details in the task ^^^ [21:41:59] RhinosF1: Not immediately. [21:42:34] James_F: should it be in the not ready column then? [21:43:25] RECOVERY - MegaRAID on analytics1044 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:46:47] RhinosF1: Yeah. [21:47:06] Or we could announce that it's coming and then announce that it's done. [21:47:08] * James_F shrugs. [21:48:18] 10Operations, 10Citoid, 10serviceops, 10Core Platform Team (Icebox): Citoid is logging all request / response headers as separate fields - https://phabricator.wikimedia.org/T239713 (10WDoranWMF) [21:49:36] James_F: I’ll leave a note for JohanJ [21:50:19] * James_F nods. 
[21:51:55] (03PS1) 10Andrew Bogott: Keystone policy: duplicate Queens policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/583153 (https://phabricator.wikimedia.org/T247575) [21:53:24] (03CR) 10Andrew Bogott: [C: 03+2] Keystone policy: duplicate Queens policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/583153 (https://phabricator.wikimedia.org/T247575) (owner: 10Andrew Bogott) [22:01:55] PROBLEM - Host mc2023 is DOWN: PING CRITICAL - Packet loss = 100% [22:03:04] (03PS1) 10Andrew Bogott: nova policy: add sudorule policy rules for Horizon [puppet] - 10https://gerrit.wikimedia.org/r/583157 (https://phabricator.wikimedia.org/T247575) [22:03:29] RECOVERY - Host mc2023 is UP: PING OK - Packet loss = 0%, RTA = 36.23 ms [22:03:51] host got rebooted [22:05:00] (03CR) 10Andrew Bogott: [C: 03+2] nova policy: add sudorule policy rules for Horizon [puppet] - 10https://gerrit.wikimedia.org/r/583157 (https://phabricator.wikimedia.org/T247575) (owner: 10Andrew Bogott) [22:12:13] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:26] 10Operations, 10ops-eqiad, 10Analytics: analytics1044 hardware failure - https://phabricator.wikimedia.org/T248413 (10Ottomata) Looks ok from here! Both hadoop-hdfs-datanode and hadoop-yarn-nodemanager seem ok. [22:14:01] thanks for all that volans :) [22:15:44] (03CR) 10Bstorm: [C: 03+2] k8s: purge flannel from the environment [puppet] - 10https://gerrit.wikimedia.org/r/582090 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [22:18:48] yw ottomata, hope the host is ok [22:24:39] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Evening SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200324T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:07:26] (03PS1) 10Bstorm: toolforge: clean up the now-unneeded ferm_handlers [puppet] - 10https://gerrit.wikimedia.org/r/583165 (https://phabricator.wikimedia.org/T246689) [23:09:07] (03PS1) 10Mholloway: Update config template for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/583166 [23:11:52] (03PS2) 10Mholloway: Update config template for mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/583166 [23:22:04] (03CR) 10Bstorm: "PCC https://puppet-compiler.wmflabs.org/compiler1003/21555/tools-proxy-05.tools.eqiad.wmflabs/" [puppet] - 10https://gerrit.wikimedia.org/r/583165 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [23:27:23] (03CR) 10Bstorm: [C: 03+2] toolforge: clean up the now-unneeded ferm_handlers [puppet] - 10https://gerrit.wikimedia.org/r/583165 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [23:46:36] (03PS1) 10Bstorm: toolforge cleanup: remove the ferm_handlers profile [puppet] - 10https://gerrit.wikimedia.org/r/583170 (https://phabricator.wikimedia.org/T246689)