[00:00:36] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [00:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:25] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:50] (03PS4) 10CRusnov: base/apt-upgrade-activity.py: Port to Python3 [puppet] - 10https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) [00:23:05] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2034.codfw.wmnet'] ` and were **ALL** successful. [00:25:32] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [00:26:27] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) 05Open→03Resolved @Marostegui all yours [00:29:11] (03PS1) 10Dzahn: site: add parsoid role to all new parsoid hardware [puppet] - 10https://gerrit.wikimedia.org/r/626521 (https://phabricator.wikimedia.org/T247441) [00:31:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:31:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:31:39] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [00:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:46] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:31:48] !log milimetric@deploy1001 Started deploy [analytics/refinery@7f5a6ca]: Regular analytics weekly train [analytics/refinery@7f5a6ca] [00:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [00:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:46] !log generating mcrouter certs for parse2001 - parse2019 - mcrouter_generate_certs on puppetmaster1001 (T247441) [00:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:52] T247441: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 [00:40:13] !log milimetric@deploy1001 Finished deploy [analytics/refinery@7f5a6ca]: Regular analytics weekly train [analytics/refinery@7f5a6ca] (duration: 08m 25s) [00:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:12] !log milimetric@deploy1001 Started deploy [analytics/refinery@7f5a6ca] (thin): Regular analytics weekly train THIN [analytics/refinery@7f5a6ca] [00:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:20] !log milimetric@deploy1001 Finished deploy [analytics/refinery@7f5a6ca] (thin): Regular analytics weekly train THIN [analytics/refinery@7f5a6ca] (duration: 00m 08s) [00:41:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:46] (03PS1) 10Dzahn: add fake keys for new parsoid hosts [labs/private] - 10https://gerrit.wikimedia.org/r/626523 (https://phabricator.wikimedia.org/T247441) [00:46:27] (03PS2) 10Dzahn: add fake keys for new parsoid hosts [labs/private] - 10https://gerrit.wikimedia.org/r/626523 (https://phabricator.wikimedia.org/T247441) [00:46:42] (03CR) 10Dzahn: [C: 03+2] add fake keys for new parsoid hosts [labs/private] - 10https://gerrit.wikimedia.org/r/626523 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [00:46:45] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake keys for new parsoid hosts [labs/private] - 10https://gerrit.wikimedia.org/r/626523 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [00:58:49] (03PS2) 10Dzahn: site: add parsoid role to all new parsoid hardware [puppet] - 10https://gerrit.wikimedia.org/r/626521 (https://phabricator.wikimedia.org/T247441) [01:01:44] (03CR) 10Dzahn: [C: 03+2] site: add parsoid role to all new parsoid hardware [puppet] - 10https://gerrit.wikimedia.org/r/626521 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [01:24:10] !log milimetric@deploy1001 Started deploy [analytics/refinery@6057f20]: Simple hql syntax fix [01:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:29:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[01:32:19] !log milimetric@deploy1001 Finished deploy [analytics/refinery@6057f20]: Simple hql syntax fix (duration: 08m 09s) [01:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:30] !log milimetric@deploy1001 Started deploy [analytics/refinery@6057f20] (thin): Simple hql syntax fix [01:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:32:37] !log milimetric@deploy1001 Finished deploy [analytics/refinery@6057f20] (thin): Simple hql syntax fix (duration: 00m 07s) [01:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:20] !log initial puppet runs on parse2001 - parse2010, staggered, not in production yet, new hardware, setup WIP (T247441) [01:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:26] T247441: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 [01:42:54] !log mw2296 - systemctl restart apache2 - rescheduled icinga alerts for apache and php-fpm [01:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:06] RECOVERY - PHP7 rendering on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:45:06] RECOVERY - Apache HTTP on mw2296 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:45:44] !log mw2296 - restarted php7.2-fpm [01:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Icinga: icinga alert cleanup for power switches in eqiad - https://phabricator.wikimedia.org/T262629 (10Dzahn) [01:50:02] ACKNOWLEDGEMENT - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T262629 [01:50:07] 
ACKNOWLEDGEMENT - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T262629 [01:51:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Icinga: icinga alert cleanup for power switches in eqiad - https://phabricator.wikimedia.org/T262629 (10Dzahn) [01:53:31] !log initial puppet runs on parse2010 - parse2020, staggered, not in production yet, new hardware, setup WIP (T247441) [01:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:37] T247441: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 [01:53:56] !log ACKed alerts for eqiad power switches after making T262629 [01:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:03] T262629: icinga alert cleanup for power switches in eqiad - https://phabricator.wikimedia.org/T262629 [01:54:38] !log downtimes 48h for parse* hosts not in production yet but getting icinga checks from applied role [01:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [01:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [02:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [02:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [02:31:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [02:31:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [02:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [02:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:04] off now. (parsoid hosts are not in production in any way yet also looking mostly good in icinga and are downtimed) [02:34:16] (parse* that is) [03:08:34] Does https://phabricator.wikimedia.org/T262628 qualify as a security bug that should be made not-public? [04:59:28] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10Marostegui) p:05Triage→03Medium [05:02:56] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Marostegui) Thank you @Papaul - they all look good to me. 
[05:10:41] (03PS1) 10Marostegui: install_server: Do not reimage es20[26-34] [puppet] - 10https://gerrit.wikimedia.org/r/626551 (https://phabricator.wikimedia.org/T261717) [05:11:55] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage es20[26-34] [puppet] - 10https://gerrit.wikimedia.org/r/626551 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [05:23:13] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:23:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:27] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:08:27] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:29:04] * elukey doesn't see alter table commands running, must be friday [06:37:19] (03CR) 10Elukey: [C: 03+1] "Left a comment about versioning, but feel free to move forward if needed. The rest looks very good to me." 
(031 comment) [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [06:41:27] mmm the OSPF alerts seem to be Zayo related [06:42:52] (03CR) 10Volans: [C: 04-1] "Thanks! Few minor things inline, with some of those the package builds correctly on deneb although it seems that there is a minor issue wit" (038 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [06:43:52] ah ok now there is an email in noc@, unscheduled maintenance [06:50:16] 10Operations, 10Puppet: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10Volans) @MoritzMuehlenhoff would forcing the first puppet run to use the v4 address that is not changing be an acceptable workaround? I did a quick try and using `--sourceaddress $IPv4`... [06:53:01] 10Operations, 10Puppet: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10MoritzMuehlenhoff) That's a good idea, AFAICT that should reliably bypass the issue! [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200911T0700) [07:02:08] (03CR) 10Muehlenhoff: [C: 03+2] Install git instead of git-core [puppet] - 10https://gerrit.wikimedia.org/r/626406 (owner: 10Muehlenhoff) [07:02:40] !log remove git-core from stretch systems, it's a transition package no longer provided by the 2.20 backport from Buster [07:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:45] (03PS1) 10Volans: wmf-auto-reimage: first Puppet run, force IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/626553 (https://phabricator.wikimedia.org/T262609) [07:05:53] (03CR) 10jerkins-bot: [V: 04-1] wmf-auto-reimage: first Puppet run, force IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/626553 (https://phabricator.wikimedia.org/T262609) (owner: 10Volans) [07:06:46] (03PS2) 10Volans: wmf-auto-reimage: first Puppet run, force IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/626553 (https://phabricator.wikimedia.org/T262609) [07:07:29] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10jcrespo) Both global and session status seem to be doing the right thing (global status checked stopping replication), so it is not the server (x1... 
[07:08:14] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db2141 after being provisioned [puppet] - 10https://gerrit.wikimedia.org/r/626554 (https://phabricator.wikimedia.org/T257551) [07:17:10] !log rebooting ldap-corp server for kernel update [07:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [07:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:53] (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [software/cumin] - 10https://gerrit.wikimedia.org/r/626389 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [07:23:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:00] (03CR) 10Gehel: [C: 03+1] "LGTM, trivial enough" [software/cumin] - 10https://gerrit.wikimedia.org/r/626390 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [07:26:46] (03CR) 10Volans: "Few comments inline, in general looks good although I have very little knowledge of the related context. 
Two general questions:" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [07:37:11] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:37:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:39:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [07:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:53] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Gilles) [07:44:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [07:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:34] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10jcrespo) Discarding also Grafana dashboards: ` irate(mysql_global_status_commands_total{instance="$server:$port",command="update"}[5m... [07:50:52] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10jcrespo) However, while I can see the related updates later on, they are around 1 per second, not the large amount shown on the master, and not eno... [07:53:37] (03PS1) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [07:54:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, let's give it a shot." 
[puppet] - 10https://gerrit.wikimedia.org/r/626553 (https://phabricator.wikimedia.org/T262609) (owner: 10Volans) [07:57:12] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications on db2141 after being provisioned [puppet] - 10https://gerrit.wikimedia.org/r/626554 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [07:58:10] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1003/25031/kubestage1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [07:58:25] (03CR) 10Muehlenhoff: Add ::profile::base::linux419 to set up kernel 4.19 on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [07:59:37] !log remove BGP to AS64271 in AMS-IX (see peering@ email) [07:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:48] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [08:00:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:30] (03PS2) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [08:01:49] (03CR) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [08:03:33] (03PS1) 10Elukey: profile::oozie::server: set admin list for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/626595 [08:04:41] (03CR) 10jerkins-bot: [V: 04-1] profile::oozie::server: set admin list for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/626595 (owner: 
10Elukey) [08:05:57] (03PS2) 10Elukey: profile::oozie::server: set admin list for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/626595 [08:06:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/626389 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [08:08:25] (03PS3) 10Elukey: profile::oozie::server: set admin list for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/626595 [08:13:18] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/25033/" [puppet] - 10https://gerrit.wikimedia.org/r/626595 (owner: 10Elukey) [08:13:22] (03CR) 10Elukey: [C: 03+2] profile::oozie::server: set admin list for the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/626595 (owner: 10Elukey) [08:16:02] (03CR) 10Elukey: [C: 03+1] wmf-auto-reimage: first Puppet run, force IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/626553 (https://phabricator.wikimedia.org/T262609) (owner: 10Volans) [08:23:27] (03CR) 10Elukey: Add geoip::data::puppet to profile::piwik::instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626481 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [08:24:05] (03CR) 10Jbond: [C: 03+2] admin: Update ebernhardson home files [puppet] - 10https://gerrit.wikimedia.org/r/626474 (owner: 10Ebernhardson) [08:25:40] (03CR) 10Elukey: "Very nice start, I added some comments but it looks very good." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626481 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [08:32:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice! Couple of comments inline as I 'd like us to be able to gauge if this really helped with e.g. the throttling." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [08:32:14] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: first Puppet run, force IPv4 [puppet] - 10https://gerrit.wikimedia.org/r/626553 (https://phabricator.wikimedia.org/T262609) (owner: 10Volans) [08:34:12] (03CR) 10Volans: [C: 03+2] cli: add a -n/--no-colors option [software/cumin] - 10https://gerrit.wikimedia.org/r/626389 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [08:35:22] (03CR) 10Volans: [C: 03+2] cli: in dry-run send the list of hosts to stdout [software/cumin] - 10https://gerrit.wikimedia.org/r/626390 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [08:36:17] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10ayounsi) 05Resolved→03Open Re-opening that task to ACK the alert in Icinga, it has been cluttering the active alert list for 64 days :) https://icinga.wikimedia.org/cgi-bin/icinga... [08:36:29] (03Merged) 10jenkins-bot: cli: add a -n/--no-colors option [software/cumin] - 10https://gerrit.wikimedia.org/r/626389 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [08:37:41] (03Merged) 10jenkins-bot: cli: in dry-run send the list of hosts to stdout [software/cumin] - 10https://gerrit.wikimedia.org/r/626390 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [08:45:26] 10Operations, 10Puppet, 10Patch-For-Review: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10jbond) >>! In T262609#6452359, @MoritzMuehlenhoff wrote: > From what I can tell that's a known issue/longstanding behaviour with enabling the mapped addressed, t... 
[08:46:58] (03PS3) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [08:47:27] (03PS9) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) [08:48:06] (03CR) 10jerkins-bot: [V: 04-1] Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [08:48:32] (03CR) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [08:48:34] (03CR) 10jerkins-bot: [V: 04-1] role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [08:49:24] (03PS4) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [08:50:50] (03CR) 10Muehlenhoff: Add ::profile::base::linux419 to set up kernel 4.19 on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [08:51:44] (03PS5) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [08:52:14] (03CR) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [08:53:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) 
[08:54:51] (03PS6) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [08:56:14] (03PS1) 10Hashar: git: allow multiple calls to git::systemconfig [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) [08:57:24] (03CR) 10jerkins-bot: [V: 04-1] git: allow multiple calls to git::systemconfig [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [08:58:08] (03PS1) 10Volans: Add support for Python 3.8 [software/cumin] - 10https://gerrit.wikimedia.org/r/626599 [08:58:54] (03CR) 10Gehel: [C: 03+1] "LGTM for the parts I understand" [software/cumin] - 10https://gerrit.wikimedia.org/r/626599 (owner: 10Volans) [08:58:59] (03PS10) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) [09:00:23] (03CR) 10Hashar: "Thank you for the check, verification and revert!" [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:00:56] (03CR) 10Volans: [C: 03+2] Add support for Python 3.8 [software/cumin] - 10https://gerrit.wikimedia.org/r/626599 (owner: 10Volans) [09:02:16] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) @ayounsi - what exactly needs to be done from dc-ops? When I look at the Netbox accounting report, everything appears to be alerting because the accounting team hasn't upd... 
[09:05:34] (03CR) 10Hashar: "That is a redone of https://gerrit.wikimedia.org/r/c/operations/puppet/+/625848 , I had a look at:" [puppet] - 10https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:07:36] (03PS2) 10Hashar: git: allow multiple calls to git::systemconfig [take 2] [puppet] - 10https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) [09:08:52] 10Operations, 10User-jbond: Manage apt sources via puppet? - https://phabricator.wikimedia.org/T158562 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [09:09:08] 10Operations, 10User-jbond: Manage apt sources via puppet - https://phabricator.wikimedia.org/T158562 (10MoritzMuehlenhoff) p:05Low→03Medium [09:09:30] 10Operations: Provide failover capacity for package installations from main mirror - https://phabricator.wikimedia.org/T262647 (10MoritzMuehlenhoff) [09:10:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor syntax issue, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [09:13:29] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10ayounsi) I re-opened it instead of opening a new one to keep context. And It's better to discuss it than have that long standing alert. I think it depends on what we have control over, I... 
[09:13:59] (03PS7) 10JMeybohm: Add ::profile::base::linux419 to set up kernel 4.19 on stretch [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) [09:16:59] (03PS11) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) [09:17:45] (03CR) 10Jbond: "Thanks" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:19:36] (03PS1) 10Jcrespo: mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 [09:20:37] (03CR) 10jerkins-bot: [V: 04-1] mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 (owner: 10Jcrespo) [09:20:54] (03PS2) 10Jcrespo: mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 [09:20:59] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) Yes, my issue with this accounting report is that while it's useful to compare the discrepancies between Netbox and the Accounting spreadsheet, it's constantly red because we'... [09:21:46] (03CR) 10jerkins-bot: [V: 04-1] mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 (owner: 10Jcrespo) [09:22:16] (03PS3) 10Jcrespo: mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 [09:31:03] (03PS4) 10Jcrespo: mariadb.py: Remove redundant code already present on WMFMariaDB class [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626602 [09:31:05] (03PS1) 10Jcrespo: resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 [09:31:52] (03CR) 10jerkins-bot: [V: 04-1] resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 (owner: 10Jcrespo) [09:32:46] (03PS2) 10Jcrespo: resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 [09:33:17] (03PS1) 10Alexandros Kosiaris: Remove ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/626604 (https://phabricator.wikimedia.org/T187984) [09:35:15] (03PS1) 10Alexandros Kosiaris: Revert "Temporarily add ticket-test.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/626626 (https://phabricator.wikimedia.org/T187984) [09:35:24] (03CR) 10Marostegui: [C: 03+1] resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/626603 (owner: 10Jcrespo) [09:35:48] (03PS1) 10Muehlenhoff: Add CAS-enabled vhost for editors/admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [09:36:56] (03CR) 10jerkins-bot: [V: 04-1] Add CAS-enabled vhost for editors/admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [09:39:34] (03CR) 10JMeybohm: "Thanks. I will not merge this for now as we first need the backported buster version of rasdaemon back." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: 10JMeybohm) [09:42:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/626604 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [09:42:56] (03PS2) 10Muehlenhoff: Add CAS-enabled vhost for editors/admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [09:43:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] Revert "Temporarily add ticket-test.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/626626 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [09:44:04] (03CR) 10jerkins-bot: [V: 04-1] Add CAS-enabled vhost for editors/admins (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: 10Muehlenhoff) [09:44:08] (03PS3) 10Jcrespo: resolve: Allow connections with :
, in addition to port [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/626603 [09:45:01] (CR) Jbond: "See comments inline" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [09:45:42] (CR) Jbond: "forgot to say this change doesn't actually add" [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [09:48:51] (PS1) Alexandros Kosiaris: Switch ticket.discovery.wmnet to otrs1001 [dns] - https://gerrit.wikimedia.org/r/626629 (https://phabricator.wikimedia.org/T187984) [09:49:08] (PS3) Muehlenhoff: Add CAS-enabled vhost for editors/admins (WIP) [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [09:50:30] (PS1) Alexandros Kosiaris: exim: Switch OTRS exim to otrs1001 [puppet] - https://gerrit.wikimedia.org/r/626630 (https://phabricator.wikimedia.org/T187984) [09:53:30] (PS2) Alexandros Kosiaris: exim: Switch OTRS exim to otrs1001 [puppet] - https://gerrit.wikimedia.org/r/626630 (https://phabricator.wikimedia.org/T187984) [09:53:32] (PS1) Alexandros Kosiaris: Promote otrs1001 as the main otrs host [puppet] - https://gerrit.wikimedia.org/r/626631 (https://phabricator.wikimedia.org/T187984) [09:53:39] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [10:01:03] (PS19) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) [10:04:57] (CR) Ryan Kemper: elasticsearch: Let spicerack handle 
wait for all write queues to clear (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [10:07:09] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [10:10:41] Operations, observability, Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (MoritzMuehlenhoff) >>! In T205396#6451116, @JMeybohm wrote: > We would like to roll out Kernel 4.19 on some stretch hosts and if I got this right we will... [10:13:08] Operations, Patch-For-Review, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (MoritzMuehlenhoff) Deployment servers and some other cl... [10:14:21] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) Layout of the process for quick reference on Monday: * Disable puppet on mendelevium. Command for that `sudo disable-puppet "P... [10:14:47] (PS4) Ryan Kemper: elasticsearch: Store which dcs to query in class [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) [10:16:58] Operations, observability, Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (JMeybohm) >>! In T205396#6453580, @MoritzMuehlenhoff wrote: >>>! 
In T205396#6451116, @JMeybohm wrote: >> We would like to roll out Kernel 4.19 on some str... [10:17:11] (CR) Ryan Kemper: elasticsearch: Store which dcs to query in class (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [10:17:34] (CR) jerkins-bot: [V: -1] elasticsearch: Store which dcs to query in class [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [10:18:22] Operations, netops, Sustainability (Incident Followup): ospf link-protection - https://phabricator.wikimedia.org/T167306 (ayounsi) a:ayounsi [10:34:11] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:17] PROBLEM - OTRS SMTP on otrs1001 is CRITICAL: connect to address 10.64.16.39 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [10:39:22] akoopal: ^^^ [10:39:29] oops, was for akosiaris [10:40:07] (PS1) Muehlenhoff: Add IDP service registration for grafana [puppet] - https://gerrit.wikimedia.org/r/626639 (https://phabricator.wikimedia.org/T262512) [10:55:52] !log starting snapshot of m2 from db1117 [10:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:22] (CR) Hnowlan: [C: +2] Api-gateway: various improvements [deployment-charts] - https://gerrit.wikimedia.org/r/626395 (owner: Ppchelko) [10:58:41] (Merged) jenkins-bot: Api-gateway: various improvements [deployment-charts] - https://gerrit.wikimedia.org/r/626395 (owner: Ppchelko) [11:05:35] volans: yeah, scheduling downtime [11:07:45] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:07:48] (PS4) Muehlenhoff: 
Add CAS-enabled vhost for editors/admins (WIP) [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [11:08:09] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:08:09] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:08:27] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:09:03] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:09:11] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:09:53] jynus: that's you I assume? ^ [11:11:31] yes, it is db1117, which is probably getting saturated - I have seen that before, so not an issue [11:12:31] netbox is me, fixing [11:12:39] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:12:55] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:13:05] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:13:35] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:15:11] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) >>! In T187984#6435616, @jcrespo wrote: > @akosiaris This is my proposal, db-wise: > > * Disable https://ticket-test.wikimedia... 
[11:15:55] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:15:56] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:18:13] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:18:55] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:19:32] Operations, ops-eqiad, DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (faidon) Broadly speaking: - We shouldn't have outstanding alerts open (or even acknowledged) for more than a few days. If there is an alert, it means there is an abnormal condition that r... [11:19:39] (PS1) Muehlenhoff: Streamline Hiera settings [puppet] - https://gerrit.wikimedia.org/r/626647 [11:20:51] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:26:37] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:27:59] Operations, Patch-For-Review, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (hashar) >>! In T262244#6453583, @MoritzMuehlenhoff wrot... 
[11:28:35] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:30:09] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:36:08] (PS1) Hnowlan: api-gateway: add Authorization header to access-control-allow-headers [deployment-charts] - https://gerrit.wikimedia.org/r/626649 (https://phabricator.wikimedia.org/T254906) [11:39:02] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25036/" [puppet] - https://gerrit.wikimedia.org/r/626647 (owner: Muehlenhoff) [11:42:26] (CR) Hnowlan: [C: +2] api-gateway: add Authorization header to access-control-allow-headers [deployment-charts] - https://gerrit.wikimedia.org/r/626649 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [11:43:41] (Merged) jenkins-bot: api-gateway: add Authorization header to access-control-allow-headers [deployment-charts] - https://gerrit.wikimedia.org/r/626649 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [11:45:38] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/626639 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [11:45:42] (CR) JMeybohm: "> Patch Set 7:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: JMeybohm) [11:45:58] (PS5) Muehlenhoff: Add CAS-enabled vhost for editors/admins [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [11:48:40] (PS1) BBlack: geo-maps: Update for various WMCS spaces [dns] - https://gerrit.wikimedia.org/r/626654 [11:49:06] (CR) Alexandros Kosiaris: [C: +1] "https://puppet-compiler.wmflabs.org/compiler1002/25039/ is happy as well, +1" [puppet] - https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: JMeybohm) 
[11:49:08] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [11:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:09] (PS6) Muehlenhoff: Add CAS-enabled vhost for editors/admins [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [11:50:19] (CR) Muehlenhoff: [C: +1] "Ship it" [puppet] - https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: JMeybohm) [11:50:35] (PS1) BBlack: discovery-map: update geoip for wmcs spaces [puppet] - https://gerrit.wikimedia.org/r/626655 [11:52:09] (CR) JMeybohm: [C: +2] "🚢" [puppet] - https://gerrit.wikimedia.org/r/626592 (https://phabricator.wikimedia.org/T262527) (owner: JMeybohm) [11:53:52] (CR) Alexandros Kosiaris: [C: +1] "ah, nice! Better than the approach I had in mind!" [deployment-charts] - https://gerrit.wikimedia.org/r/625869 (owner: JMeybohm) [11:55:16] (CR) BBlack: [C: +2] geo-maps: Update for various WMCS spaces [dns] - https://gerrit.wikimedia.org/r/626654 (owner: BBlack) [11:55:40] (CR) JMeybohm: [C: +2] admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - https://gerrit.wikimedia.org/r/625869 (owner: JMeybohm) [11:57:09] (Merged) jenkins-bot: admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - https://gerrit.wikimedia.org/r/625869 (owner: JMeybohm) [12:01:39] (PS1) BBlack: geo-resources: create text-next for NEL [dns] - https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) [12:02:01] (CR) jerkins-bot: [V: -1] geo-resources: create text-next for NEL [dns] - https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) (owner: BBlack) [12:03:44] (CR) Alexandros Kosiaris: [C: +1] Remove etcd100[123] hosts [puppet] - 
https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835) (owner: JMeybohm) [12:04:44] (PS1) Hnowlan: api-gateway: correct cors http header list [deployment-charts] - https://gerrit.wikimedia.org/r/626657 (https://phabricator.wikimedia.org/T254906) [12:05:56] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25041/" [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [12:11:49] Operations, DNS, Traffic: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (BBlack) @Ckoerner - Sorry I missed this earlier! We (WMF SRE) can't create a `TXT` record for `diff.wikimedia.org` in our DNS due to limitations in how DNS itself works, combined w... [12:11:57] (CR) Hnowlan: [C: +2] api-gateway: correct cors http header list [deployment-charts] - https://gerrit.wikimedia.org/r/626657 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [12:12:19] (PS1) Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [12:13:13] (Merged) jenkins-bot: api-gateway: correct cors http header list [deployment-charts] - https://gerrit.wikimedia.org/r/626657 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan) [12:14:45] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [12:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:40] (CR) jerkins-bot: [V: -1] Extracting obvious reporting code to a Reporter class. [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: Gehel) [12:27:42] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[12:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:17] (CR) Jbond: "LGTM curious why port 8080 though?" [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [12:33:51] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32997840 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:39] (CR) Muehlenhoff: "Good catch, this was meant to use 80 actually, fixing." [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [12:34:53] (PS1) Hnowlan: api-gateway: add CORS header auth to all sections [deployment-charts] - https://gerrit.wikimedia.org/r/626662 [12:35:45] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2616 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:03] (PS7) Muehlenhoff: Add CAS-enabled vhost for editors/admins [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) [12:38:50] (PS2) Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [12:39:23] (CR) Hnowlan: [C: +2] api-gateway: add CORS header auth to all sections [deployment-charts] - https://gerrit.wikimedia.org/r/626662 (owner: Hnowlan) [12:40:37] (Merged) jenkins-bot: api-gateway: add CORS header auth to all sections [deployment-charts] - https://gerrit.wikimedia.org/r/626662 (owner: Hnowlan) [12:44:51] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[12:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:03] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:27] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [12:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:01] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [12:57:59] (PS2) Muehlenhoff: Add IDP service registration for grafana [puppet] - https://gerrit.wikimedia.org/r/626639 (https://phabricator.wikimedia.org/T262512) [13:02:15] (CR) Jbond: [C: +1] "lgtm see comment for future thoughts" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626639 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [13:02:45] Operations, SRE-swift-storage, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo) [13:03:59] (CR) Hashar: "I was so focusing on writing that dumb shell script that I forgot to think outside the box. The cat * is definitely easier to handle. 
Tha" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [13:04:06] Operations, SRE-swift-storage, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo) p:Triage→High [13:04:14] Operations, SRE-swift-storage, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo) [13:05:22] (PS3) Hashar: git: allow multiple calls to git::systemconfig [take 2] [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) [13:05:39] (PS3) Hashar: base: enable git protocol version2 fleet wide [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) [13:06:38] Operations, SRE-swift-storage, Goal: Plan logical and physical design for media backups - https://phabricator.wikimedia.org/T262669 (jcrespo) [13:06:45] Operations, SRE-swift-storage, Goal: Plan logical and physical design for media backups - https://phabricator.wikimedia.org/T262669 (jcrespo) p:Triage→High [13:08:01] (CR) jerkins-bot: [V: -1] base: enable git protocol version2 fleet wide [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [13:12:03] (PS5) Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - https://gerrit.wikimedia.org/r/612918 [13:19:45] Operations, Datasets-Archiving, Datasets-General-or-Unknown, Dumps-Generation: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001 (jcrespo) There is current ongoing discussion to setup offline backups of media originals at T262669. While public dump... 
[13:19:50] Operations, DBA, SRE-swift-storage, Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (jcrespo) [13:25:33] (CR) Hashar: [C: -1] "And the next change adds git::globalconfig for protocol v2 to profile::base but that fails with:" [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [13:37:32] (CR) Gehel: [C: -1] "see minor comments inline." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [13:38:55] (CR) Hashar: [C: -1] "I am such a spammer, I forgot to use compile.with_all_deps in the spec to validate that everything works fine together." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [13:40:09] (PS4) Hashar: git: allow multiple calls to git::systemconfig [take 2] [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) [13:40:13] Operations, Wikimedia-Mailing-lists: Create functionaries-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (Carn) [13:40:47] (PS4) Hashar: base: enable git protocol version2 fleet wide [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) [13:44:56] (CR) Gehel: [C: -1] "Minor comments inline. We're almost there!" 
(3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper) [13:46:56] (PS1) Jcrespo: mariadb-backups: Temporarilly disable logical backups of m2 on eqiad [puppet] - https://gerrit.wikimedia.org/r/626690 (https://phabricator.wikimedia.org/T187984) [13:49:53] (PS2) Jcrespo: mariadb-backups: Temporarilly disable logical backups of m2 on eqiad [puppet] - https://gerrit.wikimedia.org/r/626690 (https://phabricator.wikimedia.org/T187984) [13:50:18] (CR) Jcrespo: [C: +2] mariadb-backups: Temporarilly disable logical backups of m2 on eqiad [puppet] - https://gerrit.wikimedia.org/r/626690 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo) [13:51:09] (PS1) Jcrespo: Revert "mariadb-backups: Temporarilly disable logical backups of m2 on eqiad" [puppet] - https://gerrit.wikimedia.org/r/626669 (https://phabricator.wikimedia.org/T187984) [13:51:33] (CR) Jcrespo: [C: -2] "Not yet." 
[puppet] - https://gerrit.wikimedia.org/r/626669 (https://phabricator.wikimedia.org/T187984) (owner: Jcrespo) [13:53:09] Operations, ops-codfw, Cloud-VPS, DC-Ops, cloud-services-team (Kanban): labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (Papaul) [13:56:32] (CR) Hashar: "recheck CI job got updated to a newer gdnsd ( https://gerrit.wikimedia.org/r/c/integration/config/+/626508/1 )" [dns] - https://gerrit.wikimedia.org/r/626656 (https://phabricator.wikimedia.org/T261340) (owner: BBlack) [13:58:24] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:59:42] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:03:46] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui) [14:06:56] (PS1) Muehlenhoff: Manage /etc/apt/sources.list via Puppet (WIP) [puppet] - https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) [14:08:03] (CR) jerkins-bot: [V: -1] Manage /etc/apt/sources.list via Puppet (WIP) [puppet] - https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: Muehlenhoff) [14:14:15] (PS2) Muehlenhoff: Manage /etc/apt/sources.list via Puppet (WIP) [puppet] - https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) [14:28:24] (CR) Jbond: "> Patch Set 3: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [14:28:45] (CR) Jbond: [C: +1] "LGTM will merge" [puppet] - 
https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [14:28:51] (CR) Jbond: [C: +2] git: allow multiple calls to git::systemconfig [take 2] [puppet] - https://gerrit.wikimedia.org/r/626598 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [14:33:46] (PS5) Jbond: base: enable git protocol version2 fleet wide [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [14:34:24] (CR) Jbond: "LGTM will merge on monday" [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [14:37:29] (CR) Muehlenhoff: "The backport of 2.20 is only for stretch (will roll out the rest on Monday), but this would also apply to jessie? I guess we need some OS " [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [14:45:20] (PS1) Jgiannelos: Release version 2020-09-11-113122-publish of push-notifications [deployment-charts] - https://gerrit.wikimedia.org/r/626699 [14:47:28] (CR) MSantos: [C: +2] Release version 2020-09-11-113122-publish of push-notifications [deployment-charts] - https://gerrit.wikimedia.org/r/626699 (owner: Jgiannelos) [14:48:44] (Merged) jenkins-bot: Release version 2020-09-11-113122-publish of push-notifications [deployment-charts] - https://gerrit.wikimedia.org/r/626699 (owner: Jgiannelos) [14:51:33] Just FYI, we are about to deploy a change to push-notifications service. The service doesn't serve production traffic nor is used by any app or service. [14:51:41] It's a change to test proxy configuration [14:52:37] cc/ akosiaris and joe any questions or concerns? 
[14:52:51] mateusbs17: go ahead :) [14:53:04] Thanks jayme [14:54:32] PROBLEM - Stale file for node-exporter textfile in eqiad on icinga1001 is CRITICAL: cluster=misc file=smartmon.prom instance=relforge1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [14:56:37] (CR) Herron: [C: +2] kibana: enable logging by default [puppet] - https://gerrit.wikimedia.org/r/626391 (owner: Herron) [15:05:03] Hey jayme, is there a documentation about deployment for k8s? I just noticed our documentation baseline might be outdated https://www.mediawiki.org/wiki/Wikifeeds/Deployment_Process [15:05:18] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) [15:06:22] mateusbs17: oh yeah. That one definitely is. Please take a look at https://wikitech.wikimedia.org/wiki/Deployments_on_kubernetes#Deploying_with_helmfile [15:07:39] jayme: thanks! [15:11:19] (PS30) Volans: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [15:12:33] !log jgiannelos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [15:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:03] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (jcrespo) > * At some point before the maintenance, clone only otrs database into db1077 again and make it replicate from m2 primary (db110... 
[15:25:54] (PS2) Muehlenhoff: Support password changes in manage_principals (WIP) [puppet] - https://gerrit.wikimedia.org/r/559765 (https://phabricator.wikimedia.org/T237605) [15:28:38] (PS1) Andrew Bogott: Add wmcs-drain-hypervisor admin script [puppet] - https://gerrit.wikimedia.org/r/626704 [15:31:02] (CR) Andrew Bogott: [C: +2] Add wmcs-drain-hypervisor admin script [puppet] - https://gerrit.wikimedia.org/r/626704 (owner: Andrew Bogott) [15:32:33] (PS3) Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [15:37:53] (PS1) Papaul: DNS: Add production DNS for maps200[5-9] and maps2010 [dns] - https://gerrit.wikimedia.org/r/626707 [15:40:01] (CR) Papaul: [C: +2] DNS: Add production DNS for maps200[5-9] and maps2010 [dns] - https://gerrit.wikimedia.org/r/626707 (owner: Papaul) [15:42:25] Operations, ops-codfw, DC-Ops, Maps, Patch-For-Review: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) [15:47:09] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) I went into system config and found `Physical Disk 00:01:12: SSD, SATA, 446.625GB, Ready, (512B)`, but in theory we have... [15:56:27] (PS3) Volans: sre.ganeti.makevm: adapt to Netbox DNS automation [cookbooks] - https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) [15:58:40] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:01:04] (PS1) Volans: scripts: enable primary IPs options [software/netbox-extras] - https://gerrit.wikimedia.org/r/626712 (https://phabricator.wikimedia.org/T244153) [16:01:45] andrewbogott: is that you (puppet merge pending) $$$ [16:01:48] *^^^ [16:01:58] (CR) CRusnov: "This change is ready for review." [software/netbox-extras] - https://gerrit.wikimedia.org/r/626713 (https://phabricator.wikimedia.org/T262674) (owner: CRusnov) [16:03:53] (CR) CRusnov: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/626712 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [16:04:21] (CR) Volans: "LGTM, one question inline" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/626713 (https://phabricator.wikimedia.org/T262674) (owner: CRusnov) [16:04:57] volans: probably. Feel free to merge or I will do it in a few. [16:05:12] it's not blocking me, just icinga complaining ;) [16:07:58] volans: merged! My wifi went down between me submitting and me merging but now it seems to be back [16:08:20] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:08:47] (CR) CRusnov: [C: +2] "THanks :)" (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/626713 (https://phabricator.wikimedia.org/T262674) (owner: CRusnov) [16:08:55] andrewbogott: thx, np [16:15:08] (PS1) Tchanders: Enable Special:Investigate on itwiki, eswiki and svwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/626715 (https://phabricator.wikimedia.org/T262436) [16:17:19] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (Papaul) here is what the Controller is showing {F32270656} [16:30:17] Operations, ops-eqiad, Analytics-Radar, DC-Ops: an-worker11[02-17] have only one SSD in the flexbay - https://phabricator.wikimedia.org/T262690 (elukey) [16:31:11] Operations, ops-eqiad, Analytics-Radar, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) Open→Stalled Pending T262690 [16:32:16] Operations, ops-eqiad, Analytics-Radar, DC-Ops: an-worker11[02-17] have only one SSD in the flexbay - https://phabricator.wikimedia.org/T262690 (RobH) Confirmed this is indeed the case. It appears that on the ordering task T246784, packing slips were not included in the shipment, and instead had... [16:35:41] (PS1) Dzahn: add new parse* servers to $wgLinterSubmitterWhitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/626719 (https://phabricator.wikimedia.org/T247441) [16:36:13] * James_F grins at mutante. 
[16:36:33] (03CR) 10jerkins-bot: [V: 04-1] add new parse* servers to $wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626719 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [16:38:54] (03PS2) 10Dzahn: add new parse* servers to $wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626719 (https://phabricator.wikimedia.org/T247441) [16:45:44] James_F: is wgLinterSubmitterWhitelist a new thing? would I be right assuming it can be added anytime including before servers are getting traffic? it sure sounds like it [16:48:27] it's been there since commit 1 in 2016, https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Linter/+/bce5b3161681c4304a818b93124c8b95dce319ce%5E%21/includes/ApiRecordLint.php [16:48:38] effectively uses an in_array check [16:48:45] so it may contain unknown stuff indeed [16:49:14] thanks, ack [16:49:19] (03PS1) 10Dzahn: add new parsoid servers to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/626721 (https://phabricator.wikimedia.org/T247441) [16:49:33] then i'll add it to a swat [16:56:20] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:56:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 238, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:57:03] (03PS1) 10Jbond: wmflib::debian::version: update the os_version [puppet] - 10https://gerrit.wikimedia.org/r/626723 [16:57:45] (03CR) 10jerkins-bot: [V: 04-1] wmflib::debian::version: update the os_version [puppet] - 10https://gerrit.wikimedia.org/r/626723 (owner: 10Jbond) [17:00:45] 10Operations, 10Wikimedia-Mailing-lists: Create ruwikipedia-arbcom Mailing list. 
- https://phabricator.wikimedia.org/T262525 (10Carn) [17:01:22] (03PS2) 10Jbond: wmflib::debian::version: update the os_version [puppet] - 10https://gerrit.wikimedia.org/r/626723 [17:04:35] 10Operations, 10Wikimedia-Mailing-lists: Create ruwikipedia-arbcom Mailing list. - https://phabricator.wikimedia.org/T262525 (10Carn) It seems that creation of broad mailing list may need to community aproval first, but we do need ability to get letters for all ArbCom on one fixed adress, so I changed the task... [17:05:56] (03CR) 10Dzahn: [V: 03+1] "this is really just a renaming now" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [17:06:17] (03PS2) 10Dzahn: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - 10https://gerrit.wikimedia.org/r/624335 [17:08:40] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Elasticsearch: Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10kaldari) [17:09:32] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/25043/puppetmaster1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/624335 (owner: 10Dzahn) [17:13:05] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10Dbrant) [17:16:31] (03PS31) 10CRusnov: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [17:17:07] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) [17:19:20] (03PS2) 10Dzahn: service::node: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624346 [17:21:50] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 56.41 le 60 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:22:45] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:22:46] how do you guys get reviews? do you ping directly on IRC? manual emails? [17:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:37] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Elasticsearch: Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10Majavah) possible duplicate of T262691 [17:26:51] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:49] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:56] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:08] 10Operations, 10serviceops, 10Patch-For-Review: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (10Dzahn) p:05Medium→03High [17:34:56] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) [17:39:14] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 73.97 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:39:55] 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) Hi @kfrancis is NDA on file for Michael? 
[17:41:36] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-en Mailing list. - https://phabricator.wikimedia.org/T262525 (10Carn) [17:41:55] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Elasticsearch: Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10EBernhardson) Looking over some things: 16:44 Pool counter rejections for full text search start 16:45 full text QPS increased from 590 to 1100. 16:47 Pool counte... [17:45:08] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-en Mailing list. - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) [17:45:22] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-en Mailing list. - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) [17:46:01] 10Operations, 10DNS, 10Traffic: Verify diff.wikimedia.org ownership for Facebook - https://phabricator.wikimedia.org/T259807 (10CKoerner_WMF) 05Open→03Invalid Ah, ok. That's something I can take up with VIP. Thank you for getting back to me. [17:46:04] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Carn) [17:46:32] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Dzahn) We can create a mailing list on wikimedia servers or you can keep using google groups. Forwarding from our list server to Google though would be something completely new which we would... [17:50:45] 10Operations, 10Wikimedia-Mailing-lists: Create Wikimedia-BPY Mailing list - https://phabricator.wikimedia.org/T262165 (10Dzahn) "You have successfully created the mailing list wikimedia-bpy and notification has been sent to the list owner usingha@gmail.com. You can now: Visit the [[ https://lists.wikimedia.o... 
[17:52:11] (03CR) 10MusikAnimal: [C: 04-1] Manage /etc/apt/sources.list via Puppet (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [17:54:08] 10Operations, 10Wikimedia-Mailing-lists: Create Wikimedia-BPY Mailing list - https://phabricator.wikimedia.org/T262165 (10Dzahn) 05Open→03Resolved a:03Dzahn @Usingha The list has been created. You should have received mail with an initial random password. See the links above. The password should let you... [17:58:38] (03PS1) 10Legoktm: libraryupgrader: Set Restart=on-failure for libup-web service [puppet] - 10https://gerrit.wikimedia.org/r/626728 (https://phabricator.wikimedia.org/T262228) [18:00:10] 10Operations, 10Wikimedia-Mailing-lists: Create Wikimedia-BPY Mailing list - https://phabricator.wikimedia.org/T262165 (10Usingha) @Dzahn - Thank you so much! It was a great help. - Uttam Singha [18:00:36] 10Operations, 10Wikimedia-Mailing-lists: Wikimedia-RU mailing list page has wrong encoding - https://phabricator.wikimedia.org/T135226 (10Carn) [18:03:16] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) [18:07:42] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) Hi @Dzahn! I'm also the current arbitrator of the Russian Wikipedia and I support the request. Yes, the proposed scheme will suit us well. I've also seen that English Wikipedia Ar... [18:10:14] (03PS3) 10Ppchelko: Allow public access to API Portal main page for private launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) (owner: 10Cicalese) [18:11:40] (03CR) 10Ppchelko: [C: 03+1] "Good. Let's put it in a deploy window on Mon?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) (owner: 10Cicalese) [18:13:59] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) [18:14:59] 10Operations, 10Page Content Service, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, and 2 others: page/summary and page/mobile-html keeps responding code 429 in zh.wiki - https://phabricator.wikimedia.org/T262705 (10Mholloway) You're being rate-limited by something. Is this something user... [18:19:53] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [18:19:56] 10Operations, 10Page Content Service, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, and 2 others: page/summary and page/mobile-html keeps responding code 429 in zh.wiki - https://phabricator.wikimedia.org/T262705 (10Pchelolo) cc @Joe ratelimit was added to mobile-html with non-standard l... [18:21:03] 10Operations, 10Page Content Service, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, and 2 others: page/summary and page/mobile-html keeps responding code 429 in zh.wiki - https://phabricator.wikimedia.org/T262705 (10cooltey) Hi @Mholloway Yes, we have received two reports from the `zh.wi... [18:21:41] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10Dbrant) [18:21:42] 10Operations, 10Page Content Service, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, and 2 others: page/summary and page/mobile-html keeps responding code 429 in zh.wiki - https://phabricator.wikimedia.org/T262705 (10Dbrant) [18:23:05] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Dzahn) >>! 
In T262525#6454852, @Adamant.pwn wrote: > Hi @Dzahn! I'm also the current arbitrator of the Russian Wikipedia and I support the request. Yes, the proposed scheme will suit us well.... [18:28:55] (03CR) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet (WIP) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T156562) (owner: 10Muehlenhoff) [18:29:23] (03PS3) 10Muehlenhoff: Manage /etc/apt/sources.list via Puppet (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/626693 (https://phabricator.wikimedia.org/T158562) [18:29:46] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikipedia-Android-App-Backlog: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10cooltey) Thanks for filing this @LGoto I have replied to the zh.wiki users who reported this issue. [18:30:47] (03PS3) 10Andrew Bogott: scap: add support for .eqiad1.wikimedia.cloud targets [puppet] - 10https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) [18:32:10] (03CR) 10Andrew Bogott: [C: 03+2] scap: add support for .eqiad1.wikimedia.cloud targets [puppet] - 10https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [18:44:41] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 2 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10cooltey) [18:50:01] Wikipedia can't be found [18:50:12] "We can’t connect to the server at en..wikipedia.org." [18:50:15] oh [18:50:16] LOL [18:50:20] I'm an idiot, don't mind me [18:50:29] ugh :P [18:55:13] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 2 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10Mholloway) T262705#6454901 [19:13:25] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru Mailing list. 
- https://phabricator.wikimedia.org/T262525 (10Carn) Yes, we wouldn't have many users in this group, so mailman mailing list functionality may be excessive for us. Thank you very much, Dzahn, for your help so that we get exactly what we... [19:15:31] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10Carn) [19:19:30] 10Operations, 10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10Dzahn) >>! In T262525#6455154, @Carn wrote: > Yes, we wouldn't have many users in this group, so mailman mailing list functionality may be excessive for us. I don't think one or the other is... [19:31:13] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 3 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10Reedy) [19:35:25] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) [19:36:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [19:37:16] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [19:37:42] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [19:39:09] 10Operations, 10ops-eqiad, 10DC-Ops: Thur, Sept 17 PDU Upgrade 12pm-4pm UTC- Rack C1 (Fundraising) - https://phabricator.wikimedia.org/T261458 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [19:39:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- 
Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 (10wiki_willy) [19:44:43] (03PS4) 10Volans: sre.ganeti.makevm: adapt to Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) [19:44:45] (03PS1) 10Volans: sre.hosts.decommission: add Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/626738 (https://phabricator.wikimedia.org/T258729) [20:08:44] RECOVERY - Host ps1-d3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.50 ms [20:08:44] RECOVERY - Host ps1-d4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms [20:12:51] (03PS1) 10Ashot1997: Add logo Wordmark and Tagline for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626740 (https://phabricator.wikimedia.org/T259985) [20:12:52] PROBLEM - ps1-d4-eqiad-infeed-load-tower-B-phase-Y on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:52] PROBLEM - ps1-d4-eqiad-infeed-load-tower-A-phase-Y on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:52] PROBLEM - ps1-d3-eqiad-infeed-load-tower-A-phase-Y on ps1-d3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:56] PROBLEM - ps1-d4-eqiad-infeed-load-tower-A-phase-Z on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:56] PROBLEM - ps1-d3-eqiad-infeed-load-tower-B-phase-Y on ps1-d3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:56] PROBLEM - ps1-d3-eqiad-infeed-load-tower-A-phase-X on 
ps1-d3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:13:02] PROBLEM - ps1-d3-eqiad-infeed-load-tower-B-phase-X on ps1-d3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:13:32] PROBLEM - ps1-d4-eqiad-infeed-load-tower-B-phase-X on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:04] PROBLEM - ps1-d3-eqiad-infeed-load-tower-A-phase-Z on ps1-d3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:06] PROBLEM - ps1-d4-eqiad-infeed-load-tower-B-phase-Z on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:26] PROBLEM - ps1-d4-eqiad-infeed-load-tower-A-phase-X on ps1-d4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:30] PROBLEM - ps1-d3-eqiad-infeed-load-tower-B-phase-Z on ps1-d3-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:15:53] (03PS2) 10Ashot1997: Add logo Wordmark and Tagline for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626740 (https://phabricator.wikimedia.org/T259985) [20:16:00] ^ there's no expected power work on D3/D4 today, right? [20:16:01] D3: test - ignore - https://phabricator.wikimedia.org/D3 [20:17:48] not sure. 
made https://phabricator.wikimedia.org/T262629 for that yesterday [20:18:00] ah nod [20:18:41] according to the schedule D3/D4 were September 8 and I'm pretty sure they finished on schedule [20:19:05] wiki_willy: ^ check me? [20:19:51] hey rzl....that's correct, it was completed on tue [20:20:00] okay so those alerts are probably genuine then [20:20:07] something left in a bad state after the work maybe? [20:20:08] but rob is messing around with the librenms alerts that they're throwing in netbox [20:20:13] ahh okay cool [20:20:37] he's basically trying to figure out how to get rid of these - https://netbox.wikimedia.org/extras/reports/librenms.LibreNMS/ [20:20:47] well, i ditched the old errors willy [20:20:55] now have new ones but it's because the software side wasn't done [20:21:03] gotcha [20:21:05] but yeah, the above errors are me and can be ignored [20:21:15] sweet, thanks [20:21:22] they were offline, now they are online but the checks time out due to being for old version of the host [20:21:39] it'll clear up in a bit, but for now ignore them (unless you see hosts go offline, then i messed up ; ) [20:21:43] robh: when you have a minute can you drop a note on T262629 for posterity? 
[20:21:44] T262629: icinga alert cleanup for power switches in eqiad - https://phabricator.wikimedia.org/T262629 [20:22:17] i guess, ill just resolve it now [20:22:23] because this is really tied to the racking task [20:22:41] sure, seems good [20:22:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Icinga: icinga alert cleanup for power switches in eqiad - https://phabricator.wikimedia.org/T262629 (10RobH) 05Open→03Resolved a:03RobH im working on clearing all the alerts via the racking task T261452 [20:23:01] i wont recall to touch another task later, so clearing that now [20:25:05] (03PS1) 10RobH: updating ps1-d[34]-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/626744 (https://phabricator.wikimedia.org/T261452) [20:25:34] (03CR) 10jerkins-bot: [V: 04-1] updating ps1-d[34]-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/626744 (https://phabricator.wikimedia.org/T261452) (owner: 10RobH) [20:26:51] (03PS2) 10RobH: updating ps1-d[34]-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/626744 (https://phabricator.wikimedia.org/T261452) [20:27:18] (03CR) 10jerkins-bot: [V: 04-1] updating ps1-d[34]-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/626744 (https://phabricator.wikimedia.org/T261452) (owner: 10RobH) [20:28:03] (03PS3) 10RobH: updating ps1-d[34]-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/626744 (https://phabricator.wikimedia.org/T261452) [20:28:04] bah indentation bane [20:28:29] (03CR) 10RobH: [C: 03+2] updating ps1-d[34]-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/626744 (https://phabricator.wikimedia.org/T261452) (owner: 10RobH) [20:37:42] (03CR) 10CRusnov: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/626738 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [20:40:13] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10RobH) [20:42:08] RECOVERY - ps1-d3-eqiad-infeed-load-tower-A-phase-X on ps1-d3-eqiad 
is OK: SNMP OK - ps1-d3-eqiad-infeed-load-tower-A-phase-X 274 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:08] RECOVERY - ps1-d3-eqiad-infeed-load-tower-A-phase-Y on ps1-d3-eqiad is OK: SNMP OK - ps1-d3-eqiad-infeed-load-tower-A-phase-Y 259 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:08] RECOVERY - ps1-d3-eqiad-infeed-load-tower-A-phase-Z on ps1-d3-eqiad is OK: SNMP OK - ps1-d3-eqiad-infeed-load-tower-A-phase-Z 286 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:08] RECOVERY - ps1-d3-eqiad-infeed-load-tower-B-phase-X on ps1-d3-eqiad is OK: SNMP OK - ps1-d3-eqiad-infeed-load-tower-B-phase-X 275 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:08] RECOVERY - ps1-d3-eqiad-infeed-load-tower-B-phase-Z on ps1-d3-eqiad is OK: SNMP OK - ps1-d3-eqiad-infeed-load-tower-B-phase-Z 280 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:08] RECOVERY - ps1-d3-eqiad-infeed-load-tower-B-phase-Y on ps1-d3-eqiad is OK: SNMP OK - ps1-d3-eqiad-infeed-load-tower-B-phase-Y 224 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:25] (03CR) 10Volans: [C: 03+2] "As agreed on IRC, the current version could use some refactor/improvements and also simplifications once the first mass import in Netbox w" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [20:42:34] RECOVERY - ps1-d4-eqiad-infeed-load-tower-A-phase-X on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-A-phase-X 154 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:34] RECOVERY - ps1-d4-eqiad-infeed-load-tower-A-phase-Y on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-A-phase-Y 200 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:34] RECOVERY - ps1-d4-eqiad-infeed-load-tower-A-phase-Z on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-A-phase-Z 192 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:34] RECOVERY - ps1-d4-eqiad-infeed-load-tower-B-phase-X on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-B-phase-X 173 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:34] RECOVERY - ps1-d4-eqiad-infeed-load-tower-B-phase-Y on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-B-phase-Y 175 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:42:34] RECOVERY - ps1-d4-eqiad-infeed-load-tower-B-phase-Z on ps1-d4-eqiad is OK: SNMP OK - ps1-d4-eqiad-infeed-load-tower-B-phase-Z 206 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:41] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (10RobH) [20:56:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C4 and C5 - https://phabricator.wikimedia.org/T261456 (10RobH) [20:57:20] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Thur, Sept 17: PDU Upgrade 12pm-4pm UTC- Racks D1 and D2 - https://phabricator.wikimedia.org/T261459 (10RobH) [20:57:34] 10Operations, 10ops-eqiad, 10DC-Ops: Thur, Sept 17 PDU Upgrade 12pm-4pm UTC- Rack C1 (Fundraising) - https://phabricator.wikimedia.org/T261458 (10RobH) [20:57:51] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (10RobH) [20:58:00] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Wed, Sept 16 PDU Upgrade 12pm-4pm UTC- Racks C6 and C7 - https://phabricator.wikimedia.org/T261457 (10RobH) [21:00:13] 
10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10RobH) [21:02:14] (03CR) 10Volans: [C: 03+1] "Thanks for resuming this work! The change seems a good starting point for an iterative approach to move towards a more structured handling" (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [21:02:49] (03CR) 10Hashar: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [21:04:51] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10wiki_willy) [21:17:27] (03CR) 10Dzahn: [C: 03+2] libraryupgrader: Set Restart=on-failure for libup-web service [puppet] - 10https://gerrit.wikimedia.org/r/626728 (https://phabricator.wikimedia.org/T262228) (owner: 10Legoktm) [21:21:38] (03PS9) 10Dzahn: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [21:26:04] (03PS1) 10Hashar: deployment: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) [21:26:08] (03PS1) 10Hashar: phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) [21:27:08] (03CR) 10jerkins-bot: [V: 04-1] deployment: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [21:27:10] do you use puppet compiler with class names instead of hostnames? 
sometimes it works fine for me to use "C:classname" but it's kind of undocumented on wikitech [21:27:26] (03CR) 10jerkins-bot: [V: 04-1] phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [21:28:11] and here i have a case again where it doesn't find any and before i thought it's just missing facts. but right now i use "C:scap::target" and that is an oooold class [21:28:35] (03PS2) 10Hashar: deployment: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) [21:29:14] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (10hashar) **status update** @MoritzMuehlenhoff started u... [21:29:40] (03CR) 10jerkins-bot: [V: 04-1] deployment: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [21:31:58] (03PS3) 10Hashar: deployment: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) [21:33:35] (03PS2) 10Hashar: phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) [21:34:42] (03CR) 10jerkins-bot: [V: 04-1] phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [21:35:03] ah..duh.. it's a defined type, not a class. 
of course [21:37:13] (03PS3) 10Hashar: phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) [21:48:13] (03CR) 10Dzahn: [C: 04-1] "Compiled it on a bunch of hosts and it would change requires on production hosts. for example:" [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [21:51:27] (03PS10) 10Paladox: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 [21:53:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unex [21:53:34] (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:57:28] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:57:56] (03PS11) 10Paladox: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 [21:58:13] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [21:59:20] (03CR) 10jerkins-bot: [V: 04-1] scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [22:00:12] (03PS12) 10Paladox: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 [22:00:23] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [22:03:12] 10Operations, 
10Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (10Adamant.pwn) Welp, there's one more question. You said we should contact ITS team, which is former #WMF-Office-IT, I guess? Project's description says we should contact ITS on https://office.w... [22:06:55] 10Operations, 10Beta-Cluster-Infrastructure, 10observability: Beta puppet patch "prometheus: make ferm DNS record type configurable" - https://phabricator.wikimedia.org/T244624 (10dpifke) 05Open→03Resolved a:03dpifke Calling this resolved, as it's been a week and nobody has missed the associated patch. [22:08:48] (03PS1) 10Dave Pifke: [WIP] arclamp: fix spurious "environment changed" [puppet] - 10https://gerrit.wikimedia.org/r/626779 (https://phabricator.wikimedia.org/T244776) [22:16:54] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10wiki_willy) For the last couple bullets, we (dc-ops) own that, but typically make the change after Accounting has their spreadsheet updated (when we move the <$1k assets to the top of the... [22:18:04] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Elasticsearch: Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10RKemper) Following up here: Erik and I dug in here and we believe that the [[ https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?query=ff&editPane... 
[22:19:11] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/25046/" [puppet] - https://gerrit.wikimedia.org/r/626035 (owner: Paladox)
[22:22:05] (PS2) Dave Pifke: arclamp: fix spurious "environment changed" [puppet] - https://gerrit.wikimedia.org/r/626779 (https://phabricator.wikimedia.org/T244776)
[22:26:20] (PS3) Dave Pifke: arclamp: fix spurious "environment changed" [puppet] - https://gerrit.wikimedia.org/r/626779 (https://phabricator.wikimedia.org/T244776)
[22:29:36] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (KFrancis) Hi @Dzahn, I'm confirming we have an NDA on file for Michael. Thanks!
[22:32:09] (PS13) Paladox: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - https://gerrit.wikimedia.org/r/626035
[22:35:50] (PS14) Paladox: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - https://gerrit.wikimedia.org/r/626035
[22:38:09] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/626035 (owner: Paladox)
[22:54:26] !log starting people2001 VM
[22:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:34] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[23:06:08] (PS1) Thcipriani: gitlab-test: pam access for ssh for git user [puppet] - https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516)
[23:06:36] ^ brennen tada...maybe :)
[23:07:11] (CR) jerkins-bot: [V: -1] gitlab-test: pam access for ssh for git user [puppet] -
https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516) (owner: Thcipriani)
[23:08:19] (PS2) Thcipriani: gitlab-test: pam access for ssh for git user [puppet] - https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516)
[23:09:14] thcipriani: nice sleuthing
[23:11:07] I do wonder if there's a better place to put that, but that should work if it merges.
[23:18:36] (PS1) Dave Pifke: New address for wpt-enwiki.wmftest.org [dns] - https://gerrit.wikimedia.org/r/626786
[23:22:32] (CR) BryanDavis: gitlab-test: pam access for ssh for git user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516) (owner: Thcipriani)
[23:26:32] (PS3) Thcipriani: gitlab-test: pam access for ssh for git user [puppet] - https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516)
[23:31:45] (CR) BryanDavis: [C: +1] gitlab-test: pam access for ssh for git user [puppet] - https://gerrit.wikimedia.org/r/626785 (https://phabricator.wikimedia.org/T262516) (owner: Thcipriani)
[23:37:51] (CR) Dzahn: [C: +2] "thank you! this should help to resolve Icinga alerts about nodes doing things on each puppet run. I will confirm it on webperf*" [puppet] - https://gerrit.wikimedia.org/r/626779 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[23:44:00] PROBLEM - ping-offload grafana alert on icinga1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback.
https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/
[23:45:40] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[23:46:00] RECOVERY - ping-offload grafana alert on icinga1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/
[23:46:20] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops