[00:01:44] (03CR) 10Papaul: [C: 03+2] DNS: Add DNS for db213[6-9] and db2140 [dns] - 10https://gerrit.wikimedia.org/r/596071 (owner: 10Papaul) [00:04:04] (03PS2) 10Ryan Kemper: sre.wdqs.data-transfer: use proper systemctl path [cookbooks] - 10https://gerrit.wikimedia.org/r/596073 (https://phabricator.wikimedia.org/T206951) [00:05:23] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [00:15:47] (03CR) 10EBernhardson: [C: 03+1] sre.wdqs.data-transfer: use proper systemctl path [cookbooks] - 10https://gerrit.wikimedia.org/r/596073 (https://phabricator.wikimedia.org/T206951) (owner: 10Ryan Kemper) [00:19:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10KFrancis) @Miriam Ah, yes, sorry, I clearly have pandemic brain. I am writing to confirm Daniele Rama's NDA is covered under the signed ag... 
[00:23:51] (03CR) 10Ryan Kemper: [C: 03+2] sre.wdqs.data-transfer: use proper systemctl path [cookbooks] - 10https://gerrit.wikimedia.org/r/596073 (https://phabricator.wikimedia.org/T206951) (owner: 10Ryan Kemper) [00:25:21] (03PS1) 10Volans: decorators: fix newly reported flake8 issue [software/spicerack] - 10https://gerrit.wikimedia.org/r/596074 [00:25:23] (03PS1) 10Volans: ipmi: fix subprocess.run calls to raise on failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/596075 [00:25:25] (03PS1) 10Volans: tests: relax Prospector dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/596076 [00:25:27] (03PS1) 10Volans: tests: relax Bandit dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/596077 [00:25:29] (03PS1) 10Volans: actions: new module to track cookbook actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/596078 [00:33:31] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [00:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:33] (03CR) 10Volans: [C: 03+2] decorators: fix newly reported flake8 issue [software/spicerack] - 10https://gerrit.wikimedia.org/r/596074 (owner: 10Volans) [00:43:22] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [00:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:23] (03Merged) 10jenkins-bot: decorators: fix newly reported flake8 issue [software/spicerack] - 10https://gerrit.wikimedia.org/r/596074 (owner: 10Volans) [00:49:43] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [00:54:26] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) [00:55:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: (Need By: TBD) rack/setup/install WMCS 10G switches - 
https://phabricator.wikimedia.org/T251632 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson switches racked in c8 and d5 added to Netbox [02:10:49] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [02:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:51] PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 5192 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:27:54] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [02:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:55] PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 6886 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:01:03] RECOVERY - WDQS high update lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1157 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:03:47] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:03:51] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response [04:03:51] 
pi/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [04:09:15] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [04:09:15] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 86, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:11:50] 10Operations, 10MediaWiki-extensions-CodeReview, 10Patch-For-Review: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (10Legoktm) @Dzahn the dump is at `mwmaint1002:/home/legoktm/codereview.tar.gz` [04:38:14] (03PS2) 10KartikMistry: Update cxserver to 2020-05-11-082207-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/595872 (https://phabricator.wikimedia.org/T250004) [04:40:40] * kart_ updating cxserver.. [04:41:09] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2020-05-11-082207-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/595872 (https://phabricator.wikimedia.org/T250004) (owner: 10KartikMistry) [04:41:27] (03Merged) 10jenkins-bot: Update cxserver to 2020-05-11-082207-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/595872 (https://phabricator.wikimedia.org/T250004) (owner: 10KartikMistry) [04:42:49] !log kartik@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [04:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:21] !log kartik@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[04:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:29] !log kartik@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'cxserver' for release 'production' . [04:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:45] RECOVERY - WDQS high update lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1086 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [04:52:50] !log Updated cxserver to 2020-05-11-082207-production (T250004) [04:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:52:53] T250004: Add Sakha MT support to Content Translation - https://phabricator.wikimedia.org/T250004 [04:53:03] PROBLEM - PHP7 rendering on mw1374 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:53:47] PROBLEM - Apache HTTP on mw1374 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:01:43] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603 (10Base) (Seeing as it might take months for this to be addressed just want to mention that this is a problem I have for a while, so i... 
[05:06:02] 10Operations, 10Mail: lists1001 alerting on mailmain processes - https://phabricator.wikimedia.org/T252615 (10Marostegui) [05:06:14] 10Operations, 10Mail: lists1001 alerting on mailman processes - https://phabricator.wikimedia.org/T252615 (10Marostegui) p:05Triage→03Medium [05:10:41] !log root@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [05:10:42] !log root@cumin1001 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [05:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:54] !log root@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [05:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:05] !log root@cumin1001 Updating IPMI password on 1 hosts - root@cumin1001 [05:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:46] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [05:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:28] 10Operations, 10ops-codfw, 10Analytics: furud mgmt interface is down - https://phabricator.wikimedia.org/T252616 (10Marostegui) [05:13:45] 10Operations, 10ops-codfw, 10Analytics: furud mgmt interface is down - https://phabricator.wikimedia.org/T252616 (10Marostegui) p:05Triage→03Medium [05:14:29] ACKNOWLEDGEMENT - Host furud.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Marostegui https://phabricator.wikimedia.org/T252616 - The acknowledgement expires at: 2020-05-14 05:14:17. [05:15:11] elukey: you guys own furud per site.pp? 
^ [05:16:13] (03CR) 10Marostegui: [C: 03+2] mysql misc: add access for cloudcontrol1005 to m5-master [puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [05:16:14] (03CR) 10Marostegui: [C: 03+2] "> I've already applied the change from this patch by hand -- if there" [puppet] - 10https://gerrit.wikimedia.org/r/595207 (https://phabricator.wikimedia.org/T252121) (owner: 10Andrew Bogott) [05:20:21] marostegui: ah yes thanks! only mgmt right? [05:20:28] yep [05:23:28] <3 [05:26:51] RECOVERY - PHP7 rendering on mw1374 is OK: HTTP OK: HTTP/1.1 200 OK - 75060 bytes in 0.250 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:27:06] <_joe_> !log restarting php-fpm on mw1374, children dying with SIGILL [05:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:35] RECOVERY - Apache HTTP on mw1374 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:32:32] !log wdqs1003 was depooled ~6 hours ago and was re-pooled ~10 mins ago after verifying the wdqs service was healthy [05:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:45] !log wdqs2004 was depooled ~3 hours ago and was re-pooled ~10 mins ago after verifying the wdqs service was healthy [05:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:16] * ryankemper just realized "wdqs service" is like saying "PIN number" [06:06:57] RECOVERY - dump of s1 in codfw on db1115 is OK: Last dump for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2020-05-12 17:23:07 (154 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:28:58] (03PS1) 10AntiCompositeNumber: engine.imagemagick: Catch error when pyexiv2 can't find metadata [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/596093 
(https://phabricator.wikimedia.org/T245440) [06:49:06] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add integration tests using docker-compose [software/purged] - 10https://gerrit.wikimedia.org/r/594148 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [06:56:23] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:56:59] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10Joe) FWIW, I think I remember systemctl status php7.2-fpm to stall on a busy server, but I might remember incorrectly. [07:02:19] (03PS2) 10Jcrespo: monitoring: remove usages of 'dba' contact group [puppet] - 10https://gerrit.wikimedia.org/r/595149 (https://phabricator.wikimedia.org/T237927) [07:04:10] (03CR) 10Jcrespo: [C: 03+2] "+2 after discussing this with manuel last week clarifying the (mostly) noop." 
[puppet] - 10https://gerrit.wikimedia.org/r/595149 (https://phabricator.wikimedia.org/T237927) (owner: 10Jcrespo) [07:08:03] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:08:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Improve the kafka consumer interface [software/purged] - 10https://gerrit.wikimedia.org/r/594953 (owner: 10Giuseppe Lavagetto) [07:10:49] (03CR) 10Jcrespo: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/556207 (owner: 10Jbond) [07:12:40] (03CR) 10Jcrespo: "If you saw it on clients, the patch is fine but it would be on the wrong class- it should be applied to the backup client ones, not the di" [puppet] - 10https://gerrit.wikimedia.org/r/556207 (owner: 10Jbond) [07:14:19] !log upload spark2_2.4.4-bin-hadoop2.6-2 for buster/stretch on apt1001 [07:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:19] (03PS6) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [07:18:57] (03PS1) 10Marostegui: mariadb: Increase version [software] - 10https://gerrit.wikimedia.org/r/596140 (https://phabricator.wikimedia.org/T250666) [07:19:22] (03CR) 10Marostegui: [C: 04-1] "Not ready yet" [software] - 10https://gerrit.wikimedia.org/r/596140 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [07:19:28] (03CR) 10Jcrespo: [C: 03+1] "yay" [software] - 10https://gerrit.wikimedia.org/r/596140 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [07:23:36] (03PS1) 10JMeybohm: restrouter: Remove chart and namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/596141 (https://phabricator.wikimedia.org/T242461) [07:24:03] (03PS1) 10Elukey: Update version in 
changelog for Buster and update README [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) [07:25:45] (03PS1) 10Giuseppe Lavagetto: Release 0.11 [software/purged] - 10https://gerrit.wikimedia.org/r/596143 [07:25:57] (03CR) 10jerkins-bot: [V: 04-1] Release 0.11 [software/purged] - 10https://gerrit.wikimedia.org/r/596143 (owner: 10Giuseppe Lavagetto) [07:27:17] (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595215 (owner: 10EBernhardson) [07:29:18] !log roll-restart logstash in codfw/eqiad for configuration change [07:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:36] (03CR) 10JMeybohm: [C: 03+2] New upstream version 2.16.7 [debs/helm] - 10https://gerrit.wikimedia.org/r/595591 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [07:36:47] (03CR) 10Giuseppe Lavagetto: "recheck" [software/purged] - 10https://gerrit.wikimedia.org/r/596143 (owner: 10Giuseppe Lavagetto) [07:38:07] 10Operations, 10observability: smart-data-dump --syslog producing errors and spamming root@ - https://phabricator.wikimedia.org/T252500 (10Marostegui) Thanks Cole! [07:41:08] !log imported helm 2.16.7-1 to main for buster-wikimedia [07:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:40] !log imported helm 2.16.7-1 to main for stretch-wikimedia [07:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:39] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10Gilles) [[ https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host?orgId=1 | The dashboard ]] won't open for me now, it's stuck on a spinner: {F... [07:44:00] (03CR) 10Filippo Giunchedi: "Thanks for fixing this! 
I'm getting connection refused from netbox2001.wikimedia.org:8443 from prometheus2003, expected?" [puppet] - 10https://gerrit.wikimedia.org/r/595991 (owner: 10CRusnov) [07:45:39] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10Marostegui) I can see the signature at `Tue, May 12, 17:36`. This needs approval from @leila / @Nuria [07:50:07] (03PS2) 10Giuseppe Lavagetto: Release 0.11 [software/purged] - 10https://gerrit.wikimedia.org/r/596143 [07:50:13] (03CR) 10jerkins-bot: [V: 04-1] Release 0.11 [software/purged] - 10https://gerrit.wikimedia.org/r/596143 (owner: 10Giuseppe Lavagetto) [07:50:38] 10Operations, 10ops-eqord, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10ayounsi) Remote hands replaced the optics yesterday but the link is still down. Lights are correct. Emailed Telia 12h ago with: ` Remote hands replaced the optic, we're still seeing... [07:51:08] (03CR) 10Jcrespo: [C: 04-1] "Undesired side effects on some hosts arose on Puppet compiler, needs fixing." 
[puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [07:53:00] (03PS1) 10Kormat: install_server: Disallow reimaging of pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/596146 (https://phabricator.wikimedia.org/T252182) [07:54:08] (03CR) 10Marostegui: [C: 03+1] install_server: Disallow reimaging of pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/596146 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [07:55:11] (03PS3) 10Giuseppe Lavagetto: Release 0.11 [software/purged] - 10https://gerrit.wikimedia.org/r/596143 [07:55:44] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: move swift::params::swift_cluster to profile::swift::cluster [puppet] - 10https://gerrit.wikimedia.org/r/595930 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [07:56:03] (03CR) 10Kormat: [C: 03+2] install_server: Disallow reimaging of pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/596146 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [07:56:10] 10Operations: ferm refresh after initial puppet run failed to reload ferm rules - https://phabricator.wikimedia.org/T252622 (10Dzahn) [07:59:36] (03PS7) 10Jcrespo: mariadb: Default monitor disk paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [08:02:17] (03PS8) 10Jcrespo: mariadb: Default monitor disk & process paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [08:02:19] (03PS2) 10Gilles: engine.imagemagick: Catch error when pyexiv2 can't find metadata [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/596093 (https://phabricator.wikimedia.org/T245440) (owner: 10AntiCompositeNumber) [08:04:24] (03CR) 10Gilles: [V: 03+2 C: 03+2] engine.imagemagick: Catch error when pyexiv2 can't find metadata [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/596093 (https://phabricator.wikimedia.org/T245440) 
(owner: 10AntiCompositeNumber) [08:06:07] (03PS1) 10Filippo Giunchedi: prometheus: add thanos::sidecar to services and analytics [puppet] - 10https://gerrit.wikimedia.org/r/596147 (https://phabricator.wikimedia.org/T233956) [08:07:05] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add thanos::sidecar to services and analytics [puppet] - 10https://gerrit.wikimedia.org/r/596147 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [08:08:51] (03CR) 10Dzahn: [C: 03+2] add IPv6 records for people1002 [dns] - 10https://gerrit.wikimedia.org/r/595957 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [08:08:57] (03PS2) 10Dzahn: add IPv6 records for people1002 [dns] - 10https://gerrit.wikimedia.org/r/595957 (https://phabricator.wikimedia.org/T247649) [08:13:20] (03CR) 10Dzahn: [C: 03+2] site: add peopleweb role to people1002 [puppet] - 10https://gerrit.wikimedia.org/r/595956 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [08:14:16] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add thanos::sidecar to services and analytics [puppet] - 10https://gerrit.wikimedia.org/r/596147 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [08:19:41] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline and please include a PCC run with noop(s) too to confirm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [08:19:50] 10Operations, 10Performance-Team: Lower per-IP PoolCounter throttling Thumbor settings - https://phabricator.wikimedia.org/T252426 (10Gilles) [08:20:59] (03CR) 10Jcrespo: [C: 03+1] "Right now this does everything I expected: https://puppet-compiler.wmflabs.org/compiler1002/22497/" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [08:21:18] (03CR) 10Jcrespo: [C: 03+1] mariadb: Default monitor disk & process paging to false [puppet] - 
10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [08:22:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Release 0.11 [software/purged] - 10https://gerrit.wikimedia.org/r/596143 (owner: 10Giuseppe Lavagetto) [08:25:47] (03PS1) 10Gilles: Lower thresholds for Thumbor per-IP throttling [puppet] - 10https://gerrit.wikimedia.org/r/596149 (https://phabricator.wikimedia.org/T252426) [08:26:30] 10Operations, 10Performance-Team, 10Patch-For-Review: Lower per-IP PoolCounter throttling Thumbor settings - https://phabricator.wikimedia.org/T252426 (10Gilles) a:05Gilles→03None [08:26:32] (03PS1) 10JMeybohm: raw: Add apiVersion (helm lint), remove appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) [08:26:49] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Lower per-IP PoolCounter throttling Thumbor settings - https://phabricator.wikimedia.org/T252426 (10Gilles) [08:27:13] (03PS2) 10JMeybohm: raw: Add apiVersion (helm lint), remove appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) [08:27:46] (03PS3) 10JMeybohm: raw: Add apiVersion (helm lint), remove appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) [08:29:43] (03PS1) 10Dzahn: peopleweb: allow rsyncing /home to a new server [puppet] - 10https://gerrit.wikimedia.org/r/596151 (https://phabricator.wikimedia.org/T247649) [08:30:20] 10Operations, 10Traffic, 10Patch-For-Review: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 (10ema) The PR adding a function to the TS API for getting maxage [[ https://github.com/apache/trafficserver/pul... 
[08:30:43] (03CR) 10jerkins-bot: [V: 04-1] peopleweb: allow rsyncing /home to a new server [puppet] - 10https://gerrit.wikimedia.org/r/596151 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [08:32:03] (03PS2) 10Dzahn: peopleweb: allow rsyncing /home to a new server [puppet] - 10https://gerrit.wikimedia.org/r/596151 (https://phabricator.wikimedia.org/T247649) [08:32:32] (03PS1) 10Kormat: db-eqiad.php: Pool pc1010 for pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) [08:34:10] (03PS3) 10RhinosF1: Localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 [08:34:51] (03PS4) 10RhinosF1: Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 [08:34:59] (03CR) 10Marostegui: [C: 04-1] "I would suggest to change the entire line, otherwise it can be confusing." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:35:38] (03CR) 10jerkins-bot: [V: 04-1] Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 (owner: 10RhinosF1) [08:36:22] (03CR) 10Dzahn: [C: 03+2] peopleweb: allow rsyncing /home to a new server [puppet] - 10https://gerrit.wikimedia.org/r/596151 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [08:36:48] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22499/people1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596151 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [08:37:48] (03PS5) 10RhinosF1: Site name & meta namespace localisations for ti[wikipedia|wiktionary] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595883 (https://phabricator.wikimedia.org/T251287) [08:40:23] (03PS2) 10Kormat: db-eqiad.php: Pool pc1010 for pc1 [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) [08:41:29] (03CR) 10Marostegui: [C: 04-1] db-eqiad.php: Pool pc1010 for pc1 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:42:21] (03CR) 10Gilles: [C: 03+1] Decommission old ArcLamp HHVM pipeline [puppet] - 10https://gerrit.wikimedia.org/r/595602 (https://phabricator.wikimedia.org/T233884) (owner: 10Dave Pifke) [08:43:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Miriam) @KFrancis thanks for your kind confirmation! And thanks @colewhite for helping out. According to your list, the last point should... [08:44:05] (03PS3) 10Kormat: db-eqiad.php: Pool pc1010 for pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) [08:44:49] (03CR) 10Marostegui: [C: 03+1] db-eqiad.php: Pool pc1010 for pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:44:57] (03CR) 10Kormat: "> Patch Set 2: Code-Review-1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:45:48] (03PS1) 10Filippo Giunchedi: swift: clean up 'storage_policies' from swift::params [puppet] - 10https://gerrit.wikimedia.org/r/596155 (https://phabricator.wikimedia.org/T252537) [08:48:05] (03CR) 10Kormat: [C: 03+2] db-eqiad.php: Pool pc1010 for pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:48:24] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/22500/" [puppet] - 10https://gerrit.wikimedia.org/r/596155 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [08:48:50] 
10Operations: ferm refresh after initial puppet run failed to reload ferm rules - https://phabricator.wikimedia.org/T252622 (10Dzahn) After i applied a role to it and made some changes to it.. which included opening a firewall hole for rsyncd.. this happened to me again. I saw puppet do the service refresh of f... [08:48:57] (03Merged) 10jenkins-bot: db-eqiad.php: Pool pc1010 for pc1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596152 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [08:52:09] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1010 as pc1 master T252182 (duration: 01m 17s) [08:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:13] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [08:52:59] kormat: make sure to check the processlist before stopping mysql on pc1007, also remember to disable (or downtime) replication checks for pc1010 and pc2007 as those are the two replicating from pc1007 [08:53:06] <_joe_> !log uploaded purged 0.11 [08:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:08] otherwise pc1010 will page [08:53:35] yep, on it. 
[08:54:05] kormat: thanks :* [08:55:59] 10Operations: ferm refresh after initial puppet run failed to reload ferm rules - https://phabricator.wikimedia.org/T252622 (10Dzahn) p:05Triage→03High [08:59:53] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [09:00:33] !log disabling puppet temporarily [09:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:01] (03PS1) 10Jbond: ferm: ferm has a status script now so instruct puppet [puppet] - 10https://gerrit.wikimedia.org/r/596157 [09:01:58] (03CR) 10Muehlenhoff: [C: 03+2] Remove component integration for Puppet 5 / Facter 3 on jessie/stretch [puppet] - 10https://gerrit.wikimedia.org/r/583028 (owner: 10Muehlenhoff) [09:03:07] (03CR) 10Jbond: [C: 03+2] ferm: ferm has a status script now so instruct puppet [puppet] - 10https://gerrit.wikimedia.org/r/596157 (owner: 10Jbond) [09:03:36] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:51] ^^ this is me testing [09:04:40] thanks [09:04:46] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [09:05:08] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10elukey) Some notes: * Added https://grafana.wikimedia.org/d/000000174/redis?panelId=14&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-job=redis... [09:05:37] ACKNOWLEDGEMENT - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. John Bond testing ferm-status https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM! thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/596012 (https://phabricator.wikimedia.org/T250863) (owner: 10Bstorm) [09:08:10] !log rsyncing /home dirs from people.wikimedia.org to new backend people1002 [09:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:04] (03PS1) 10Kormat: install_server: Allow reimage of pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/596158 (https://phabricator.wikimedia.org/T252182) [09:10:22] (03CR) 10Marostegui: [C: 03+1] install_server: Allow reimage of pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/596158 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [09:10:33] (03CR) 10Kormat: [C: 03+2] install_server: Allow reimage of pc1007 [puppet] - 10https://gerrit.wikimedia.org/r/596158 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [09:11:14] !log re-enabling puppet [09:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:40] PROBLEM - Check the last execution of idp-u2f-sync on idp-test2001 is CRITICAL: CRITICAL: Status of the systemd unit idp-u2f-sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:18:01] ^^ also me [09:18:32] ACKNOWLEDGEMENT - Check the last execution of idp-u2f-sync on idp-test2001 is CRITICAL: CRITICAL: Status of the systemd unit idp-u2f-sync John Bond testing ferm-status https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:18:46] (03PS1) 10Ema: Backport patches adding ts.server_response.get_maxage() [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596160 (https://phabricator.wikimedia.org/T251537) [09:19:07] (03PS1) 10Jbond: ferm: restart ferm when its stopped [puppet] - 10https://gerrit.wikimedia.org/r/596161 [09:19:49] (03PS2) 10Ema: Backport patches adding ts.server_response.get_maxage() [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596160 (https://phabricator.wikimedia.org/T251537) [09:21:37] <_joe_> !log installing purged 0.11 on 
cp2028 T133821 [09:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:41] T133821: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 [09:21:47] (03CR) 10Jbond: [C: 03+2] ferm: restart ferm when its stopped [puppet] - 10https://gerrit.wikimedia.org/r/596161 (owner: 10Jbond) [09:24:55] (03PS1) 10Jbond: ferm: correct restart command [puppet] - 10https://gerrit.wikimedia.org/r/596162 [09:25:06] (03PS1) 10Muehlenhoff: Point external debmonitor links to CAS-enabled Icinga/Puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/596163 [09:25:10] (03CR) 10Jbond: [V: 03+2 C: 03+2] ferm: correct restart command [puppet] - 10https://gerrit.wikimedia.org/r/596162 (owner: 10Jbond) [09:32:51] <_joe_> !log installing purged 0.11 on cp2027 T133821 [09:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:54] T133821: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 [09:35:29] 10Operations: ferm refresh after initial puppet run failed to reload ferm rules - https://phabricator.wikimedia.org/T252622 (10jbond) This seems to have been caused by the introduction of the ferm-status script. when the ferm-status script ran it detects that the ferm rules are incorrect and inbstructes puppet... 
[09:35:48] 10Operations: ferm refresh after initial puppet run failed to reload ferm rules - https://phabricator.wikimedia.org/T252622 (10jbond) 05Open→03Resolved a:03jbond Closing, please re-open if you still see this [09:37:03] !log Upgrade db2102 to the new 10.4.13 - T250666 [09:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:06] T250666: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 [09:37:37] 10Operations, 10netops, 10observability: LibreNMS monitoring glitch caused paging - https://phabricator.wikimedia.org/T252630 (10ayounsi) p:05Triage→03Medium [09:38:23] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:42] 10Operations, 10netops: Upgrade Junos on asw2-esams - https://phabricator.wikimedia.org/T252631 (10ayounsi) p:05Triage→03Low [09:39:44] (03CR) 10Vgutierrez: [C: 03+1] Backport patches adding ts.server_response.get_maxage() [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596160 (https://phabricator.wikimedia.org/T251537) (owner: 10Ema) [09:40:16] 10Operations, 10netops, 10observability: LibreNMS monitoring glitch caused paging - https://phabricator.wikimedia.org/T252630 (10ayounsi) [09:40:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:23] (03PS2) 10Volans: actions: new module to track cookbook actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/596078 [09:41:41] (03PS3) 10Ema: Release 8.0.7-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596160 (https://phabricator.wikimedia.org/T251537) [09:43:18] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [09:44:25] this is interesting --^ [09:44:58] it is very similar to what happens to cloudelastic [09:46:04] ah, I just downtimed those alerts for logstash btw, happens only on 5 and we'll be switching to elk 7 anyways [09:46:16] so IMHO not worth spending time looking at it [09:47:18] makes sense, I was checking if it was under reindex pressure but it doesn't seem so [09:47:55] what we have seen for cloudelastic is that under reindex pressure it ends up thrashing, namely the old gen pool fills up and the jvm spends a ton of time on it [09:48:01] we tried the CMS GC but it didn't help much [09:48:23] in this case, es on logstash is probably under read pressure? [09:48:25] not sure [09:48:34] it might be an issue that re-presents with elk 7 [09:51:30] yeah quite possible! [09:51:31] (03PS1) 10Jbond: ferm: use /bin/systemctl not /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/596170 [09:52:36] (03CR) 10Jbond: [C: 03+2] ferm: use /bin/systemctl not /usr/bin [puppet] - 10https://gerrit.wikimedia.org/r/596170 (owner: 10Jbond) [09:53:08] (03CR) 10Marostegui: mariadb: Increase version [software] - 10https://gerrit.wikimedia.org/r/596140 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [09:53:10] (03CR) 10Marostegui: [C: 03+2] mariadb: Increase version [software] - 10https://gerrit.wikimedia.org/r/596140 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [09:55:06] !log deployed a fix to ferm-status script. 
unmanaged ferm rules may get removed [09:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:14] (03PS9) 10Jcrespo: mariadb: Default monitor disk & process paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) [09:55:16] (03PS1) 10Jcrespo: icinga: Disable notifications for db2097 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/596171 (https://phabricator.wikimedia.org/T252492) [09:55:20] (03CR) 10Giuseppe Lavagetto: purged: add support for kafka (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [09:55:45] (03PS7) 10Giuseppe Lavagetto: purged: add support for kafka [puppet] - 10https://gerrit.wikimedia.org/r/595502 (https://phabricator.wikimedia.org/T133821) [09:55:47] (03PS3) 10Giuseppe Lavagetto: cache::text: enable reading purges from kafka on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/595905 (https://phabricator.wikimedia.org/T133821) [09:55:54] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] icinga: Disable notifications for db2097 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/596171 (https://phabricator.wikimedia.org/T252492) (owner: 10Jcrespo) [09:55:58] (03PS1) 10ArielGlenn: mediawiki: Disable fewestrevisions for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/596172 (https://phabricator.wikimedia.org/T238199) [09:56:06] (03PS2) 10Jcrespo: icinga: Disable notifications for db2097 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/596171 (https://phabricator.wikimedia.org/T252492) [09:56:10] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] icinga: Disable notifications for db2097 for maintenance [puppet] - 10https://gerrit.wikimedia.org/r/596171 (https://phabricator.wikimedia.org/T252492) (owner: 10Jcrespo) [09:56:47] 10Operations, 10serviceops, 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - 
https://phabricator.wikimedia.org/T252391 (10elukey) 05Open→03Stalled Precisely, let's hold this task until T243106 is completed. [09:56:53] 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10elukey) [09:59:16] (03CR) 10Muehlenhoff: "Yeah, that seems to me if we can't fix the underlying issue in the short term." [puppet] - 10https://gerrit.wikimedia.org/r/593166 (https://phabricator.wikimedia.org/T251349) (owner: 10Dzahn) [09:59:54] (03CR) 10Ema: [C: 03+2] Release 8.0.7-1wm5 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596160 (https://phabricator.wikimedia.org/T251537) (owner: 10Ema) [10:00:29] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:31] (03PS5) 10JMeybohm: mathoid: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [10:02:28] (03PS1) 10Kormat: Revert "db-eqiad.php: Pool pc1010 for pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596173 (https://phabricator.wikimedia.org/T252182) [10:02:43] (03PS2) 10Kormat: Revert "db-eqiad.php: Pool pc1010 for pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596173 (https://phabricator.wikimedia.org/T252182) [10:03:37] RECOVERY - Check the last execution of idp-u2f-sync on idp-test2001 is OK: OK: Status of the systemd unit idp-u2f-sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [10:05:24] (03CR) 10Kormat: [C: 03+2] Revert "db-eqiad.php: Pool pc1010 for pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596173 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [10:06:07] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Pool pc1010 for pc1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596173 
(https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [10:09:03] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 as pc1 master T252182 (duration: 01m 05s) [10:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:06] T252182: Upgrade parsercache to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T252182 [10:10:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo) [10:11:28] (03PS1) 10Muehlenhoff: Enable CAS staging host for Icinga [puppet] - 10https://gerrit.wikimedia.org/r/596174 [10:12:30] (03PS1) 10Volans: aptrepro: add spicerack compoment for buster [puppet] - 10https://gerrit.wikimedia.org/r/596175 [10:12:43] (03PS1) 10Kormat: Revert "install_server: Allow reimage of pc1007" [puppet] - 10https://gerrit.wikimedia.org/r/596176 (https://phabricator.wikimedia.org/T252182) [10:13:13] kormat: pc1007 looking good with activity and pc1010 has stopped it [10:13:15] (03CR) 10JMeybohm: [C: 03+2] mathoid: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [10:13:25] marostegui: +1 [10:13:30] 10Operations, 10ops-codfw, 10DBA: db2097 memory errors leading to crash - https://phabricator.wikimedia.org/T252492 (10jcrespo) a:05jcrespo→03Papaul @Papaul please helps us out. This seems like an ordinary dimm failure, but we need to do the usual swap to discard board/processor. This happened before at... 
[10:13:36] (03Merged) 10jenkins-bot: mathoid: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558092 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [10:15:34] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/596175 (owner: 10Volans) [10:15:42] (03CR) 10Kormat: [C: 03+2] Revert "install_server: Allow reimage of pc1007" [puppet] - 10https://gerrit.wikimedia.org/r/596176 (https://phabricator.wikimedia.org/T252182) (owner: 10Kormat) [10:16:56] (03CR) 10Volans: [C: 03+2] aptrepro: add spicerack compoment for buster [puppet] - 10https://gerrit.wikimedia.org/r/596175 (owner: 10Volans) [10:18:51] (03PS1) 10Filippo Giunchedi: swift: add ability to toggle WMF-specific filters [puppet] - 10https://gerrit.wikimedia.org/r/596177 (https://phabricator.wikimedia.org/T252537) [10:19:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/596069 (https://phabricator.wikimedia.org/T252606) (owner: 10Alex Monk) [10:22:05] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/596075 (owner: 10Volans) [10:22:07] (03CR) 10Filippo Giunchedi: "PCC expected diff in frontends https://puppet-compiler.wmflabs.org/compiler1001/22501/" [puppet] - 10https://gerrit.wikimedia.org/r/596177 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [10:33:07] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/556207 (owner: 10Jbond) [10:34:00] !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[10:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/596078 (owner: 10Volans) [10:44:58] (03PS5) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) [10:45:34] (03CR) 10Jcrespo: "> > Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/556207 (owner: 10Jbond) [10:49:14] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/556207 (owner: 10Jbond) [10:54:13] (03CR) 10Elukey: [C: 03+1] "> Right now this does everything I expected: https://puppet-compiler.wmflabs.org/compiler1002/22497/" [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [10:55:05] !log imported tqdm 4.11.2-1 packages into buster-wikimedia component/spicerack [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:34] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: add stubs for secrets [labs/private] - 10https://gerrit.wikimedia.org/r/595985 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:55:47] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] changeprop-jobqueue: add stubs for secrets [labs/private] - 10https://gerrit.wikimedia.org/r/595985 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [10:56:14] (03CR) 10Volans: [C: 03+2] ipmi: fix subprocess.run calls to raise on failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/596075 (owner: 10Volans) [10:56:52] (03CR) 10Volans: [C: 03+2] "spacing/comments only" [software/spicerack] - 10https://gerrit.wikimedia.org/r/596076 (owner: 10Volans) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1100). 
Please do the needful. [11:00:04] Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] I’ll be there in a few minutes [11:00:20] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'canary' . [11:00:20] !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mathoid' for release 'production' . [11:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:24] Lucas_WMDE: let me know when you're done [11:00:31] I need to make the actual patch :D [11:00:37] (03PS2) 10Jcrespo: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [11:00:57] (03CR) 10Jcrespo: "checking ci..." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [11:01:00] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [11:01:06] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/596163 (owner: 10Muehlenhoff) [11:01:20] ok, I’m starting [11:02:09] (03Merged) 10jenkins-bot: ipmi: fix subprocess.run calls to raise on failure [software/spicerack] - 10https://gerrit.wikimedia.org/r/596075 (owner: 10Volans) [11:02:44] (03PS2) 10Lucas Werkmeister (WMDE): Anchor RegExp for Data Bridge in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 [11:02:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 (owner: 10Lucas Werkmeister (WMDE)) [11:02:54] (03Merged) 10jenkins-bot: tests: relax Prospector dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/596076 (owner: 10Volans) [11:03:33] (03Merged) 10jenkins-bot: Anchor RegExp for Data Bridge in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595544 (owner: 10Lucas Werkmeister (WMDE)) [11:03:53] (03CR) 10Jcrespo: "Strange, it is complaining about unrelated files, need more time to look at why and fix it on separate patches if needed (it didn't use to" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [11:04:01] pulling onto mwdebug1001 for a moment [11:04:22] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (10Marostegui) >>! In T250032#6106883, @herron wrote: >>>! 
In T250032#6106545, @bd808 wrote: >> Received: from lists1001.wikim... [11:04:31] site doesn’t seem broken, syncing [11:04:45] 10Operations, 10Mail: lists1001 alerting on mailman processes - https://phabricator.wikimedia.org/T252615 (10Marostegui) [11:04:47] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Duplicate "moderator request(s) waiting" emails sent to list admins - https://phabricator.wikimedia.org/T250032 (10Marostegui) [11:04:57] 10Operations: ferm refresh after initial puppet run failed to reload ferm rules - https://phabricator.wikimedia.org/T252622 (10Dzahn) Thank you for the prompt fix! Will do. [11:05:20] (03PS1) 10Vgutierrez: Release 8.0.7-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596179 (https://phabricator.wikimedia.org/T249335) [11:05:34] (03CR) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) (owner: 10Arturo Borrero Gonzalez) [11:06:07] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: [[gerrit:595544|Anchor RegExp for Data Bridge in Beta (BETA-ONLY)]] (duration: 01m 06s) [11:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:39] (03PS6) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 [11:06:50] hum [11:07:19] (03PS1) 10Ladsgroup: Disable wgLegacyJavaScriptGlobals on fawiki and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596180 (https://phabricator.wikimedia.org/T72470) [11:07:25] 11:06:02 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1385.eqiad.wmnet returned [4]: NOT restarting php7.2-fpm: free opcache 198 MB … 1 hosts had failures restarting php-fpm [11:07:32] should I worry about that? 
[11:08:09] it was a beta-only change, so even if that appserver is now running with a slightly older wmf-config, it probably doesn’t matter [11:08:36] (03PS5) 10Dzahn: admins: new admin group to manage bulk jobs on Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/593166 (https://phabricator.wikimedia.org/T251349) [11:08:48] (03PS1) 10Apakhomov: Added support egress rules for blubberoid chart. Added egress template in common _helpers.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/596181 [11:09:12] Lucas_WMDE: beta only changes can go without SWAT, also it doesn't need syncing and it gets deployed to beta automatically [11:09:17] !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mathoid' for release 'production' . [11:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:42] (just rebase in deploy1001) [11:09:44] ok [11:09:51] anyways, I’m done, go ahead [11:09:55] break everyone’s userscripts >:) [11:10:32] (03PS1) 10Ema: varnish: add stp script to debug vcl reference leak [puppet] - 10https://gerrit.wikimedia.org/r/596182 (https://phabricator.wikimedia.org/T236754) [11:10:45] (03CR) 10Dzahn: [C: 03+2] admins: new admin group to manage bulk jobs on Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/593166 (https://phabricator.wikimedia.org/T251349) (owner: 10Dzahn) [11:10:46] It was deprecated six years ago :P [11:11:42] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596180 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:12:25] Krinkle: FYI ^ [11:12:34] (03Merged) 10jenkins-bot: Disable wgLegacyJavaScriptGlobals on fawiki and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596180 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:13:34] (03CR) 10Ema: [C: 03+2] varnish: add stp script to debug vcl reference leak [puppet] - 10https://gerrit.wikimedia.org/r/596182 
(https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [11:14:41] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:596180|Disable wgLegacyJavaScriptGlobals on fawiki and wikidatawiki (T72470)]] (duration: 01m 06s) [11:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:45] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [11:16:02] @bang 8 [11:16:19] jouncebot: next [11:16:19] In 0 hour(s) and 43 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1200) [11:17:07] !log EU SWAT is done [11:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:17] mutante: you can go ahead if you want to :) [11:18:00] Amir1: ah, thanks, all good :) [11:18:01] (03PS1) 10Hnowlan: role::deployment_server: add changeprop-jobqueue [puppet] - 10https://gerrit.wikimedia.org/r/596183 (https://phabricator.wikimedia.org/T220399) [11:20:08] @seen DannyB [11:20:08] mutante: I have never seen DannyB [11:22:55] (03PS4) 10Privacybatm: transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [11:30:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10Dzahn) >>! In T251349#6119971, @MBinder_WMF wrote: > Thanks for moving this along. Please let me know... [11:31:41] Amir1: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/595881/ can go straight in if there’s time [11:32:11] Me needs to stop forgetting lunch swat. No test need for that [11:32:24] mutante: is it fine? [11:32:38] RhinosF1: sure [11:33:13] Amir1: i am not doing anything that would interfere with deployments. 
so if you mean that, yea [11:33:30] ty [11:33:52] coolio [11:33:53] Thanks [11:34:11] (03PS3) 10Ladsgroup: Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595881 (https://phabricator.wikimedia.org/T252296) (owner: 10RhinosF1) [11:34:12] i can confirm the translation :) [11:34:56] (03CR) 10Ladsgroup: [C: 03+2] Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595881 (https://phabricator.wikimedia.org/T252296) (owner: 10RhinosF1) [11:35:20] (03CR) 10Dzahn: [C: 03+1] "looks good, hosted by https://www.fiz-karlsruhe.de/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595881 (https://phabricator.wikimedia.org/T252296) (owner: 10RhinosF1) [11:35:52] mutante: Mein Duetsch ist nicht schlecht :D [11:35:52] (03Merged) 10jenkins-bot: Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595881 (https://phabricator.wikimedia.org/T252296) (owner: 10RhinosF1) [11:36:12] Amir1: sehr gut :) [11:36:25] danke [11:38:04] (03CR) 10Dzahn: "new group phabricator-bulk-manager now exists on phab servers. the next step will be to create a shell account for Max Binder and add him " [puppet] - 10https://gerrit.wikimedia.org/r/593166 (https://phabricator.wikimedia.org/T251349) (owner: 10Dzahn) [11:38:18] * RhinosF1 used google translate for the english name [11:38:31] (03PS1) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 [11:38:54] RhinosF1: ever played the Wikidata Game (v1) ? 
[11:39:04] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:595881|Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains (T252296)]] (duration: 01m 06s) [11:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:07] T252296: Add *.deutsche-digitale-bibliothek.de to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T252296 [11:39:49] mutante: what's that? Last time I messed with wikidata. I got headache. [11:40:19] (03CR) 10jerkins-bot: [V: 04-1] Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/596187 (owner: 10Muehlenhoff) [11:40:28] RhinosF1: it's a way to edit wikidata and add missing info but a bit "gamified" and machine help https://tools.wmflabs.org/wikidata-game/# [11:40:56] for example it tries to guess which items without any property might be a person..and then asks you to just click Yes or No [11:41:21] * RhinosF1 looks [11:41:22] there are millions of items without any property at all left to do [11:41:27] (03Abandoned) 10Muehlenhoff: Add a Spicerack cook book to reboot hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/510113 (owner: 10Muehlenhoff) [11:41:31] most are in non-common languages [11:42:09] so i use Google Translate on all those random languages to see how good it is .. 
and if it's good enough to tell whether it is a person or not [11:42:54] also there are highscores to beat for each game, heh [11:42:54] cool [11:45:44] (03PS5) 10Privacybatm: transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [11:46:10] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [11:46:55] (03PS2) 10Dzahn: static-codereview: Add link to SQL dumps [puppet] - 10https://gerrit.wikimedia.org/r/596018 (https://phabricator.wikimedia.org/T243055) (owner: 10Legoktm) [11:47:07] (03CR) 10Dzahn: [C: 03+2] static-codereview: Add link to SQL dumps [puppet] - 10https://gerrit.wikimedia.org/r/596018 (https://phabricator.wikimedia.org/T243055) (owner: 10Legoktm) [11:50:10] (03CR) 10Marostegui: [C: 03+1] mariadb: Default monitor disk & process paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [11:50:31] (03PS6) 10Privacybatm: transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [11:50:55] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [11:51:00] (03PS1) 10Jcrespo: mariadb: Switchover zarcillo from db1115 to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) [11:54:57] (03CR) 10Dzahn: [C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/596019 
(https://phabricator.wikimedia.org/T243056) (owner: 10Legoktm) [11:56:48] (03CR) 10Privacybatm: "There two CI errors:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [11:56:57] (03PS6) 10Arturo Borrero Gonzalez: kubeadm: add wmcs-k8s-node-upgrade.py script [puppet] - 10https://gerrit.wikimedia.org/r/595964 (https://phabricator.wikimedia.org/T250867) [11:58:15] (03CR) 10Jcrespo: "CC @marostegui- this will move the canonical place of monitoring inventory and nil_grants tables, as well as icinga checks." [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:59:22] (03CR) 10Marostegui: "Thanks for the heads up, I can fix nil_grants stuff once moved" [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [11:59:30] (03CR) 10Jcrespo: "Will rename non-canonical tables on db1115 to zarcillo_old to prevent misunderstandings when deployed." [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1200) [12:00:08] mutante: why did you remind me to look at wikibase again? [12:00:16] spite [12:00:42] RhinosF1: did i? no, no, i just wanted you to play the game :) [12:01:02] you reminded me of it with the Google Translate comment [12:01:14] mutante: You mentioning wikidata reminded me and now I've dug myself a hole [12:01:28] try some language where you don't even know the alphabet [12:01:57] heh, i see, get out of the hole then [12:02:45] mutante: try testing wikibase on a non-wmf wiki. It's a mess. 
[12:03:25] RhinosF1: oh, yea, i believe it [12:04:57] (03PS1) 10Dzahn: static-codereview: activate Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/596193 (https://phabricator.wikimedia.org/T243056) [12:05:18] mutante: between wikibase and miraheze's lacking documentation. I'll have lost my hair when this is over. [12:06:50] (03PS2) 10Jcrespo: mariadb: Switchover zarcillo from db1115 to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) [12:07:55] RhinosF1: is miraheze introducing miradata ? [12:08:22] mutante: someone created the wiki, we only forgot to set wikibase up right [12:09:50] *nod* [12:09:54] good luck [12:10:31] * RhinosF1 goes to actually use the run script on install of extension function in ManageWiki [12:11:07] (03PS3) 10Jcrespo: mariadb: Switchover zarcillo from db1115 to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) [12:12:01] (03CR) 10Jcrespo: "I will move the metadata database configuration to a separate file on a later patch." [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [12:18:52] (03PS1) 10Jcrespo: dbtools: point zarcillo database scripts to use db2093 [software] - 10https://gerrit.wikimedia.org/r/596197 (https://phabricator.wikimedia.org/T138562) [12:19:23] 10Operations: Add IRC SRE bot for SAL !log actions to #wikimedia-serviceops - https://phabricator.wikimedia.org/T213196 (10Marostegui) Do we still want this? 
I am tagging #serviceops as I guess they have a said here :) [12:20:12] 10Operations, 10SRE-tools, 10Patch-For-Review: wmf-auto-reimage-host: failed to resolve mgmt FQDN while renaming host - https://phabricator.wikimedia.org/T214314 (10Marostegui) a:03Volans [12:22:32] (03CR) 10Jcrespo: "> There two CI errors:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [12:24:40] (03PS1) 10Jcrespo: Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 [12:25:02] (03CR) 10jerkins-bot: [V: 04-1] Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 (owner: 10Jcrespo) [12:25:48] 10Operations: mw2269 rebooted/crashed unexpectedly on Jul 17th ~15:30UTC - https://phabricator.wikimedia.org/T228296 (10Marostegui) 05Open→03Resolved I am calling this resolved as it's been almost 10 months there is no way to debug this anymore :( Maybe it was a one time thing, if it happens again we can reo... [12:26:53] 10Operations, 10Analytics, 10Traffic, 10User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (10Marostegui) It's been around 10 months since the last update, anything pending here? 
[12:27:15] (03PS2) 10Jcrespo: Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 [12:27:41] (03PS3) 10Jcrespo: Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 [12:28:04] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [12:28:11] (03CR) 10jerkins-bot: [V: 04-1] Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 (owner: 10Jcrespo) [12:31:26] (03CR) 10Dzahn: [C: 03+2] Decommission old ArcLamp HHVM pipeline [puppet] - 10https://gerrit.wikimedia.org/r/595602 (https://phabricator.wikimedia.org/T233884) (owner: 10Dave Pifke) [12:34:27] (03CR) 10Dzahn: "this was a complete noop on webperf1002/2002. maybe something needs to be manually restarted?" [puppet] - 10https://gerrit.wikimedia.org/r/595602 (https://phabricator.wikimedia.org/T233884) (owner: 10Dave Pifke) [12:34:39] (03CR) 10Arturo Borrero Gonzalez: "ping andrew. We can probably merge this now?" [puppet] - 10https://gerrit.wikimedia.org/r/522196 (https://phabricator.wikimedia.org/T227785) (owner: 10Andrew Bogott) [12:34:40] 10Operations, 10cloud-services-team: Failing puppet runs on labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T235819 (10Marostegui) 05Open→03Resolved I did a puppet run here and it worked fine. Resolving - re-open if necessary! 
[12:35:05] (03PS4) 10Jcrespo: Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 [12:36:06] (03CR) 10Jcrespo: [C: 03+2] Fix CI errors [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596199 (owner: 10Jcrespo) [12:36:47] (03CR) 10Jcrespo: "if you rebase on top of https://gerrit.wikimedia.org/r/596199 it should now vote +1 :-D" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [12:38:47] 10Operations, 10Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (10Marostegui) @ema is this still valid? [12:39:00] (03CR) 10Muehlenhoff: [C: 03+2] Point external debmonitor links to CAS-enabled Icinga/Puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/596163 (owner: 10Muehlenhoff) [12:43:57] (03PS7) 10Jcrespo: transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [12:44:24] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [12:44:54] (03PS1) 10Muehlenhoff: Rename squid3 class to squid [puppet] - 10https://gerrit.wikimedia.org/r/596201 [12:46:38] (03PS8) 10Privacybatm: transfer.py: Move Transferer, MariaDB logic and Firewall logic to its new module files [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) [12:50:43] 10Operations, 10Commons, 10MediaWiki-File-management, 10Thumbor, 10Traffic: 500, Internal Server Error on Commons for images at specified size - https://phabricator.wikimedia.org/T250211 (10Marostegui) 05Open→03Resolved This works now. 
I am going to consider this resolved, maybe it was a one time thi... [12:51:42] (03CR) 10Privacybatm: "> Patch Set 6:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [12:51:52] (03PS1) 10Jcrespo: wmfmariadbpy: Fix integration test running under a restricted env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596204 [12:52:19] (03CR) 10Jcrespo: [C: 03+2] wmfmariadbpy: Fix integration test running under a restricted env [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596204 (owner: 10Jcrespo) [12:52:52] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/22502/" [puppet] - 10https://gerrit.wikimedia.org/r/596201 (owner: 10Muehlenhoff) [12:53:10] 10Operations, 10Traffic: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Marostegui) [12:54:29] (03CR) 10Jcrespo: "Everything looks good, you can start rebasing on top of this- allow me some time to have lunch so I can do proper user-level testing and t" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595909 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [12:57:19] 10Operations, 10Traffic: Backport iproute2 4.x from debian testing -> our jessie - https://phabricator.wikimedia.org/T138591 (10Marostegui) 05Open→03Declined Declining per T138591#3853953 [12:58:09] 10Operations, 10SRE-swift-storage, 10Traffic: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648 (10Marostegui) 05Open→03Resolved There is no way we can debug this anymore after 4 years :) [12:59:48] 10Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 (10Marostegui) @MoritzMuehlenhoff any host left here? [13:00:04] hashar and twentyafterfour: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1300). [13:00:51] 10Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All jessie hosts run Linux 4.9 for a long time now, closing. [13:02:52] (03CR) 10Ppchelko: [C: 03+1] role::deployment_server: add changeprop-jobqueue [puppet] - 10https://gerrit.wikimedia.org/r/596183 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [13:03:02] 10Operations: pinentry-gtk2 pulls in a lot of unneeded Gnome/GTK libs - https://phabricator.wikimedia.org/T127054 (10MoritzMuehlenhoff) 05Open→03Declined This only affected jessie, which is going away and won't specifically get fixed there. [13:06:19] (03CR) 10Muehlenhoff: admin: add Daniele Rama to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/595990 (https://phabricator.wikimedia.org/T252129) (owner: 10Cwhite) [13:07:40] (03PS2) 10Vgutierrez: Release 8.0.7-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596179 (https://phabricator.wikimedia.org/T249335) [13:08:22] (03PS1) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) [13:13:35] (03CR) 10Marostegui: [C: 03+1] dbtools: point zarcillo database scripts to use db2093 [software] - 10https://gerrit.wikimedia.org/r/596197 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [13:19:17] 10Operations, 10ops-eqord, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10CDanis) At 07:50 UTC Telia responded stating that this was due to a planned maintenance PWIC110129, despite the circuit having been down for days already. The maintenance was schedul... 
[13:19:55] (03CR) 10Ottomata: "I didn't realize there was anything different in the source for buster and stretch. What was it?" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:21:46] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:11] (03CR) 10Ottomata: "K! krb1001 is a no-op. It is just showing the change catalog adding the param to the classes, but the templates are not changed." [puppet] - 10https://gerrit.wikimedia.org/r/595648 (https://phabricator.wikimedia.org/T251606) (owner: 10Ottomata) [13:23:39] 10Operations, 10Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (10ema) 05Open→03Resolved a:03ema >>! In T237360#6133331, @Marostegui wrote: > @ema is this still valid? We haven't found the cause, but there's certain... 
[13:24:13] (03CR) 10Elukey: "> I didn't realize there was anything different in the source for" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:25:24] (03CR) 10Ema: [C: 03+1] Release 8.0.7-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596179 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:25:45] (03CR) 10Elukey: "The error is:" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:26:05] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.7-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596179 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [13:29:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "The referenced task mentions Special:NewFiles (or any other page with lots of thumbnails to be rendered), I'm guessing the '50' queuing li" [puppet] - 10https://gerrit.wikimedia.org/r/596149 (https://phabricator.wikimedia.org/T252426) (owner: 10Gilles) [13:30:18] (03CR) 10Giuseppe Lavagetto: raw: Add apiVersion (helm lint), remove appVersion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [13:31:09] 10Operations, 10ops-eqord, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10CDanis) Replied to Telia: > Thanks, interfaces have been bounced on both ends. > > Here's the light levels on the Chicago side: > > cdanis@cr2-eqord> show interfaces diagnostics... [13:31:38] (03CR) 10Ottomata: "Ah, are you building the package for both stretch and buster? 
I think you can just build only for stretch (or buster) and then reprepro c" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:33:36] (03PS1) 10Filippo Giunchedi: WIP move out of swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) [13:34:19] (03CR) 10JMeybohm: raw: Add apiVersion (helm lint), remove appVersion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) (owner: 10JMeybohm) [13:34:23] (03CR) 10jerkins-bot: [V: 04-1] WIP move out of swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [13:35:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] systemd::timer::job: fix bug re: On(In)?ActiveUnitSec [puppet] - 10https://gerrit.wikimedia.org/r/551281 (owner: 10CDanis) [13:36:02] <_joe_> cdanis: ^^ [13:36:11] ahahah [13:36:25] ty _+joe_ [13:37:37] (03CR) 10Elukey: "> Ah, are you building the package for both stretch and buster? I" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:37:59] (03PS2) 10Filippo Giunchedi: WIP move out of swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) [13:38:31] (03CR) 10Ottomata: "Maybe? If there are this package wouldn't work. 
It is just a direct download of the binaries put into a .deb, and upstream doesn't have " [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:40:24] (03PS15) 10CDanis: prometheus: export NIC firmware versions [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [13:40:53] (03Abandoned) 10Bearloga: profile::product_analytics: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/503069 (owner: 10Bearloga) [13:41:25] (03CR) 10Elukey: "> Maybe? If there are this package wouldn't work. It is just a" [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:41:30] (03PS1) 10Mholloway: Wikifeeds: Update to 2020-05-13-132944-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/596220 (https://phabricator.wikimedia.org/T252422) [13:41:42] (03Abandoned) 10Elukey: Update version in changelog for Buster and update README [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/596142 (https://phabricator.wikimedia.org/T250161) (owner: 10Elukey) [13:42:24] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10leila) @Marostegui Thanks for your help here. I approve the request for access. 
[13:43:48] (03CR) 10Mholloway: [C: 03+2] Wikifeeds: Update to 2020-05-13-132944-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/596220 (https://phabricator.wikimedia.org/T252422) (owner: 10Mholloway) [13:43:57] (03PS3) 10Filippo Giunchedi: swift: move out of swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) [13:44:01] 10Operations, 10LDAP-Access-Requests: Add `dcipoletti` to `wmf` Access Group - https://phabricator.wikimedia.org/T252674 (10dcipoletti) [13:44:10] (03Merged) 10jenkins-bot: Wikifeeds: Update to 2020-05-13-132944-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/596220 (https://phabricator.wikimedia.org/T252422) (owner: 10Mholloway) [13:44:27] (03CR) 10Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler1001/22503/" [puppet] - 10https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [13:45:38] (03PS1) 10Bearloga: Decommission Maps & Search metrics legacy dashboards [puppet] - 10https://gerrit.wikimedia.org/r/596221 (https://phabricator.wikimedia.org/T252365) [13:45:51] (03PS2) 10Bearloga: Decommission Maps & Search metrics legacy dashboards [puppet] - 10https://gerrit.wikimedia.org/r/596221 (https://phabricator.wikimedia.org/T252365) [13:47:04] (03PS4) 10JMeybohm: raw: Add apiVersion (helm lint), remove appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) [13:48:34] (03CR) 10Bearloga: "Yay for sunsetting old, non-maintained things..." 
[puppet] - 10https://gerrit.wikimedia.org/r/596221 (https://phabricator.wikimedia.org/T252365) (owner: 10Bearloga) [13:51:56] 10Operations, 10ops-codfw, 10Analytics: furud mgmt interface is down - https://phabricator.wikimedia.org/T252616 (10Papaul) a:03Papaul [13:53:15] (03PS5) 10JMeybohm: raw: Add apiVersion (helm lint) [deployment-charts] - 10https://gerrit.wikimedia.org/r/596150 (https://phabricator.wikimedia.org/T252428) [13:55:42] (03PS1) 10CDanis: Revert "prepend {es,kn}ams" [homer/public] - 10https://gerrit.wikimedia.org/r/596222 [13:55:46] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:09] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10Marostegui) @diego as @Rvvalentim doesn't seem to have access to officewiki, can you verify that key belongs to them and can you paste it on your office wiki? Tha... [13:57:21] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:17] 10Operations: prometheus-intel-microcode not in line with what's actually loaded by the kernel - https://phabricator.wikimedia.org/T252676 (10MoritzMuehlenhoff) [14:00:22] 10Operations: prometheus-intel-microcode not in line with what's actually loaded by the kernel - https://phabricator.wikimedia.org/T252676 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:01:50] (03CR) 10CDanis: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [14:05:56] (03PS1) 10Elukey: cassandra: use openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/596223 [14:06:45] (03CR) 10Jcrespo: [C: 03+2] mariadb: Default monitor disk & process paging to false [puppet] - 10https://gerrit.wikimedia.org/r/595153 (https://phabricator.wikimedia.org/T172492) (owner: 10Jcrespo) [14:07:13] (03PS16) 10CDanis: prometheus: export NIC firmware versions [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [14:10:55] (03CR) 10Elukey: "Lovely no op: https://puppet-compiler.wmflabs.org/compiler1001/22505/" [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [14:12:38] (03CR) 10Muehlenhoff: "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [14:13:24] (03PS3) 10JMeybohm: termbox: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:13:29] (03PS17) 10CDanis: prometheus: export NIC firmware versions [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [14:15:33] (03CR) 10Elukey: cassandra: use openjdk-8 on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [14:17:03] (03PS18) 10CDanis: prometheus: export NIC firmware versions [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) [14:18:44] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Nuria) Approved on my end, we are missing the end date of the collaboration. 
[14:22:36] (03CR) 10CDanis: "Some sample output from a selection of hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [14:24:02] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10Nuria) This request needs an expiration date (when does access expires) and also the objective of the data access, is this another intern project? [14:24:24] (03PS4) 10JMeybohm: termbox: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:26:07] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10diego) @Marostegui here you have: https://office.wikimedia.org/wiki/User:Diego_(WMF)/internKeys [14:26:17] 10Operations, 10LDAP-Access-Requests: Add `dcipoletti` to `wmf` Access Group - https://phabricator.wikimedia.org/T252674 (10dr0ptp4kt) Approved [14:27:12] (03CR) 10CDanis: "Er, actual sample output from mw2163, after I didn't mangle it while pasting it into the comment block:" [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [14:28:30] (03CR) 10Ppchelko: [C: 04-1] "A couple of comments of various level of nitpickiness. Additionally, I think we should think about where to put this. Will these new files" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:28:32] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10diego) @Nuria this is for a 12-Week internship with the Research Team, working on the project "Exploration on content propagation across Wikimedia projects”. @R... 
[14:29:50] (03CR) 10JMeybohm: [C: 03+2] termbox: add TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/558093 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:29:55] (03CR) 10Muehlenhoff: [C: 03+1] cassandra: use openjdk-8 on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [14:34:25] (03PS1) 10Papaul: DHCP: Add db213[6-9] and db2140 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/596226 (https://phabricator.wikimedia.org/T251639) [14:35:19] !log upload trafficserver 8.0.7-1wm6 to apt.wm.o (buster) - T249335 T251537 [14:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:27] T251537: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 [14:35:27] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [14:41:24] (03CR) 10Papaul: [C: 03+2] DHCP: Add db213[6-9] and db2140 MAC address [puppet] - 10https://gerrit.wikimedia.org/r/596226 (https://phabricator.wikimedia.org/T251639) (owner: 10Papaul) [14:42:39] (03CR) 10Filippo Giunchedi: [C: 03+1] cassandra: use openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [14:43:47] 10Operations, 10SRE-swift-storage, 10Traffic: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648 (10Aklapper) 05Resolved→03Declined Translates to declined to me as nothing was actively resolved :) [14:43:51] (03CR) 10Hnowlan: [C: 03+1] cassandra: use openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [14:46:12] 10Operations, 10DBA, 10observability, 10Patch-For-Review: MySQL metrics monitoring - https://phabricator.wikimedia.org/T143896 (10jcrespo) [14:49:30] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - 
https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2136.codfw.wmnet ` The log... [14:49:39] (03PS1) 10JMeybohm: termbox: deploy up to date chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/596227 [14:50:13] !log filippo@deploy1001 Started deploy [librenms/librenms@0a88d64]: Upgrade LibreNMS to 1.63 T251222 [14:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:16] 10Operations, 10Traffic: cp5012 fails to boot after reimage: junk in compressed archive unpacking initramfs - https://phabricator.wikimedia.org/T237360 (10ema) 05Resolved→03Declined [14:50:17] T251222: Upgrade LibreNMS to 1.63 - https://phabricator.wikimedia.org/T251222 [14:50:23] !log filippo@deploy1001 Finished deploy [librenms/librenms@0a88d64]: Upgrade LibreNMS to 1.63 T251222 (duration: 00m 10s) [14:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:35] (03PS2) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) [14:51:39] (03CR) 10Hnowlan: Add tool and configuration for generating beta configuration from kubernetes (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [14:52:14] (03CR) 10Hnowlan: [C: 03+2] role::deployment_server: add changeprop-jobqueue [puppet] - 10https://gerrit.wikimedia.org/r/596183 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [14:52:19] (03PS1) 10Dzahn: maintenance: temp allow rsyncing home dir to miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596228 (https://phabricator.wikimedia.org/T243056) [14:53:30] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - 
https://phabricator.wikimedia.org/T251639 (10Papaul) ` [edit interfaces interface-range vlan-private1-a-codfw] member xe-2/0/3 { ... } + member ge-1/0/0; [edit interfaces... [14:53:47] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [14:54:58] <_joe_> !log upgrading + restarting purged across ulsfo and codfw T133821 [14:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:02] T133821: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 [14:55:48] (03PS4) 10Jcrespo: mariadb: Switchover zarcillo from db1115 to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) [14:56:03] (03PS5) 10Jcrespo: mariadb: Switchover zarcillo from db1115 to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) [14:57:29] PROBLEM - MariaDB Slave SQL: m1 on db2078 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1118, Errmsg: Error Row size too large. The maximum row size for the used table type, not counting BLOBs, is 8126. This includes storage overhead, check the manual. You have to change some columns to TEXT or BLOBs on query. Default database: librenms. 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshoo [14:57:29] a_slave [14:57:32] (03CR) 10Jcrespo: [C: 03+2] mariadb: Switchover zarcillo from db1115 to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/596189 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [14:58:41] oh [14:59:55] kormat: ^ 10.4 upgrade issue [15:00:45] 10Operations, 10Graphoid, 10serviceops, 10Core Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [15:00:59] (03CR) 10CDanis: [C: 03+1] swift: clean up 'storage_policies' from swift::params [puppet] - 10https://gerrit.wikimedia.org/r/596155 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [15:01:07] 10Operations, 10netops, 10observability, 10User-fgiunchedi: Upgrade LibreNMS to 1.63 - https://phabricator.wikimedia.org/T251222 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Upgraded! [15:01:56] ^ XioNoX faidon or someone else did you recently do a librenms upgrade? [15:02:00] jynus: +1 [15:02:01] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [15:02:03] (03CR) 10CDanis: [C: 03+1] swift: add ability to toggle WMF-specific filters [puppet] - 10https://gerrit.wikimedia.org/r/596177 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [15:02:12] actually looks like godog [15:02:23] he should reopen that ticket [15:02:51] what's the issue? [15:02:51] 10Operations, 10netops, 10observability, 10User-fgiunchedi: Upgrade LibreNMS to 1.63 - https://phabricator.wikimedia.org/T251222 (10jcrespo) 05Resolved→03Open ` [14:57] PROBLEM - MariaDB Slave SQL: m1 on db2078 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1118, Errmsg... 
[15:03:09] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/22506/mwmaint1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596228 (https://phabricator.wikimedia.org/T243056) (owner: 10Dzahn) [15:03:22] (03CR) 10CDanis: [C: 03+1] swift: move out of swift::params::accounts [puppet] - 10https://gerrit.wikimedia.org/r/596217 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [15:03:29] XioNoX: database didn't like the alter [15:03:42] let me finish a deploy I am in the middle of [15:03:48] sure, no rush [15:04:05] what "didn't like" mean, here? [15:04:18] the database broke in layman terms [15:04:47] I will put the error on the ticket [15:04:49] (03PS4) 10Herron: lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) [15:04:58] thx [15:05:03] uggh, thanks jynus [15:05:24] please give me some minutes and it is not down down [15:05:29] just down a little [15:05:40] I will be with you in a second [15:05:55] 10Operations, 10netops, 10observability, 10User-fgiunchedi: Upgrade LibreNMS to 1.63 - https://phabricator.wikimedia.org/T251222 (10jcrespo) ` Error 'Row size too large. The maximum row size for the used table type, not counting BLOBs, is 8126. This includes storage overhead, check the manual. You have to... [15:06:13] interesting, yeah I can use the ui so definitely not super broken [15:06:33] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:38] herron: do you really want auto_sync => true and permanently have it copy with a cron? 
or just a one-time sync [15:06:38] (03PS2) 10JMeybohm: termbox: deploy up to date chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/596227 (https://phabricator.wikimedia.org/T235411) [15:06:43] it may not be even an app issue, maybe just a db config issue [15:07:19] (03CR) 10Ayounsi: [C: 03+1] Revert "prepend {es,kn}ams" [homer/public] - 10https://gerrit.wikimedia.org/r/596222 (owner: 10CDanis) [15:07:22] jynus: afaict the schema migration was reported "OK" from librenms point of view, not sure if it means that it ignored the error or what [15:07:34] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [15:07:39] (03CR) 10RLazarus: [C: 03+1] Revert "prepend {es,kn}ams" [homer/public] - 10https://gerrit.wikimedia.org/r/596222 (owner: 10CDanis) [15:07:40] 10Operations, 10serviceops: Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (10akosiaris) [15:07:47] (03CR) 10CDanis: [C: 03+2] Revert "prepend {es,kn}ams" [homer/public] - 10https://gerrit.wikimedia.org/r/596222 (owner: 10CDanis) [15:07:51] mutante: I was planning on a permanent sync a) to keep the standby in sync until cutover and b) to support a codfw standby in the future [15:08:09] (03Merged) 10jenkins-bot: Revert "prepend {es,kn}ams" [homer/public] - 10https://gerrit.wikimedia.org/r/596222 (owner: 10CDanis) [15:08:17] PROBLEM - MariaDB Slave Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 834.51 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [15:08:26] herron: *nod* gotcha. 
code looks good [15:08:58] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:19] mutante: kk thx for looking at it [15:10:56] (03PS1) 10Ottomata: Allow kafka brokers in the to talk to eacn other's prometheus jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/596232 (https://phabricator.wikimedia.org/T252675) [15:11:02] herron: note the rsyncd will be installed on the source and the dest will pull from it [15:13:46] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2136.codfw.wmnet'] ` and were **ALL** successful. [15:13:56] (03CR) 10Dzahn: [C: 03+1] "lgtm https://puppet-compiler.wmflabs.org/compiler1002/22507/fermium.wikimedia.org/index.html (just can't compile on lists1001 because com" [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [15:14:59] (03CR) 10Ottomata: "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/596232 (https://phabricator.wikimedia.org/T252675) (owner: 10Ottomata) [15:15:27] jynus: anything I can/should do to help? [15:15:27] jbond42: just had another case of making a ferm change and i can confirm the issue is gone. you fixed it. thanks [15:16:21] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2137.codfw.wmnet ` The log can be found in `/var... [15:17:42] I am trying to find why the error happened [15:18:11] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10leila) >>! 
In T252476#6133626, @Nuria wrote: > is this another intern project? Yes. [15:18:23] thanks [15:18:46] fwiw I believe this is the upstream commit: https://github.com/librenms/librenms/commit/7dfb4ef1df290fa0afbc4455295c71138a27792c [15:19:00] the alter is "normal" [15:19:11] and the table doesn't have anything strange [15:19:32] so I am a bit lost [15:19:49] does librenms have a list of supported db versions? [15:20:20] not sure, checking [15:20:58] the deploy worked, so I can finally focus on librenms [15:21:09] I didn't find any mentions of supported versions [15:21:11] sorry, but it was one that was not easy [15:21:34] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2138.codfw.wmnet ` The log can be found in `/var... [15:22:38] no worries, but yeah I can't find anything specific either [15:23:08] I was thinking maybe a misconfiguration [15:23:29] but both dbs have the same one regarding innodb file format and row format [15:23:43] (03PS1) 10Dzahn: maintenance: also rsync codereview files to codfw miscweb [puppet] - 10https://gerrit.wikimedia.org/r/596235 (https://phabricator.wikimedia.org/T243056) [15:23:48] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[15:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:11] oh [15:24:13] wait [15:24:51] uf, if this is it, librenms really needs some extra docs, because it will break on many people's systems [15:25:04] good news it should be trivial to fix for us [15:25:07] (03PS5) 10Herron: lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) [15:25:36] 10Operations, 10SRE-Access-Requests: Give access to the Analytics Cluster to Research Inter (Rodolfo) - https://phabricator.wikimedia.org/T252476 (10Nuria) Approved on my end, let's please provide end date for access. [15:25:59] nope, compact row format didn't work [15:26:16] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:19] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) @Marostegui please check is this looks good on db2136 ` Disk /dev/sda: 8.7 TiB, 9598580817920 bytes, 18747228160 sectors Disk model: PERC H730P A... [15:26:31] to make sure I understand, the alter did work on at least some databases but not others or didn't work at all ? 
[15:26:44] it worked on the master, but not on the replica [15:27:02] but the replica has the latest version- which will eventually be set up on the master too [15:27:18] so we should make it work- otherwise it is just a time bomb [15:27:34] agree [15:27:36] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [15:28:02] (03PS1) 10Hnowlan: ci::master: changeprop-jobqueue definitions [puppet] - 10https://gerrit.wikimedia.org/r/596236 (https://phabricator.wikimedia.org/T220399) [15:28:04] legoktm: in the dump tarball the files are called 1.html without the leading "r". should we rename all the files or change the link in index.html ? [15:28:19] (03PS1) 10Ema: 5.1.3-1wm15: add 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) [15:28:27] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/22509/" [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [15:29:13] godog: aside from the most obvious "you stop having redundancy on librenms" [15:29:21] !log imported scap 3.14.0-1 to main for stretch-wikimedia [15:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10colewhite) @Miriam, do you happen to have the end date for this collaboration available?
[15:29:35] there, and in all m1 services [15:29:45] !log Manually de-pooling `wdqs1008.eqiad.wmnet` in preparation for wdqs data transfer [15:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:06] !log imported scap 3.14.0-1 to main for jessie-wikimedia [15:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:10] (03CR) 10Herron: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [15:30:22] (03CR) 10Ppchelko: [C: 03+1] ci::master: changeprop-jobqueue definitions [puppet] - 10https://gerrit.wikimedia.org/r/596236 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:30:25] !log imported scap 3.14.0-1 to main for buster-wikimedia [15:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:46] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [15:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:59] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10Joe) I took a brief peek at what flows to systemd from php-fpm on dbus: ` $ sudo dbus-monitor --system path='/org/freedesktop/systemd1/unit/php7_2e2_2dfpm_2eservice ......
[15:31:09] (03CR) 10Hnowlan: [C: 03+2] ci::master: changeprop-jobqueue definitions [puppet] - 10https://gerrit.wikimedia.org/r/596236 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [15:31:14] (03CR) 10CRusnov: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/595991 (owner: 10CRusnov) [15:31:30] (03CR) 10Ppchelko: Add tool and configuration for generating beta configuration from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:31:51] jynus: yeah not great :| let me know if/how I can help [15:32:08] !log upgrade ats to version 8.0.7-1wm7 on cp4032 [15:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:21] librenms upstream is responsive too if we need help or open issues etc [15:32:23] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:21] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [15:33:36] godog: I think I may have it [15:33:36] RECOVERY - MariaDB Slave SQL: m1 on db2078 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [15:34:09] ALTER TABLE ports ENGINE=Innodb row_format=dynamic; [15:34:54] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:57] interesting [15:34:58] mutante: thx yes rsyncd on the primary is what I'm expecting too. I just made a small change to the patch, does that still look good to you? 
[15:35:03] row_format=dynamic is "the right config", but old installations will not have that at all [15:35:25] that will break a lot of installs that were on previous mysql/mariadb versions [15:35:38] we had the good config, I think, but it required a table reconstruction [15:36:17] once replication catches up, I will create a ticket to consider rebuilding all librenms tables to prevent further issues [15:36:34] RECOVERY - MariaDB Slave Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [15:37:04] godog: unless upgrade took hours you can close the ticket now [15:37:33] jynus: no the upgrade was quite quick [15:38:00] jynus: anything we should mention to upstream with respect to this issue you think ? [15:38:21] well, let me create the ticket internally first to fully understand the issue [15:38:37] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:54] sounds good [15:39:01] 10Operations, 10netops, 10observability, 10User-fgiunchedi: Upgrade LibreNMS to 1.63 - https://phabricator.wikimedia.org/T251222 (10fgiunchedi) 05Open→03Resolved Resolving as @jcrespo fixed the issue and will be following up with a separate task [15:39:06] herron: yea, looks good.
i think the check is not even needed (inside quickdatacopy it already has both "if $source_host == $::fqdn" and "if $dest_host == $::fqdn), but it also won't hurt, so +1 [15:39:17] godog: but in terms of documentation i would suggest to create better docs regarding dependencies [15:39:24] it is ok if they don't support older installs [15:39:33] but it should be clarified [15:39:39] (03CR) 10Dzahn: [C: 03+1] lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [15:40:37] godog: most likely something like "We require mysql > XX or mariadb > XX". If upgrading on this version from an older one, you should rebuild your tables (or have rebuilding as part of the upgrade process) [15:40:43] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2137.codfw.wmnet'] ` and were **ALL** successful. [15:41:07] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2139.codfw.wmnet ` The log can be found in `/var... [15:41:12] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:37] jynus: ack, thanks! 
yeah that makes sense [15:43:41] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10Papaul) [15:44:14] godog: it is good that this error happened- it meant that our monitoring works :-D [15:45:06] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [15:45:10] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [15:45:10] indeed, much better than broken altogether and we didn't know about it [15:45:11] literally minutes after I resolved T237927 [15:45:11] T237927: Add replication lag (and other checks) to misc all hosts - https://phabricator.wikimedia.org/T237927 [15:45:47] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm15: add 0037-force-discard.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [15:45:49] (03PS1) 10Vgutierrez: trafficserver_exporter: Track throttled active connections [puppet] - 10https://gerrit.wikimedia.org/r/596238 (https://phabricator.wikimedia.org/T249335) [15:45:56] sigh, the rsyslog alert might be due to the gnutls listener, should self resolve in a bit [15:46:03] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2138.codfw.wmnet'] ` and were **ALL** successful. [15:46:18] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [15:46:20] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [15:47:22] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2140.codfw.wmnet ` The log can be found in `/var... [15:48:04] (03PS1) 10Herron: prometheus::class_config: use FQDN by default [puppet] - 10https://gerrit.wikimedia.org/r/596239 [15:49:30] (03CR) 10Ema: [C: 03+1] trafficserver_exporter: Track throttled active connections [puppet] - 10https://gerrit.wikimedia.org/r/596238 (https://phabricator.wikimedia.org/T249335) (owner: 10Vgutierrez) [15:50:53] (03PS1) 10Dzahn: static-codereview: do not allow directory listing for subdirs [puppet] - 10https://gerrit.wikimedia.org/r/596240 (https://phabricator.wikimedia.org/T243056) [15:51:16] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Access to analytics-privatedata-users for Research intern Daniram - https://phabricator.wikimedia.org/T252129 (10Miriam) Yes, thanks @Nuria and @colewhite. @Daniram3's internship end date is July 26th. Many thanks! 
[15:53:38] (03PS2) 10Cwhite: admin: add Daniele Rama to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/595990 (https://phabricator.wikimedia.org/T252129) [15:54:00] (03CR) 10Jcrespo: [C: 03+2] dbtools: point zarcillo database scripts to use db2093 [software] - 10https://gerrit.wikimedia.org/r/596197 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [15:58:10] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [15:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:20] (03PS1) 10Jcrespo: mariadb-backups: switch monitoring of tendril/zarcillo backups to codfw [puppet] - 10https://gerrit.wikimedia.org/r/596242 (https://phabricator.wikimedia.org/T138562) [15:59:42] (03PS2) 10Dzahn: static-codereview: do not allow directory listing for subdirs [puppet] - 10https://gerrit.wikimedia.org/r/596240 (https://phabricator.wikimedia.org/T243056) [16:00:45] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:33] (03PS1) 10Jcrespo: admin: Update alias to zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/596243 [16:01:54] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: switch monitoring of tendril/zarcillo backups to codfw [puppet] - 10https://gerrit.wikimedia.org/r/596242 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [16:02:39] (03CR) 10Jcrespo: [C: 03+2] admin: Update alias to zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/596243 (owner: 10Jcrespo) [16:03:29] (03PS1) 10Hnowlan: namespace: add changeprop-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/596244 (https://phabricator.wikimedia.org/T220399) [16:04:20] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2140.codfw.wmnet'] ` Of 
which those **FAILED**: ` ['db2140.codfw.wmnet'] ` [16:04:47] (03CR) 10Elukey: [C: 03+1] "LGTM, from Pcc it seems that other hosts using the jmx exporter are not affected as expected:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596232 (https://phabricator.wikimedia.org/T252675) (owner: 10Ottomata) [16:06:34] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2139.codfw.wmnet'] ` and were **ALL** successful. [16:08:51] James_F: Hi, around for group0 and perhaps group1? [16:09:25] Daimona: Let's do group0 and rest for a bit; I have meetings. [16:10:18] Sure, thank you [16:11:00] !log Running AbuseFilter updateVarDumps on group0 on mwmaint1002 T246539 [16:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:05] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [16:14:53] Taking a fair while. [16:17:56] Not surprising [16:18:25] 200000 rows on mediawikiwiki [16:18:45] 10Operations, 10ops-codfw: Degraded RAID on db2138 - https://phabricator.wikimedia.org/T252687 (10ops-monitoring-bot) [16:19:01] Daimona: Yeah, now half way done. [16:19:38] Alright, so approximately 20 mins... Not too bad after all [16:23:28] 10Operations, 10ops-codfw: Degraded RAID on db2137 - https://phabricator.wikimedia.org/T252688 (10ops-monitoring-bot) [16:25:28] Daimona: A minute per 10k rows means the enwiki 20M run will run for over a day. [16:25:40] (As well as adding quite a lot of load.) [16:26:55] Yeah, I was thinking about that [16:27:08] It's like 30 hours [16:27:44] Perhaps we could try increasing the batch size [16:30:47] I should probably ask DBAs [16:31:01] Done. [16:32:16] Or we could sleep for 100s or whatever between batches and just run it for a month? [16:32:28] Aye, thanks! 
[16:33:17] It's a possibility, it will certainly reduce the pressure on Load Balancers, but the server will be busy running the PHP script [16:33:17] (03CR) 10Ottomata: Allow kafka brokers in the to talk to eacn other's prometheus jmx exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596232 (https://phabricator.wikimedia.org/T252675) (owner: 10Ottomata) [16:34:07] It can be cleanly restarted. [16:34:26] Perhaps have something spawn the script and let it run for a while [16:34:33] (03PS2) 10Ottomata: Allow kafka brokers in the to talk to eacn other's prometheus jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/596232 (https://phabricator.wikimedia.org/T252675) [16:34:35] Well, ish. It does the full scan first before doing anything. [16:34:40] But it should be fine. [16:34:54] I'd have to check but I think it's possible to stop and restart it [16:35:31] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [16:37:17] (03CR) 10Ottomata: [C: 03+2] Allow kafka brokers in the to talk to eacn other's prometheus jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/596232 (https://phabricator.wikimedia.org/T252675) (owner: 10Ottomata) [16:38:25] (03CR) 10Ppchelko: [C: 03+2] namespace: add changeprop-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/596244 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:38:54] I think it can be stopped, the stages are independent [16:38:57] (03Merged) 10jenkins-bot: namespace: add changeprop-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/596244 (https://phabricator.wikimedia.org/T220399) (owner: 10Hnowlan) [16:39:34] For now I'm going to 
open a task for DBAs to chime in [16:40:40] (03CR) 10Eevans: [C: 03+1] cassandra: use openjdk-8 on Buster [puppet] - 10https://gerrit.wikimedia.org/r/596223 (owner: 10Elukey) [16:53:34] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10minhhuy) [16:54:20] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [16:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:52] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [16:56:58] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10thcipriani) [16:57:46] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [16:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:33] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10jcrespo) Hi, @minhhuy you still have control of the email account associated with that list, right? I can force a password reset for you. I suggest if you could find a trusted p... [17:00:13] (03CR) 10Krinkle: [C: 04-1] "Pending T247028. It seems something is very likely regressed in the refactor. Let's find that out before we find out other things?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [17:01:46] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10minhhuy) hello @jcrespo : yes, I still using and accessing to my email linked for that list admin account. If you need verification, you can send a email for that admin list, and... [17:04:21] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:27] PROBLEM - cassandra-c CQL 10.192.48.123:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.123 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:04:57] PROBLEM - cassandra-c service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:05:03] PROBLEM - cassandra-c SSL 10.192.48.123:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [17:05:25] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:39] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10jcrespo) > no one willing to handle this list I understand 100% it can be difficult if the community is small, but I had to ask :-D. Will try to reset it and report back in a s... 
[17:05:41] urandom: ^^ [17:05:42] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10jcrespo) p:05Triage→03Medium a:03jcrespo [17:07:30] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Cmjohnson) The issues on asw-d-eqiad listed above have been taken care of. there are more because of the mw's that need to be decom'd. I did not see a decommission task for them. [17:08:16] elukey: if by any chance you're still around lmk [17:08:41] RECOVERY - cassandra-c service on restbase2017 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:08:53] Pchelolo: coming back up [17:09:07] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:03] RECOVERY - cassandra-c CQL 10.192.48.123:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.123 port 9042 https://phabricator.wikimedia.org/T93886 [17:10:37] RECOVERY - cassandra-c SSL 10.192.48.123:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-c valid until 2020-11-29 09:26:19 +0000 (expires in 199 days) https://phabricator.wikimedia.org/T120662 [17:12:10] !log restarted cassandra-c, restbase2017 [17:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:31] (03PS1) 10Hnowlan: changeprop: default all feature toggles to on [deployment-charts] - 10https://gerrit.wikimedia.org/r/596250 (https://phabricator.wikimedia.org/T248677) [17:18:45] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:59] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 6784 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:20:23] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10jcrespo) I've reset it already, let me know when you receive it and change the password to something else (mail is not a very secure method of sending passwords). [17:21:07] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [17:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] (03CR) 10Ppchelko: [C: 03+2] "Wooop wooop!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/596250 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:23:05] PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 6928 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:23:21] (03Merged) 10jenkins-bot: changeprop: default all feature toggles to on [deployment-charts] - 10https://gerrit.wikimedia.org/r/596250 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:23:49] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10minhhuy) hello @jcrespo : I received your email, login successfully and already changed to another password. Really thanks for your quick help. I will appoint another admin as so... 
[17:23:54] (03PS1) 10Jcrespo: mariadb-backups: Allow zarcillo as a valid backup section [puppet] - 10https://gerrit.wikimedia.org/r/596251 (https://phabricator.wikimedia.org/T138562) [17:24:00] !log Manually depooled wdqs2005 while lag catches up following the data xfer [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:32] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Allow zarcillo as a valid backup section [puppet] - 10https://gerrit.wikimedia.org/r/596251 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [17:27:13] 10Operations, 10Wikimedia-Mailing-lists: Lost password for wikimedia-vn mailling list - https://phabricator.wikimedia.org/T252698 (10jcrespo) 05Open→03Resolved Happy to be helpful. Have a nice day! [17:28:23] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [17:29:45] 10Operations, 10ops-eqiad, 10decommission: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (10Cmjohnson) 05Open→03Resolved This server is no longer in the rack and all switch cfg has been removed. Netbox does not show the server either. I found it on the list of the s... [17:32:45] (03PS1) 10Hnowlan: changeprop: fix syntax issue with dt updating. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/596253 (https://phabricator.wikimedia.org/T248677) [17:33:06] (03PS15) 10Thcipriani: phabricator: add envoy TLS terminator for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/587233 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [17:38:52] 10Operations, 10DBA, 10Patch-For-Review: Implement logic to be able to perform full and incremental backups of ES hosts - https://phabricator.wikimedia.org/T244884 (10jcrespo) a:05jcrespo→03None [17:42:41] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:12] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) [17:45:04] (03CR) 10Bstorm: [C: 03+2] toolforge-kubeadm: calico upgrade changes [puppet] - 10https://gerrit.wikimedia.org/r/596012 (https://phabricator.wikimedia.org/T250863) (owner: 10Bstorm) [17:45:07] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) 05Open→03Resolved @Marostegui complete [17:46:20] (03PS1) 10Jcrespo: backups: Add backup1002 as the eqiad host for ES db backups [puppet] - 10https://gerrit.wikimedia.org/r/596255 (https://phabricator.wikimedia.org/T79922) [17:46:30] (03CR) 10Bstorm: [C: 03+2] puppet-facts-export-puppetdb: Read localcacert from right section [puppet] - 10https://gerrit.wikimedia.org/r/596069 (https://phabricator.wikimedia.org/T252606) (owner: 10Alex Monk) [17:47:20] (03CR) 10Ppchelko: [C: 03+2] changeprop: fix syntax issue with dt updating. [deployment-charts] - 10https://gerrit.wikimedia.org/r/596253 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:47:45] (03Merged) 10jenkins-bot: changeprop: fix syntax issue with dt updating. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/596253 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [17:47:51] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Hardware): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Cmjohnson) [17:48:09] 10Operations, 10ops-eqiad, 10Data-Services, 10decommission, 10cloud-services-team (Hardware): Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10Cmjohnson) 05Open→03Resolved Removed all of these of the racks, resolving this task [17:48:37] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [17:49:10] (03PS2) 10Jdlrobson: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) [17:49:16] 10Operations, 10ops-eqiad: Decommission or repair old asw-c2-eqiad - https://phabricator.wikimedia.org/T156398 (10Cmjohnson) 05Open→03Resolved This switch is broken, off the rack. [17:50:02] (03CR) 10jerkins-bot: [V: 04-1] Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) (owner: 10Jdlrobson) [17:51:34] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:14] (03PS3) 10Jdlrobson: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) [17:52:34] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:03] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[17:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:29] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:13] (03PS4) 10Jdlrobson: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) [17:56:22] (03CR) 10jerkins-bot: [V: 04-1] Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) (owner: 10Jdlrobson) [17:57:03] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10Cmjohnson) 05Open→03Resolved removed from rack, verified DNS, and switch ports were removed. [17:58:20] (03PS3) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [17:58:42] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [18:00:04] hashar and twentyafterfour: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1800). Please do the needful. [18:00:05] Jdlrobson: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:39] (03PS5) 10Jdlrobson: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) [18:00:43] here [18:01:19] Hi Jdlrobson , I can SWAT for you! [18:01:45] thanks Urbanecm ! [18:04:10] (03CR) 10Urbanecm: [C: 04-1] Update production wordmarks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) (owner: 10Jdlrobson) [18:04:49] (03CR) 10Herron: [C: 03+2] lists: add concept of primary and standby host, and rsync prmry -> stndby [puppet] - 10https://gerrit.wikimedia.org/r/594801 (https://phabricator.wikimedia.org/T250032) (owner: 10Herron) [18:04:57] Jdlrobson: I've left an ordering note, to keep the file somehow-arranged :) [18:06:25] Urbanecm: looking [18:07:50] (03PS4) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [18:08:37] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10soworu) [18:09:00] (03PS6) 10Jdlrobson: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) [18:09:02] thanks Urbanecm for catching that. new patch is up [18:09:10] thanks, looking [18:09:37] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) (owner: 10Jdlrobson) [18:09:40] LGTM [18:09:40] 10Operations, 10Traffic, 10Performance-Team (Radar): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (10dpifke) Weird. 
It was working yesterday (I verified that new data was appearing with the correct labels), but is now hanging for me as well. I'll inves... [18:10:27] (03Merged) 10jenkins-bot: Update production wordmarks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595014 (https://phabricator.wikimedia.org/T252143) (owner: 10Jdlrobson) [18:12:06] Jdlrobson: pulled onto mwdebug1001, please check [18:12:26] (03PS1) 10Cmjohnson: removing old ips link to decom nodes lvs10[10-12] [dns] - 10https://gerrit.wikimedia.org/r/596257 (https://phabricator.wikimedia.org/T208586) [18:12:55] (03PS2) 10Cmjohnson: removing old ips link to decom nodes lvs10[10-12] [dns] - 10https://gerrit.wikimedia.org/r/596257 (https://phabricator.wikimedia.org/T208586) [18:13:21] testing Urbanecm [18:13:56] (03CR) 10Cmjohnson: [C: 03+2] removing old ips link to decom nodes lvs10[10-12] [dns] - 10https://gerrit.wikimedia.org/r/596257 (https://phabricator.wikimedia.org/T208586) (owner: 10Cmjohnson) [18:15:11] LGTM Urbanecm [18:15:14] feel free to sync! [18:15:20] thanks, syncing [18:16:19] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission lvs1007-1012 - https://phabricator.wikimedia.org/T208586 (10Cmjohnson) 05Open→03Resolved lvs10[10-12] still had network port description. removed a few old lvs links in wmnet file. resolving task. Removed all se... 
[18:16:22] 10Operations, 10ops-eqiad, 10decommission, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10Cmjohnson) [18:16:25] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10Cmjohnson) [18:17:15] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: SWAT: 38db3e0: Update production wordmarks (T252143) (duration: 01m 09s) [18:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:18] T252143: Update existing outdated wordmarks - https://phabricator.wikimedia.org/T252143 [18:17:22] Jdlrobson: done [18:21:14] 10Operations, 10ops-eqiad, 10Traffic, 10decommission: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10Cmjohnson) 05Open→03Resolved no sign of any entries for these servers, they have already been sold. [18:21:18] thanks Urbanecm looks great! [18:21:18] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10Cmjohnson) [18:21:21] 10Operations, 10ops-eqiad, 10decommission, 10netops: Decommission asw-c-eqiad - https://phabricator.wikimedia.org/T208734 (10Cmjohnson) [18:21:25] happy to help! 
[18:21:43] 10Operations, 10ops-eqiad: Decommission brokenasw-c2-eqiad - https://phabricator.wikimedia.org/T211998 (10Cmjohnson) 05Open→03Resolved [18:22:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 38db3e0: Update production wordmarks (T252143) (duration: 01m 07s) [18:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:01] T252143: Update existing outdated wordmarks - https://phabricator.wikimedia.org/T252143 [18:25:03] (03PS1) 10Herron: lists: don't monitor mailman procs on standby_host [puppet] - 10https://gerrit.wikimedia.org/r/596259 (https://phabricator.wikimedia.org/T252615) [18:25:13] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (10Cmjohnson) [18:25:51] 10Operations, 10ops-eqiad, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Cmjohnson) 05Open→03Resolved No dns or switch cfg exists, records indicate this server was removed and sold. 
[18:28:33] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T252705 (10soworu) [18:31:17] (03PS1) 10Cmjohnson: Removing mgmt dns entries for dbproxy1011 [dns] - 10https://gerrit.wikimedia.org/r/596261 (https://phabricator.wikimedia.org/T249590) [18:31:18] mutante: great and thanks [18:31:33] (03PS2) 10Cmjohnson: Removing mgmt dns entries for dbproxy1011 [dns] - 10https://gerrit.wikimedia.org/r/596261 (https://phabricator.wikimedia.org/T249590) [18:32:17] (03CR) 10Cmjohnson: [C: 03+2] Removing mgmt dns entries for dbproxy1011 [dns] - 10https://gerrit.wikimedia.org/r/596261 (https://phabricator.wikimedia.org/T249590) (owner: 10Cmjohnson) [18:33:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (10Cmjohnson) [18:33:41] 10Operations, 10DBA, 10Data-Services: Replace labsdb (wikireplicas) dbproxies: dbproxy1010 and dbproxy1011 - https://phabricator.wikimedia.org/T231520 (10Cmjohnson) [18:33:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1011.eqiad.wmnet - https://phabricator.wikimedia.org/T249590 (10Cmjohnson) 05Open→03Resolved Server is out of the rack [18:37:53] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10Nuria) @soworu please have your manager approve this request [18:39:05] 10Operations, 10ops-eqiad, 10decommission, 10User-jijiki: Decommission rdb1001, rdb1002, rdb1003, rdb1004, rdb1007, rdb1008 - https://phabricator.wikimedia.org/T209181 (10Cmjohnson) 05Open→03Resolved verified all servers are gone, they are on the list that was sold already. 
[18:40:07] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10Cmjohnson) [18:41:18] (03CR) 10Herron: [C: 03+1] "PCC LGTM https://puppet-compiler.wmflabs.org/compiler1003/22512/" [puppet] - 10https://gerrit.wikimedia.org/r/596259 (https://phabricator.wikimedia.org/T252615) (owner: 10Herron) [18:41:43] 10Operations, 10Analytics, 10Analytics-Kanban, 10LDAP-Access-Requests: LDAP access to the wmf group for Segun Oworu (superset, turnilo, hue) - https://phabricator.wikimedia.org/T252703 (10ahemmer) Hi @Nuria Approved for @soworu. Thank you! [18:42:26] mutante: do you have a min for another lists patch sanity check? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/596259/ [18:44:27] (03PS1) 10Joal: Bump AQS druid snapshot to 2020_04 [puppet] - 10https://gerrit.wikimedia.org/r/596263 [18:44:33] 10Operations, 10ops-eqiad, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10Cmjohnson) all dns and network switch records have been removed, servers have already been sold [18:44:41] 10Operations, 10ops-eqiad, 10decommission: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099 - https://phabricator.wikimedia.org/T229586 (10Cmjohnson) 05Open→03Resolved [18:46:13] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [18:49:46] phab and gerrit are timing out for a few of us [18:49:52] Same here :/ [18:49:52] me too! 
[18:50:05] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [18:50:25] !log restarted kafka broker on kafka-main1001 for java security updates [18:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:31] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [18:50:41] same here, gerrit need a kick? [18:51:14] tyler's looking at gerrit [18:51:32] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Phabricator [18:51:33] kk [18:51:52] here to help if needed [18:51:57] phabricator server looks ok to me [18:51:59] same [18:52:04] o/ [18:52:17] is it a traffic routing issue? [18:52:18] o/ what's goin on [18:52:18] * jbond42 here [18:52:19] <_joe_> I can't see the ite though [18:52:21] phab not loading for me [18:52:45] <_joe_> ok, enough people around. Call me if I'm needed [18:52:49] !log restarting apache on phab1001 for lack of a better idea [18:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:28] I don't understand why phab and gerrit would both go down together [18:53:44] !log restarting gerrit [18:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:58] something at the cache layer maybe? [18:53:58] * jbond42 phab back for me [18:54:14] !log restarted php-fpm on phab1001 [18:54:15] it was returning a 503 for me, maybe because of the reboot [18:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:21] it's loading now [18:54:34] still not loading for me? [18:55:07] ^ does that correlate with DCs? 
it's loading via eqiad for me [18:55:08] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 37158 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Phabricator [18:55:19] cache DC I mean [18:55:24] apachectl status on phab1001 looks like requests were slowly backing up in W state [18:55:32] up here now, eqiad [18:55:34] but cleared presumably now that gerrit is alive again [18:55:35] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 864 bytes in 0.036 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [18:55:45] so phabricator can get backed up by requests to gerrit when gerrit goes down [18:55:56] tyler restarted gerrit which finally fixed things [18:56:00] herron's thing sounds plausible, ignore mine [18:56:01] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27676 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [18:56:06] * twentyafterfour was in a meeting with tyler when this happened [18:56:07] <_joe_> so [18:56:26] <_joe_> the solution is to make requests from phab to gerrit have a shorter timeout [18:56:30] yes i think there was a previous issue where a gerrit outage caused phab to go down [18:56:42] herron: right. I need to shorten the timeout on the gerrit patch list in phabricator tasks [18:56:51] _joe_: exactly [18:57:04] sounds like a plan [18:57:10] <_joe_> rzl: I think phab has already an envoy installed, if we want to get fancy. [18:57:45] <_joe_> but for now making that timeout something reasonable like 2 seconds might help [18:59:00] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455 (10Cmjohnson) Checked everything and all entries removed, server is off rack, removed the storage array attached to it and set to offline and piled for sale. 
[18:59:05] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455 (10Cmjohnson) 05Open→03Resolved [18:59:07] 10Operations, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10Cmjohnson) [18:59:36] _joe_: making it 2 seconds. that seems like a good place to start [18:59:40] it was set to 10 [18:59:53] (03PS1) 10Bstorm: dumpsdistribution: quiet the load alerts a bit [puppet] - 10https://gerrit.wikimedia.org/r/596268 [19:00:05] hashar and twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T1900). [19:00:27] there was indeed a pile-up of busy apache workers on phab until things collapsed: https://w.wiki/Qkg [19:00:38] (if that serves you the grafana front page, make sure you are logged in) [19:00:54] that was bizarre from the gerrit side: plenty of heap, no unusual traffic load. Nothing in the apache error log either :\ [19:01:14] _joe_: yeah let's do the easy thing first, and worry about envoy later if it's still worth it then [19:01:23] I guess I should hold on running the train shouldn't i ?:) [19:01:56] what's the elevator explanation of how envoy helps this case? [19:01:59] 10Operations, 10DC-Ops, 10decommission, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821 (10wiki_willy) [19:02:29] hashar: the outage is over with [19:02:33] gerrit1001 had a drop of tcp timewait and a slow rise of tcp inuse [19:03:35] herron: making a guess possibly the suggestion was to tweak the timeout values in envoy [19:03:41] gerrit cpu load was low but maybe something was keeping idle connections open? 
[19:03:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission dbproxy1002.eqiad.wmnet - https://phabricator.wikimedia.org/T245384 (10Cmjohnson) 05Open→03Resolved Removed server from rack, updated netbox [19:04:06] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Cmjohnson) [19:04:16] does our backup gerrit server support the search query rest api? [19:04:35] the replica I mean [19:04:42] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission dbproxy1001.eqiad.wmnet - https://phabricator.wikimedia.org/T244463 (10Cmjohnson) 05Open→03Resolved verified all entries are gone, removed from rack, updated netbox to offline [19:04:55] twentyafterfour: nop :-( there is no secondary index maintained there [19:05:04] iirc [19:05:13] doing the train [19:05:32] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=gerrit1001&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&from=1589395269631&to=1589396563044 [19:05:42] this almost looks like something on gerrit1001 got OOM-killed [19:05:49] (03PS1) 10Hashar: group1 wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596270 [19:05:51] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596270 (owner: 10Hashar) [19:06:12] but none of that in syslog [19:06:30] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.32 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596270 (owner: 10Hashar) [19:06:38] twentyafterfour: https://phabricator.wikimedia.org/T235251 has the gerrit answer ;) [19:07:01] cdanis: the huge drop in memory is me restarting gerrit, likely [19:07:09] ah, of course [19:07:34] @hashar let me know if I caused any more UBNs with my Revision work [19:07:35] socket utilization that hashar pointed out is interesting [19:07:46] yeah [19:08:19] 10Operations, 
10ops-eqiad, 10decommission, 10Patch-For-Review: decommission dbproxy1007.eqiad.wmnet - https://phabricator.wikimedia.org/T245385 (10Cmjohnson) 05Open→03Resolved updated netbox, removed from rack. [19:08:49] the network logs in logstash might give some clue ( https://logstash.wikimedia.org/app/kibana#/dashboard/6bcd2a10-7d21-11e7-86fb-51c84229aeb7 ) [19:08:53] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.32 [19:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:15] DannyS712: hi :) i will let you know [19:09:59] !log hashar@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.32 (duration: 01m 05s) [19:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:52] (03PS1) 10Cmjohnson: Removing and old dns entry for decom host labsdb1002 [dns] - 10https://gerrit.wikimedia.org/r/596271 (https://phabricator.wikimedia.org/T146455) [19:10:54] (03PS1) 10Cmjohnson: Remove mgmt dns entries for frack host bismuth (decom'd) [dns] - 10https://gerrit.wikimedia.org/r/596272 (https://phabricator.wikimedia.org/T248516) [19:11:36] no mediawiki log spam yet ;) [19:12:01] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Cmjohnson) [19:12:13] (03PS2) 10Cmjohnson: Removing and old dns entry for decom host labsdb1002 [dns] - 10https://gerrit.wikimedia.org/r/596271 (https://phabricator.wikimedia.org/T146455) [19:13:49] (03CR) 10Cmjohnson: [C: 03+2] Removing and old dns entry for decom host labsdb1002 [dns] - 10https://gerrit.wikimedia.org/r/596271 (https://phabricator.wikimedia.org/T146455) (owner: 10Cmjohnson) [19:14:01] (03PS2) 10Cmjohnson: Remove mgmt dns entries for frack host bismuth (decom'd) [dns] - 10https://gerrit.wikimedia.org/r/596272 (https://phabricator.wikimedia.org/T248516) [19:15:42] DannyS712: nothing so far so I guess it is all 
fine :] [19:42:26] (03CR) 10Hashar: "> Nothing else pulls in Java there, so I don't see why alternatives are even needed? If you only install Java 8, you only get Java 8." [puppet] - 10https://gerrit.wikimedia.org/r/595531 (https://phabricator.wikimedia.org/T224591) (owner: 10Muehlenhoff) [19:50:47] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:32] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console for soworu - https://phabricator.wikimedia.org/T252705 (10Aklapper) [19:54:49] (03PS1) 10Cmjohnson: removing mgmt/asset tag of a decom server [dns] - 10https://gerrit.wikimedia.org/r/596276 (https://phabricator.wikimedia.org/T226715) [19:55:26] (03CR) 10Cmjohnson: [C: 03+2] Remove mgmt dns entries for frack host bismuth (decom'd) [dns] - 10https://gerrit.wikimedia.org/r/596272 (https://phabricator.wikimedia.org/T248516) (owner: 10Cmjohnson) [19:55:37] (03PS1) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [19:55:39] (03PS2) 10Cmjohnson: removing mgmt/asset tag of a decom server [dns] - 10https://gerrit.wikimedia.org/r/596276 (https://phabricator.wikimedia.org/T226715) [19:56:53] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 881.6 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:57:26] (03CR) 10Cmjohnson: [C: 03+2] removing mgmt/asset tag of a decom server [dns] - 10https://gerrit.wikimedia.org/r/596276 (https://phabricator.wikimedia.org/T226715) (owner: 10Cmjohnson) [19:57:57] RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1000 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:58:51] 08:28:04 legoktm: in the dump tarball the files are called 1.html without the leading "r". should we rename all the files or change the link in index.html ? <-- probably change the links in index.html [20:00:04] halfak and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T2000). [20:06:04] (03PS1) 10Legoktm: static-codereview: Fix links on index.html [puppet] - 10https://gerrit.wikimedia.org/r/596280 (https://phabricator.wikimedia.org/T243056) [20:06:09] mutante: ^^ [20:06:16] (03PS2) 10Jforrester: Stop loading the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595994 (https://phabricator.wikimedia.org/T242430) [20:06:38] (03CR) 10Jforrester: [C: 03+2] Stop loading the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595994 (https://phabricator.wikimedia.org/T242430) (owner: 10Jforrester) [20:07:20] (03Merged) 10jenkins-bot: Stop loading the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595994 (https://phabricator.wikimedia.org/T242430) (owner: 10Jforrester) [20:09:52] (03PS2) 10Jdlrobson: Use AddFooterLink hook for code of conduct and contact links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596277 (https://phabricator.wikimedia.org/T251817) [20:10:01] (03PS1) 10CDanis: run apache-exporter on gerrit hosts [puppet] - 10https://gerrit.wikimedia.org/r/596281 [20:12:32] (03CR) 10CDanis: "PCC lgtm https://puppet-compiler.wmflabs.org/compiler1002/22513/" [puppet] - 10https://gerrit.wikimedia.org/r/596281 (owner: 10CDanis) [20:16:39] (03PS2) 10Jforrester: Stop loading i18n for the ParsoidBatchAPI extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/595995 (https://phabricator.wikimedia.org/T242430) [20:16:44] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T242430 Stop loading the ParsoidBatchAPI extension (duration: 01m 08s) [20:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:49] T242430: Undeploy ParsoidBatchAPI from the Wikimedia cluster - https://phabricator.wikimedia.org/T242430 [20:20:46] (03CR) 10Jforrester: [C: 03+2] Stop loading i18n for the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595995 (https://phabricator.wikimedia.org/T242430) (owner: 10Jforrester) [20:21:35] (03Merged) 10jenkins-bot: Stop loading i18n for the ParsoidBatchAPI extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/595995 (https://phabricator.wikimedia.org/T242430) (owner: 10Jforrester) [20:24:04] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) Discussed some with Joe on IRC and the consensus approach for now is to, for now, write a textfile exporter that parses the systemd status line, and perhaps later... [20:24:06] (03CR) 10Jforrester: [C: 03+1] restrouter: Remove chart and namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/596141 (https://phabricator.wikimedia.org/T242461) (owner: 10JMeybohm) [20:24:09] 10Operations, 10observability, 10serviceops: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (10CDanis) a:03CDanis [20:28:00] (03CR) 10Thcipriani: [C: 03+1] "Looks like useful info." 
[puppet] - 10https://gerrit.wikimedia.org/r/596281 (owner: 10CDanis) [20:28:42] thcipriani: cool, imma merge this now [20:28:44] (03CR) 10CDanis: [C: 03+2] run apache-exporter on gerrit hosts [puppet] - 10https://gerrit.wikimedia.org/r/596281 (owner: 10CDanis) [20:30:15] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10daniel) From the TechCom meeting: we could just put the parser output into memcached with a short expiry t... [20:30:36] cdanis: cool, thank you for that change [20:31:43] 10Operations, 10Security-Team, 10Patch-For-Review, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @jbond I've set up a service account for you with the following information: Service account name: noc-333@j... [20:31:56] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10tstarling) a:03tstarling [20:33:44] (03CR) 10Paladox: run apache-exporter on gerrit hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596281 (owner: 10CDanis) [20:34:16] (03CR) 10CDanis: run apache-exporter on gerrit hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596281 (owner: 10CDanis) [20:34:52] 10Operations, 10Parsing-Team, 10TechCom, 10serviceops, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10tstarling) [20:36:44] (03PS1) 10CDanis: gerrit apache-exporter: fix comment mispaste [puppet] - 10https://gerrit.wikimedia.org/r/596283 [20:37:02] (03CR) 10CDanis: [C: 03+2] gerrit apache-exporter: fix comment mispaste [puppet] - 10https://gerrit.wikimedia.org/r/596283 (owner: 10CDanis) [20:40:46] 
(03CR) 10Paladox: [C: 03+1] gerrit apache-exporter: fix comment mispaste [puppet] - 10https://gerrit.wikimedia.org/r/596283 (owner: 10CDanis) [20:41:50] (03CR) 10Jhedden: [C: 03+1] dumpsdistribution: quiet the load alerts a bit [puppet] - 10https://gerrit.wikimedia.org/r/596268 (owner: 10Bstorm) [20:57:11] (03CR) 10Bstorm: [C: 03+2] dumpsdistribution: quiet the load alerts a bit [puppet] - 10https://gerrit.wikimedia.org/r/596268 (owner: 10Bstorm) [20:58:01] PROBLEM - SSH on analytics1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:01:57] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Jclark-ctr) @CDanis I have edited config file I am still having issues connecting. requesting password [21:02:31] PROBLEM - Check size of conntrack table on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:02:33] PROBLEM - Check systemd state on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:55] PROBLEM - puppet last run on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:03:01] PROBLEM - Hadoop DataNode on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:03:01] PROBLEM - Disk space on Hadoop worker on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:03:11] PROBLEM - 
Hadoop NodeManager on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:05:11] PROBLEM - Disk space on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1055&var-datasource=eqiad+prometheus/ops [21:05:41] PROBLEM - Check whether ferm is active by checking the default input chain on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:15:21] PROBLEM - IPMI Sensor Status on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:20:12] PROBLEM - DPKG on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:20:36] PROBLEM - configured eth on analytics1055 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:26:19] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) [21:26:56] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) [21:27:21] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10AntiCompositeNumber) [21:29:44] PROBLEM - Check the NTP synchronisation status of timesyncd on analytics1055 is CRITICAL: 
CHECK_NRPE: Error - Could not connect to 10.64.5.18: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [21:30:43] !log powercycle analytics1055 [21:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:52] PROBLEM - Host analytics1055 is DOWN: PING CRITICAL - Packet loss = 100% [21:32:47] lovely, it gets stuck while booting [21:38:12] RECOVERY - Check size of conntrack table on analytics1055 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:38:14] RECOVERY - Host analytics1055 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [21:38:16] RECOVERY - Disk space on Hadoop worker on analytics1055 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:38:16] RECOVERY - Hadoop DataNode on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:38:25] \o/ [21:38:32] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [21:38:56] RECOVERY - SSH on analytics1055 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:39:04] RECOVERY - Check systemd state on analytics1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:06] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:41:32] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:22] RECOVERY - MegaRAID on analytics1055 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:46:08] RECOVERY - Disk space on analytics1055 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=analytics1055&var-datasource=eqiad+prometheus/ops [21:46:12] RECOVERY - IPMI Sensor Status on analytics1055 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:48:31] (PS1) Subramanya Sastry: Switch Parsoid's rt-testing from testreduce_0715 to testreduce [puppet] - https://gerrit.wikimedia.org/r/596293 [21:49:03] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:50:15] RECOVERY - DPKG on analytics1055 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [21:50:35] RECOVERY - configured eth on analytics1055 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [21:52:01] Operations, LDAP-Access-Requests: Add `dcipoletti` to `wmf` Access Group - https://phabricator.wikimedia.org/T252674 (colewhite) a:colewhite→None [21:53:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:54:05] PROBLEM - MediaWiki exceptions and fatals per minute on 
icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:54:31] (CR) C. Scott Ananian: [C: +1] Switch Parsoid's rt-testing from testreduce_0715 to testreduce [puppet] - https://gerrit.wikimedia.org/r/596293 (owner: Subramanya Sastry) [21:54:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:55:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:57:34] (PS1) Cwhite: admin add Segun Oworu to ldap only users [puppet] - https://gerrit.wikimedia.org/r/596298 (https://phabricator.wikimedia.org/T251523) [22:00:01] RECOVERY - Check the NTP synchronisation status of timesyncd on analytics1055 is OK: OK: synced at Wed 2020-05-13 22:00:00 UTC. https://wikitech.wikimedia.org/wiki/NTP [22:02:17] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251523 (colewhite) Hi Segun! Is it safe to assume you will need the same level of access to Analytics data as @ahemmer? 
T251123 [22:06:41] RECOVERY - Check whether ferm is active by checking the default input chain on analytics1055 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:08:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:10:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:15:33] Operations, SRE-Access-Requests, Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251523 (soworu) >>! In T251523#6135269, @colewhite wrote: > Hi Segun! > > Is it safe to assume you will need the same level of acce... 
[22:21:07] (PS1) BryanDavis: Apply --mem and --cpu to kubernetes shell pods [software/tools-webservice] - https://gerrit.wikimedia.org/r/596306 (https://phabricator.wikimedia.org/T252700) [22:26:45] !log Pooled wdqs1008 given that lag has returned to normal levels and the instance is responding to queries correctly [22:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:38] !log Pooled wdqs2005 given that lag has returned to normal levels and the instance is responding to queries correctly [22:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:20] !log Depooled wdqs1004 for subsequent wdqs data xfer [22:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:22] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [22:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:45] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [22:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:38] (CR) Bmansurov: Add recommendation-api chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov) [22:42:51] (PS11) Bmansurov: Add recommendation-api chart [deployment-charts] - https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) [22:43:07] (CR) jerkins-bot: [V: -1] Add recommendation-api chart [deployment-charts] - https://gerrit.wikimedia.org/r/565788 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov) [22:50:12] (CR) Bstorm: "Between gems, pip and npm, it was only a matter of time!" 
[software/tools-webservice] - https://gerrit.wikimedia.org/r/596306 (https://phabricator.wikimedia.org/T252700) (owner: BryanDavis) [22:50:26] (PS1) Catrope: Enable GrowthExperiments features on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/596313 (https://phabricator.wikimedia.org/T252420) [22:59:02] (CR) Bstorm: "This works. I'm merging it." [software/tools-webservice] - https://gerrit.wikimedia.org/r/596306 (https://phabricator.wikimedia.org/T252700) (owner: BryanDavis) [22:59:07] (CR) Bstorm: [C: +2] Apply --mem and --cpu to kubernetes shell pods [software/tools-webservice] - https://gerrit.wikimedia.org/r/596306 (https://phabricator.wikimedia.org/T252700) (owner: BryanDavis) [22:59:41] (Merged) jenkins-bot: Apply --mem and --cpu to kubernetes shell pods [software/tools-webservice] - https://gerrit.wikimedia.org/r/596306 (https://phabricator.wikimedia.org/T252700) (owner: BryanDavis) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200513T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:03:34] (CR) Cwhite: "Will this update the value of the instance label Prometheus server attaches to metrics on scrape?" [puppet] - https://gerrit.wikimedia.org/r/596239 (owner: Herron) [23:15:15] (PS1) CDanis: Revert "Revert "prepend {es,kn}ams"" [homer/public] - https://gerrit.wikimedia.org/r/596319 [23:44:55] (CR) Bstorm: [C: +2] "I'm going to go ahead and merge this. The biggest impact would likely be in tools anyway, where it is already set." 
[puppet] - https://gerrit.wikimedia.org/r/596063 (https://phabricator.wikimedia.org/T252260) (owner: Bstorm) [23:51:17] Operations, MediaWiki-General, serviceops-radar, Performance-Team (Radar), and 3 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (Krinkle)