[00:00:02] RoanKattouw: I can hang out if you can.. landing this performance fix is important to me :) [00:00:10] Yeah I have time [00:00:13] It should be almost done with CI now [00:00:16] looks like it finally merged [00:00:17] Oh, right, 'next' does not include current [00:00:21] (03CR) 10Ayounsi: [C: 03+1] netbox : Add Hiera data for automatic LibreNMS Netbox report [puppet] - 10https://gerrit.wikimedia.org/r/522562 (owner: 10CRusnov) [00:00:23] sorry RoanKattouw , I +2'ed a patch. please ignore :) [00:00:29] No worries, I'm pulling now [00:01:35] jdlrobson: Now live on mwdebug1002, please test [00:03:42] !log restart logstash to revert mitigations - T228089 [00:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:50] T228089: Logstash down for MediaWiki - https://phabricator.wikimedia.org/T228089 [00:03:59] RoanKattouw: sync away [00:04:53] PROBLEM - Disk space on mw1293 is CRITICAL: DISK CRITICAL - free space: /tmp 1404 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [00:05:18] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.13/skins/MinervaNeue/: Restrict AMC scripts and styles to AMC mode (T227929) (duration: 00m 52s) [00:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:05:25] T227929: CSS spiked by 2kb (23% increase) for ALL users - https://phabricator.wikimedia.org/T227929 [00:07:18] (03CR) 10CRusnov: [C: 03+2] netbox : Add Hiera data for automatic LibreNMS Netbox report [puppet] - 10https://gerrit.wikimedia.org/r/522562 (owner: 10CRusnov) [00:07:50] (03PS2) 10CRusnov: netbox : Add Hiera data for automatic LibreNMS Netbox report [puppet] - 10https://gerrit.wikimedia.org/r/522562 [00:10:47] thanks RoanKattouw, just monitoring the graphs now.. [00:14:43] RECOVERY - Disk space on mw1293 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [00:14:44] RoanKattouw: ok to roll out the other patch now?
[00:21:41] Krinkle: yeah go for it [00:22:16] k [00:24:46] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.13/includes/cache/LinkCache.php: 4a5f4ca2fd788 (duration: 00m 51s) [00:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:09] jdlrobson: it'll be hard to see the regression fade out because we're at the daily climb. [00:29:09] https://grafana.wikimedia.org/d/000000038/navigation-timing-by-platform?var-source=navtiming2&var-metric=loadEventEnd&var-percentile=p50&refresh=5m&orgId=1 [00:29:18] but, it appears to be fitting last week's curve again [00:29:26] so looking good [00:29:46] (looking at the avg/1h last week graph) [00:44:58] (03PS2) 10Smalyshev: wdqs: introduced tuned journal options to wdqs2001. [puppet] - 10https://gerrit.wikimedia.org/r/523266 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [00:45:25] (03CR) 10jerkins-bot: [V: 04-1] wdqs: introduced tuned journal options to wdqs2001. [puppet] - 10https://gerrit.wikimedia.org/r/523266 (https://phabricator.wikimedia.org/T228122) (owner: 10Gehel) [00:53:55] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10ayounsi) 05Open→03Resolved Everything in the scope of that task is completed. [00:54:00] 10Operations, 10Traffic, 10netops, 10Patch-For-Review, 10Performance-Team (Radar): Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [00:57:45] 10Operations, 10Release-Engineering-Team-TODO, 10Release-Engineering-Team (CI & Testing services), 10User-Joe: Create jenkins job for creating deployment artifacts for `docker-pkg-deploy` - https://phabricator.wikimedia.org/T179562 (10greg) Is this task still generally accurate?
[01:27:38] RECOVERY - Docker registry HTTPS interface on registry1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Docker [01:28:04] RECOVERY - Docker registry HTTPS interface on registry1002 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.207 second response time https://wikitech.wikimedia.org/wiki/Docker [01:30:21] (03PS1) 10Smalyshev: HOST doesn't seem to be actually defined anywhere, and should be always the same [puppet] - 10https://gerrit.wikimedia.org/r/523415 [01:31:30] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 97.76 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [02:12:28] (03PS3) 10CDanis: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 [02:21:04] (03PS2) 10BryanDavis: sudo: Allow root to assume any group [puppet] - 10https://gerrit.wikimedia.org/r/501043 [02:31:54] (03PS4) 10CDanis: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 [02:31:56] (03PS1) 10CDanis: monitoring: extract build_notes_url() function [puppet] - 10https://gerrit.wikimedia.org/r/523452 [02:33:15] (03CR) 10jerkins-bot: [V: 04-1] WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [02:39:34] (03CR) 10BryanDavis: [C: 04-1] "Needs to be vetted by the Security team. 
See https://phabricator.wikimedia.org/T123978#5085634" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [02:43:01] (03PS5) 10CDanis: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 [02:44:01] (03CR) 10jerkins-bot: [V: 04-1] WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [02:45:18] (03CR) 10CDanis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [02:46:14] (03CR) 10jerkins-bot: [V: 04-1] WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [02:51:32] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [02:58:12] (03CR) 10CDanis: "PCC says no-op, as expected: https://puppet-compiler.wmflabs.org/compiler1001/17390/" [puppet] - 10https://gerrit.wikimedia.org/r/523452 (owner: 10CDanis) [03:02:23] (03CR) 10CDanis: "Instead of copying & pasting code around I extracted this into a function in I8a77f337 ... 
but now CI is failing with Evaluation Error: Un" [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [03:12:02] (03PS2) 10CDanis: monitoring: extract build_notes_url() function [puppet] - 10https://gerrit.wikimedia.org/r/523452 [03:12:05] (03PS6) 10CDanis: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 [03:13:16] (03CR) 10jerkins-bot: [V: 04-1] WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [03:18:50] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:21:28] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 921.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [03:40:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:40:54] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:53:42] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 15.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:32:28] 10Operations, 10Traffic: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers - https://phabricator.wikimedia.org/T228135 (10Vgutierrez) [05:32:47] 10Operations, 10Traffic: ATS lacks the possibility of reporting SSL stats to an origin server via HTTP Headers - https://phabricator.wikimedia.org/T228135 (10Vgutierrez) p:05Triage→03Normal [05:41:12] 10Operations, 10Traffic: ATS lacks the possibility of reporting SSL stats to 
an origin server via HTTP Headers - https://phabricator.wikimedia.org/T228135 (10Vgutierrez) Two PRs have been submitted to upstream: * Implement logging of SSL Elliptic Curve used: https://github.com/apache/trafficserver/pull/5724 *... [05:53:53] (03PS3) 10Fsero: helmfile,k8s: adding calico-policy into deploy* for manage it in code [puppet] - 10https://gerrit.wikimedia.org/r/523132 [05:53:55] (03PS9) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [05:54:47] (03Abandoned) 10Fsero: helmfile,k8s: adding calico-policy into deploy* for manage it in code [puppet] - 10https://gerrit.wikimedia.org/r/523132 (owner: 10Fsero) [05:55:10] (03CR) 10Fsero: [C: 03+2] deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 (owner: 10Fsero) [05:57:55] (03PS10) 10Fsero: deploy,helmfile: little refactor and introduce admin_services_secrets [puppet] - 10https://gerrit.wikimedia.org/r/523198 [06:00:00] 10Operations, 10serviceops, 10PHP 7.2 support, 10Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10ArielGlenn) @Krinkle, any chance you can give a link to a few of those errors with their stacktrace in log... [06:14:38] (03PS1) 10Fsero: helmfile: bug, it also needs the parent directory [puppet] - 10https://gerrit.wikimedia.org/r/523567 [06:15:20] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/admin/staging/calico/private] [06:16:34] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/admin/staging/calico/private] [06:17:36] (03CR) 10Fsero: [C: 03+2] helmfile: bug, it also needs the parent directory [puppet] - 10https://gerrit.wikimedia.org/r/523567 (owner: 10Fsero) [06:22:00] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:26:12] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [06:27:46] (03PS1) 10Fsero: helmfile: bug, update hiera admin_services structure [puppet] - 10https://gerrit.wikimedia.org/r/523568 [06:30:05] (03CR) 10Fsero: [C: 03+2] helmfile: bug, update hiera admin_services structure [puppet] - 10https://gerrit.wikimedia.org/r/523568 (owner: 10Fsero) [06:30:54] PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [06:31:38] PROBLEM - puppet last run on conf2003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [06:35:52] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [06:41:20] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [06:47:02] (03PS1) 10Fsero: Capture calico deployment in code. [deployment-charts] - 10https://gerrit.wikimedia.org/r/523580 (https://phabricator.wikimedia.org/T227775) [06:48:29] (03PS2) 10Fsero: Capture calico deployment in code. [deployment-charts] - 10https://gerrit.wikimedia.org/r/523580 (https://phabricator.wikimedia.org/T227775) [06:49:07] (03PS3) 10Fsero: Capture calico deployment in code. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/523580 (https://phabricator.wikimedia.org/T227775) [06:49:49] (03Abandoned) 10Elukey: profile::swap: use the ldap-ro endpoint [puppet] - 10https://gerrit.wikimedia.org/r/521832 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [06:50:12] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (10MoritzMuehlenhoff) It's my understanding that this reduces the steps necessary to restart our recursors is now reduced to a simple depool/repool and that the previous, complex approach from... [06:58:10] RECOVERY - puppet last run on db2086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:52] RECOVERY - puppet last run on conf2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:59] (03PS11) 10Elukey: mcrouter: allow async foreign set/delete WAN cache operations [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [07:02:52] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10jcrespo) FYI, after applying the above change, I expected a huge shift on reported load (even if performance didn't change) or on temperatures, given this (wikireplicas on labs) are our busiest databases... 
[07:12:27] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17391/" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [07:14:39] (03PS2) 10Elukey: Enable mcrouter async replication to codfw on mw1261 and mw1276 [puppet] - 10https://gerrit.wikimedia.org/r/520726 (https://phabricator.wikimedia.org/T225642) [07:25:33] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17392/" [puppet] - 10https://gerrit.wikimedia.org/r/520726 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [07:27:19] (03CR) 10Effie Mouzeli: [C: 03+1] Enable mcrouter async replication to codfw on mw1261 and mw1276 [puppet] - 10https://gerrit.wikimedia.org/r/520726 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [07:29:28] thanks jijiki :) [07:31:15] :D [07:34:21] (03CR) 10Elukey: [C: 03+2] Enable mcrouter async replication to codfw on mw1261 and mw1276 [puppet] - 10https://gerrit.wikimedia.org/r/520726 (https://phabricator.wikimedia.org/T225642) (owner: 10Elukey) [07:40:29] (03CR) 10Vgutierrez: [C: 03+2] Split langlist helper in two [dns] - 10https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [07:40:39] (03PS4) 10Vgutierrez: Split langlist helper in two [dns] - 10https://gerrit.wikimedia.org/r/523106 (https://phabricator.wikimedia.org/T133548) [07:42:32] (03PS3) 10Gehel: wdqs: introduced tuned journal options to wdqs2001. 
[puppet] - 10https://gerrit.wikimedia.org/r/523266 (https://phabricator.wikimedia.org/T228122) [07:43:37] (03CR) 10Vgutierrez: [C: 03+2] Add a ncredir-parking zone [dns] - 10https://gerrit.wikimedia.org/r/523114 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [07:43:49] (03PS2) 10Vgutierrez: Add a ncredir-parking zone [dns] - 10https://gerrit.wikimedia.org/r/523114 (https://phabricator.wikimedia.org/T133548) [07:45:02] !log depool mw1261 to test mcrouter changes [07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:56] (03PS5) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [08:01:31] (03PS1) 10Elukey: Revert "Enable mcrouter async replication to codfw on mw1261 and mw1276" [puppet] - 10https://gerrit.wikimedia.org/r/523619 [08:01:32] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) This is happening on mw1261 (currently depooled): ` elukey@mw1261:~$ ech... 
[08:02:00] (03CR) 10Elukey: [C: 03+2] Revert "Enable mcrouter async replication to codfw on mw1261 and mw1276" [puppet] - 10https://gerrit.wikimedia.org/r/523619 (owner: 10Elukey) [08:07:46] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:07:58] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:08:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 500: Internal Server Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:09:47] (03PS1) 10Muehlenhoff: Add LDPA replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 [08:09:56] looks like ripe atlas api is in trouble [08:10:22] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:11:54] !log uploading coredns_1.5.2 for buster and stretch [08:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:11] !log uploading coredns_1.5.2 for buster and stretch - T226516 [08:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:18] T226516: deploy CoreDNS as a in-cluster DNS service - https://phabricator.wikimedia.org/T226516 [08:13:20] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 480 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:13:26] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 22 probes of 438 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:13:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 21 probes of 438 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:14:58] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 6 probes of 480 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:15:07] (03CR) 10Filippo Giunchedi: "Will this bind both v4 and v6 or only v6? Anyways ferm rules will need adjusting too for A + AAAA when using @resolve" [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [08:24:05] (03CR) 10Filippo Giunchedi: "I'm not opposed to change, I'm skeptical though it'll be effective to keep toolforge and ci docker versions separate, the reason I'm sayin" [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) (owner: 10Muehlenhoff) [08:25:18] (03CR) 10Elukey: "> Will this bind both v4 and v6 or only v6? 
Anyways ferm rules will" [puppet] - 10https://gerrit.wikimedia.org/r/523229 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [08:26:52] (03CR) 10Muehlenhoff: "Our documented procedure currently explicitly documents how to update a single component, so I think that'll work out fine:" [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) (owner: 10Muehlenhoff) [08:28:04] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:28:11] (03PS4) 10Jcrespo: mariadb::ferm_misc: remove firewall rule for servermon [puppet] - 10https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [08:28:56] 10Operations, 10ops-codfw: mc2023 / mc2025 fail to mount root partition within 90 seconds using Linux 4.9 - https://phabricator.wikimedia.org/T170152 (10MoritzMuehlenhoff) 05Open→03Resolved I think we can close this, the error didn't reoccur with the subsequent reboots and might have just been a race condi... 
[08:29:42] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:30:20] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:30:48] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:32:15] ACKNOWLEDGEMENT - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Fsero RIPE Atlas API is having a bad time. Giving a 3h ack - The acknowledgement expires at: 2019-07-17 11:31:39. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:32:15] ACKNOWLEDGEMENT - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Fsero RIPE Atlas API is having a bad time. Giving a 3h ack - The acknowledgement expires at: 2019-07-17 11:31:39. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:32:15] ACKNOWLEDGEMENT - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Fsero RIPE Atlas API is having a bad time. Giving a 3h ack - The acknowledgement expires at: 2019-07-17 11:31:39. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:32:15] ACKNOWLEDGEMENT - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out Fsero RIPE Atlas API is having a bad time. Giving a 3h ack - The acknowledgement expires at: 2019-07-17 11:31:39. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:32:23] (03CR) 10Filippo Giunchedi: "LGTM modulo the ordered_yaml comment" [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [08:32:40] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 480 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:34:18] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 4 probes of 480 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:34:56] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 480 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:35:22] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 480 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [08:36:12] (03PS6) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [08:36:20] (03PS4) 10Jcrespo: mariadb: revoke servermon grants [puppet] - 10https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [08:37:22] 10Operations, 10Patch-For-Review: Decommission servermon - 
https://phabricator.wikimedia.org/T198939 (10jcrespo) What about the puppet database on m1? [08:38:54] (03CR) 10Jcrespo: [C: 03+2] mariadb: revoke servermon grants [puppet] - 10https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [08:39:13] (03PS1) 10Awight: Enable FileImporter source wiki edit and delete [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523661 (https://phabricator.wikimedia.org/T225617) [08:40:08] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10jcrespo) Also the passwords have to be removed from the private repo (and possibly from labs/private). [08:44:41] !log dropping servermon accounts from m1 dbs T198939 [08:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] T198939: Decommission servermon - https://phabricator.wikimedia.org/T198939 [08:47:44] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10fgiunchedi) ok! sounds good @Cmjohnson !
[08:47:46] (03PS1) 10Fsero: Added a new coredns container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/523662 (https://phabricator.wikimedia.org/T226516) [08:48:17] (03CR) 10Fsero: [V: 03+2 C: 03+2] Added a new coredns container image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/523662 (https://phabricator.wikimedia.org/T226516) (owner: 10Fsero) [08:50:03] !log upload coredns docker image into registry T226516 [08:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:10] T226516: deploy CoreDNS as a in-cluster DNS service - https://phabricator.wikimedia.org/T226516 [08:57:58] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) All right this is a clear PEBCAK (problem exists between computer and key... [08:59:56] (03PS1) 10Elukey: Revert "Revert "Enable mcrouter async replication to codfw on mw1261 and mw1276"" [puppet] - 10https://gerrit.wikimedia.org/r/523664 [09:00:09] (03PS2) 10Elukey: Revert "Revert "Enable mcrouter async replication to codfw on mw1261 and mw1276"" [puppet] - 10https://gerrit.wikimedia.org/r/523664 [09:05:38] (03PS1) 10Urbanecm: Raise zh_classicalwiki's requirement for autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523665 (https://phabricator.wikimedia.org/T228141) [09:08:53] (03PS1) 10Jbond: puppetmaster: updatenetboot with correct puppetmaster config [puppet] - 10https://gerrit.wikimedia.org/r/523666 [09:10:45] (03PS2) 10Jbond: puppetmaster: update netboot with correct puppetmaster config [puppet] - 10https://gerrit.wikimedia.org/r/523666 [09:12:57] (03CR) 10Jbond: [C: 03+2] puppetmaster: update netboot with correct puppetmaster config [puppet] - 10https://gerrit.wikimedia.org/r/523666 (owner: 10Jbond) [09:15:07] (03CR) 10Elukey: [C: 03+2] Revert "Revert 
"Enable mcrouter async replication to codfw on mw1261 and mw1276"" [puppet] - 10https://gerrit.wikimedia.org/r/523664 (owner: 10Elukey) [09:15:15] (03PS3) 10Elukey: Revert "Revert "Enable mcrouter async replication to codfw on mw1261 and mw1276"" [puppet] - 10https://gerrit.wikimedia.org/r/523664 [09:20:30] (03PS1) 10Filippo Giunchedi: prometheus: add kafka logging consumer lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) [09:22:21] (03PS2) 10Arturo Borrero Gonzalez: toolforge: kubeadm master nodes shouldn't use client certs for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523328 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:22:36] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) In light of what I wrote above: ` elukey@mw1261:~$ echo -e "get /codfw/... 
[09:23:13] !log pool mw1261 back with mcrouter async replication settings - T225642 [09:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:21] T225642: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 [09:24:29] (03CR) 10Jbond: [C: 03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/523452 (owner: 10CDanis) [09:24:56] !log apply mcrouter async replication settings to mw1276 - T225642 [09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:00] (03PS3) 10Jbond: monitoring: extract build_notes_url() function [puppet] - 10https://gerrit.wikimedia.org/r/523452 (owner: 10CDanis) [09:26:52] (03CR) 10Jbond: [C: 03+2] monitoring: extract build_notes_url() function [puppet] - 10https://gerrit.wikimedia.org/r/523452 (owner: 10CDanis) [09:27:46] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10elukey) We can check per-hosts metrics with: https://grafana.wikimedia.org/d/000... [09:28:16] (03PS7) 10Jbond: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [09:29:07] (03CR) 10jerkins-bot: [V: 04-1] WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [09:34:00] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[09:39:28] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:39:40] (03PS2) 10Filippo Giunchedi: prometheus: add kafka logging consumer lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) [09:42:39] 10Operations, 10media-storage: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad"" - https://phabricator.wikimedia.org/T226937 (10fgiunchedi) @Urbanecm I'm sorry I won't have time to dig into this anytime soon, although {T228086} might have a clue [09:45:56] (03PS7) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [09:47:07] (03PS8) 10Jbond: WIP nrpe: support dashboard_links in nrpe::check_service [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [09:47:14] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) I've submitted a proposed update to fix the underlying OpenSSH bug in Debian Stretch... 
[09:49:33] (03PS8) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [09:50:17] (03CR) 10jerkins-bot: [V: 04-1] lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:52:50] (03PS9) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [09:53:30] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [09:54:49] (03PS1) 10Lens0021: Update ext dist settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523668 [09:56:24] (03CR) 10Ema: [C: 03+1] "go earn your tshirt" [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:57:58] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227138 (10elukey) The kafka hosts are going to be decommed in T226517, so not a concern. 
The other hosts can go down without horrible consequences :) [09:59:37] (03CR) 10Vgutierrez: [C: 03+2] lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [09:59:46] (03PS10) 10Vgutierrez: lvs: Add ncredir service to high-traffic1 [puppet] - 10https://gerrit.wikimedia.org/r/522055 (https://phabricator.wikimedia.org/T133548) [09:59:54] (03PS1) 10Filippo Giunchedi: Revert "syslog: add temp rsync to copy data" [puppet] - 10https://gerrit.wikimedia.org/r/523669 (https://phabricator.wikimedia.org/T200706) [09:59:56] (03PS1) 10Filippo Giunchedi: Remove lithium from service [puppet] - 10https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706) [10:00:16] 10Operations, 10ops-eqiad, 10DC-Ops: a3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227139 (10elukey) All the analytics nodes are hadoop workers, not a big deal if they lose power. [10:01:42] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10elukey) [10:02:00] (03PS2) 10Muehlenhoff: Add LDAP replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 [10:02:18] (03CR) 10jerkins-bot: [V: 04-1] Add LDAP replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 (owner: 10Muehlenhoff) [10:02:19] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10elukey) [10:02:53] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10elukey) [10:03:31] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "syslog: add temp rsync to copy data" [puppet] - 10https://gerrit.wikimedia.org/r/523669 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [10:03:43] (03PS2) 10Filippo Giunchedi: Revert "syslog: add temp rsync to copy data" [puppet] - 10https://gerrit.wikimedia.org/r/523669 
(https://phabricator.wikimedia.org/T200706) [10:03:59] (03CR) 10Jbond: WIP nrpe: support dashboard_links in nrpe::check_service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [10:04:21] (03Abandoned) 10Fsero: Added defaults of node heap size to match the new one introduced. [deployment-charts] - 10https://gerrit.wikimedia.org/r/483398 (https://phabricator.wikimedia.org/T213414) (owner: 10Fsero) [10:04:24] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=ncredir,service=nginx [10:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:52] 10Operations, 10ops-eqiad, 10DC-Ops: a4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227140 (10elukey) I replaced the Analytics tag for kafka1001 with @herron since the kafka main cluster is now handled by infrastructure foundations. I also added some Analytics tags, and added @akosiaris for conf1... [10:05:42] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10elukey) [10:05:49] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10elukey) A single Hadoop worker node for analytics, all good. 
[10:08:27] !log restarting pybal on lvs2004 [10:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:08] PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 4 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:09:22] PROBLEM - PyBal connections to etcd on lvs2004 is CRITICAL: CRITICAL: 4 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:09:58] ^^ that's expected [10:11:28] !log restarting pybal on lvs1016 [10:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:58] PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 4 connections established with conf1004.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:13:26] PROBLEM - PyBal IPVS diff check on lvs1013 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.154.232:443, 208.80.154.232:80, 2620:0:861:ed1a::9:80, 2620:0:861:ed1a::9:443]) https://wikitech.wikimedia.org/wiki/PyBal [10:14:08] PROBLEM - PyBal IPVS diff check on lvs2001 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:860:ed1a::9:443, 208.80.153.232:443, 2620:0:860:ed1a::9:80, 208.80.153.232:80]) https://wikitech.wikimedia.org/wiki/PyBal [10:14:50] RECOVERY - PyBal connections to etcd on lvs2004 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:15:54] !log restart pybal on lvs2001 [10:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:23] !log restart pybal on lvs1013 [10:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:07] (03CR) 10Effie Mouzeli: [C: 03+1] Switch pool counters for Thumbor in codfw to poolcounter2003 [puppet] - 10https://gerrit.wikimedia.org/r/521283 (https://phabricator.wikimedia.org/T224572) (owner: 
10Muehlenhoff) [10:18:51] (03CR) 10Filippo Giunchedi: "Note that e.g. PDUs still point to lithium (syslog.eqiad.wmnet), DNS change to follow" [puppet] - 10https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [10:18:52] RECOVERY - PyBal IPVS diff check on lvs1013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:19:36] RECOVERY - PyBal IPVS diff check on lvs2001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:20:04] RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:21:14] (03CR) 10Vgutierrez: [C: 03+2] Switch wikipedia.com to the ncredir-parking DNS zonefile [dns] - 10https://gerrit.wikimedia.org/r/523115 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:21:20] (03PS2) 10Vgutierrez: Switch wikipedia.com to the ncredir-parking DNS zonefile [dns] - 10https://gerrit.wikimedia.org/r/523115 (https://phabricator.wikimedia.org/T133548) [10:22:49] (03CR) 10Filippo Giunchedi: [C: 03+1] Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) (owner: 10Muehlenhoff) [10:22:50] RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 8 connections established with conf1004.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [10:23:24] (03PS3) 10Muehlenhoff: Add LDAP replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 [10:28:19] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:28:20] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:37] !log rebooting 
ms-fe1005 to pick up kernel with SACK fixed (T228086) [10:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:45] T228086: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 [10:34:48] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10MoritzMuehlenhoff) >>! In T198939#5336152, @jcrespo wrote: > What about the puppet database on m1? The database should all be ephemeral data about past server state, so no need to retain, but adding @akosia... [10:43:18] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [10:46:13] (03PS1) 10Jbond: monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 [10:47:13] (03PS1) 10Vgutierrez: ncredir: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/523676 (https://phabricator.wikimedia.org/T133548) [10:47:20] (03CR) 10jerkins-bot: [V: 04-1] monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 (owner: 10Jbond) [10:49:05] (03PS2) 10Jbond: monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 [10:49:19] (03PS1) 10Elukey: Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) [10:50:59] (03CR) 10Jbond: WIP nrpe: support dashboard_links in nrpe::check_service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [10:52:25] (03PS2) 10Elukey: Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) [10:52:39] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Enable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/523676 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:54:07] (03CR) 10Elukey: 
"https://puppet-compiler.wmflabs.org/compiler1002/17403/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [10:57:48] (03PS1) 10Vgutierrez: cumin: Add ncredir aliases [puppet] - 10https://gerrit.wikimedia.org/r/523680 (https://phabricator.wikimedia.org/T133548) [10:58:10] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team: rack/setup/install cloudmon100[123] - https://phabricator.wikimedia.org/T228102 (10aborrero) We already have some servers in a similar namespace: labmon1001 and labmon1002. I find it confusing that we use a similar naming scheme for 2 diff... [10:58:25] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudmon100[123] - https://phabricator.wikimedia.org/T228102 (10aborrero) [10:58:44] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/523680 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190716T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:09] (03CR) 10Vgutierrez: [C: 03+2] cumin: Add ncredir aliases [puppet] - 10https://gerrit.wikimedia.org/r/523680 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:02:18] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10MoritzMuehlenhoff) The effect is pretty visible for ms-be1005 on https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=3&fullscreen&from=now-1h&to=now ; I'll also reboot the other... 
[11:07:29] (03PS1) 10Vgutierrez: lvs: Fix typo on icinga check command definition for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523682 (https://phabricator.wikimedia.org/T133548) [11:07:33] *sigh* [11:09:21] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [11:12:16] !log rebooting remaining swift frontends in eqiad to pick up a kernel with SACK fixed (T228086) [11:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:24] T228086: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 [11:12:28] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:12:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:06] (03PS4) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [11:13:47] (03PS1) 10Urbanecm: Enable partial blocks on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523686 (https://phabricator.wikimedia.org/T228150) [11:14:33] (03PS3) 10Arturo Borrero Gonzalez: toolforge: kubeadm master nodes shouldn't use client certs for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523328 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [11:16:49] (03CR) 10Volans: [C: 03+1] "LGTM, AFAIK it shouldn't have any unwanted effect fixing the typo." 
[puppet] - 10https://gerrit.wikimedia.org/r/523682 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:17:10] (03CR) 10Vgutierrez: [C: 03+2] lvs: Fix typo on icinga check command definition for ncredir [puppet] - 10https://gerrit.wikimedia.org/r/523682 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:19:43] (03CR) 10Cwhite: "heh, somehow missed that one. Sorry about that!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [11:22:17] (03CR) 10Cwhite: [C: 03+1] Logstash: Use log context for the api-feature-usage channel [puppet] - 10https://gerrit.wikimedia.org/r/493323 (https://phabricator.wikimedia.org/T217162) (owner: 10Anomie) [11:23:14] (03PS7) 10Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) [11:28:59] 10Operations, 10media-storage: Swift TCP retransmits increase - https://phabricator.wikimedia.org/T228086 (10MoritzMuehlenhoff) I've also rebooted the remaining frontends, but with some more data it doesn't actually seem as if this is caused by the disabled SACKs, if e.g. one limits the dashboard to "stat1005"... [11:33:12] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudmon100[123] - https://phabricator.wikimedia.org/T228102 (10Bstorm) Great point @aborrero! I almost half wanted to name all of these "cloudstore" and figure it out from there, but that's not great. `cloudst... 
[11:36:23] I'm cutting the branch for this week's train (which has a blocker so don't know if it will be deployed yet) [11:43:27] (03PS3) 10Muehlenhoff: Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) [11:43:30] (03CR) 10Ema: [C: 03+1] "lgtm, pcc also seems fine https://puppet-compiler.wmflabs.org/compiler1002/17406/" [puppet] - 10https://gerrit.wikimedia.org/r/523624 (owner: 10Muehlenhoff) [11:43:42] (03PS4) 10Arturo Borrero Gonzalez: toolforge: kubeadm master nodes shouldn't use client certs for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523328 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [11:44:15] (03CR) 10Arturo Borrero Gonzalez: "This was mostly reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/523328" [puppet] - 10https://gerrit.wikimedia.org/r/523220 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:45:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: kubeadm master nodes shouldn't use client certs for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523328 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [11:46:03] (03CR) 10Jbond: "lgtm, some minor nits" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [11:46:23] (03CR) 10Vgutierrez: [C: 04-1] "you're missing hieradata/role/codfw/openldap/replica.yaml defining lvs::realserver::realserver_ips for codfw" [puppet] - 10https://gerrit.wikimedia.org/r/523624 (owner: 10Muehlenhoff) [11:47:40] (03PS4) 10Muehlenhoff: Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) [11:49:45] (03CR) 10Muehlenhoff: [C: 03+2] Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) (owner: 10Muehlenhoff) [11:54:24] (03CR) 
10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/523670 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [11:57:55] !log synched docker-ce, docker-ce-cli, containerd.io to thirdparty/ci for stretch-wikimedia (T226236) [11:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:02] T226236: Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 [11:58:40] 10Operations, 10serviceops, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review: Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (10MoritzMuehlenhoff) 05Open→03Resolved Packages have been synched to thirdparty/ci for stretch-w... [11:58:44] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10MoritzMuehlenhoff) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190716T1200) [12:08:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, two nits inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [12:11:12] !log Depool mw1293 and pool back [12:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:44] (03CR) 10Muehlenhoff: "Good catch! Will amend." [puppet] - 10https://gerrit.wikimedia.org/r/523624 (owner: 10Muehlenhoff) [12:20:27] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10akosiaris) >>! In T198939#5336414, @MoritzMuehlenhoff wrote: >>>! In T198939#5336152, @jcrespo wrote: >> What about the puppet database on m1? > > The database should all be ephemeral data about past server... 
[12:22:04] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::grafana::production: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523129 (owner: 10Muehlenhoff) [12:22:42] (03PS1) 10Vgutierrez: lvs: Fix icinga checks for ncredir and ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/523700 (https://phabricator.wikimedia.org/T133548) [12:24:58] (03PS4) 10Muehlenhoff: Add LDAP replicas in codfw to conf-tool/LVS [puppet] - 10https://gerrit.wikimedia.org/r/523624 [12:28:05] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10jcrespo) Could also confirm all puppet grants (mysql database is understood, of course) on puppet database are no longer needed? You can find it on the misc production grants. [12:36:35] (03PS1) 10Lars Wirzenius: Group0 to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523701 [12:37:34] (03PS1) 10Jcrespo: mariadb: Remove puppet mysql grants for m1 misc databases [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) [12:39:32] (03PS2) 10Vgutierrez: lvs: Fix icinga checks for ncredir and ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/523700 (https://phabricator.wikimedia.org/T133548) [12:42:40] !log deleting stale wikidata indices (elastic@eqiad) T227136 [12:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:46] T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136 [12:47:46] (03PS1) 10Effie Mouzeli: WIP: profile:service_proxy: Add more hiera vars [puppet] - 10https://gerrit.wikimedia.org/r/523703 [12:48:28] (03CR) 10Elukey: Introduce module amd_rocm (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [12:48:33] (03CR) 10jerkins-bot: [V: 04-1] WIP: profile:service_proxy: Add more hiera vars [puppet] - 10https://gerrit.wikimedia.org/r/523703 
(owner: 10Effie Mouzeli) [12:49:06] (03PS3) 10Elukey: Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) [12:49:48] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.5 (duration: 07m 42s) [12:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:13] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.4 (duration: 02m 11s) [12:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:29] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.6 (duration: 02m 04s) [12:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:16] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.7 (duration: 02m 01s) [12:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:11] !log liw@deploy1001 Pruned MediaWiki: 1.34.0-wmf.8 (duration: 01m 46s) [12:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:24] Krinkle: liw is about to push mediawiki to testwiki :) [13:01:29] though the sync is going to take a while [13:02:11] (03PS1) 10Ema: ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) [13:03:35] (03PS1) 10Muehlenhoff: Disable mirror/udeb/suite now that Buster is a stable release [puppet] - 10https://gerrit.wikimedia.org/r/523706 [13:03:37] (03PS1) 10Muehlenhoff: Remove trusty d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523707 [13:03:39] (03PS1) 10Muehlenhoff: Add a new buster-test d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523708 [13:03:52] (03PS1) 10Jbond: puppet_compiler: Add checks for missing facts files [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/523709 [13:04:07] hashar_: k, don't forget to pull in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/523695/ first [13:04:10] (03PS3) 10Vgutierrez: lvs: Fix icinga checks for 
ncredir and ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/523700 (https://phabricator.wikimedia.org/T133548) [13:04:21] was meant to land in master before the cut, but the cut started while it was in CI [13:04:26] so I cherrypicked immediately [13:04:28] it landed now [13:04:47] liw: ^^ gotta pull mediawiki/core in /srv/mediawiki-staging/php-1.34.0-wmf.14 :) [13:07:13] (03PS5) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [13:08:52] (03CR) 10Vgutierrez: [C: 03+2] lvs: Fix icinga checks for ncredir and ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/523700 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [13:09:02] (03PS4) 10Vgutierrez: lvs: Fix icinga checks for ncredir and ncredir-https [puppet] - 10https://gerrit.wikimedia.org/r/523700 (https://phabricator.wikimedia.org/T133548) [13:12:23] (03PS2) 10Ema: ATS: add support for atsmtail systemd services [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) [13:12:27] Krinkle: rebased :] [13:14:30] !log liw@deploy1001 Started scap: testwiki to php-1.34.0-wmf.14 and rebuild l10n cache [13:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:57] hashar_, ack, thanks [13:15:07] (03CR) 10Ema: "pcc output here https://puppet-compiler.wmflabs.org/compiler1001/17414/" [puppet] - 10https://gerrit.wikimedia.org/r/523705 (https://phabricator.wikimedia.org/T227668) (owner: 10Ema) [13:18:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:20:19] !log restarting pybal on lvs2004 and lvs1016 [13:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:50] (03CR) 
10Muehlenhoff: [C: 03+1] Introduce module amd_rocm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [13:22:20] (03PS6) 10Ema: ATS: add atsbackend.mtail [puppet] - 10https://gerrit.wikimedia.org/r/523168 (https://phabricator.wikimedia.org/T227668) [13:22:37] 10Operations, 10observability, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving, quarter is over, some subtasks still TODO [13:24:44] !log restarting pybal on lvs2001 and lvs1013 [13:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:12] (03PS4) 10Elukey: Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) [13:27:46] (03PS2) 10Muehlenhoff: Disable mirror/udeb/suite now that Buster is a stable release [puppet] - 10https://gerrit.wikimedia.org/r/523706 [13:28:05] (03PS3) 10Muehlenhoff: Disable mirror/udeb/suite now that Buster is a stable release [puppet] - 10https://gerrit.wikimedia.org/r/523706 [13:31:46] (03CR) 10Ottomata: [C: 03+1] Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [13:31:58] (03CR) 10Muehlenhoff: [C: 03+2] Disable mirror/udeb/suite now that Buster is a stable release [puppet] - 10https://gerrit.wikimedia.org/r/523706 (owner: 10Muehlenhoff) [13:32:13] (03CR) 10Elukey: [C: 03+2] Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [13:32:21] (03PS5) 10Elukey: Introduce module amd_rocm [puppet] - 10https://gerrit.wikimedia.org/r/523677 (https://phabricator.wikimedia.org/T224723) [13:33:51] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of 
non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Vgutierrez) [13:36:34] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Vgutierrez) ncredir service has been deployed successfully and it's currently serving live traffic for wikipedia.co... [13:37:27] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 54.55, 22.50, 13.72 https://wikitech.wikimedia.org/wiki/Application_servers [13:37:53] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 48.33, 19.40, 12.53 https://wikitech.wikimedia.org/wiki/Application_servers [13:38:25] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 48.78, 22.77, 13.51 https://wikitech.wikimedia.org/wiki/Application_servers [13:38:39] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 65.27, 27.38, 17.74 https://wikitech.wikimedia.org/wiki/Application_servers [13:38:43] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 55.64, 21.69, 13.54 https://wikitech.wikimedia.org/wiki/Application_servers [13:38:49] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 21.85, 19.70, 13.47 https://wikitech.wikimedia.org/wiki/Application_servers [13:38:50] mmm [13:38:53] we're in the deploy window, but I'm not quite yet ready to deploy to group0 [13:39:01] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CRITICAL - load average: 68.20, 30.25, 19.26 https://wikitech.wikimedia.org/wiki/Application_servers [13:39:05] PROBLEM - High CPU load on API appserver on mw1280 is CRITICAL: CRITICAL - load average: 65.50, 28.99, 18.66 https://wikitech.wikimedia.org/wiki/Application_servers [13:39:13] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 
21.41, 18.67, 12.94 https://wikitech.wikimedia.org/wiki/Application_servers [13:39:31] (03PS2) 10Muehlenhoff: Remove trusty d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523707 [13:39:49] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.82, 19.57, 13.16 https://wikitech.wikimedia.org/wiki/Application_servers [13:40:03] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 24.98, 23.86, 17.36 https://wikitech.wikimedia.org/wiki/Application_servers [13:40:03] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) The idea that I have is to re-use what done for the appservers, namely put nginx in front of httpd to terminate TLS. In theory we could: * generate one... [13:40:09] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 26.26, 21.81, 14.39 https://wikitech.wikimedia.org/wiki/Application_servers [13:40:25] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 27.73, 26.42, 18.87 https://wikitech.wikimedia.org/wiki/Application_servers [13:40:29] RECOVERY - High CPU load on API appserver on mw1280 is OK: OK - load average: 25.95, 25.31, 18.27 https://wikitech.wikimedia.org/wiki/Application_servers [13:41:16] !log liw@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.14 and rebuild l10n cache (duration: 26m 45s) [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:25] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) [13:41:39] (03PS2) 10Gehel: HOST doesn't seem to be actually defined anywhere, and should be always the same [puppet] - 10https://gerrit.wikimedia.org/r/523415 (owner: 10Smalyshev) [13:41:48] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS certificates for Analytics 
origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) a:03elukey [13:41:57] (03PS3) 10Gehel: wdqs: HOST doesn't seem to be actually defined anywhere, and should be always the same [puppet] - 10https://gerrit.wikimedia.org/r/523415 (owner: 10Smalyshev) [13:42:39] (03CR) 10jerkins-bot: [V: 04-1] wdqs: HOST doesn't seem to be actually defined anywhere, and should be always the same [puppet] - 10https://gerrit.wikimedia.org/r/523415 (owner: 10Smalyshev) [13:42:48] (03Abandoned) 10Gehel: Revert "maps: upgrade to nodejs10" [puppet] - 10https://gerrit.wikimedia.org/r/508841 (owner: 10Gehel) [13:43:08] (03CR) 10Muehlenhoff: [C: 03+2] Remove trusty d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523707 (owner: 10Muehlenhoff) [13:43:56] (03PS4) 10Gehel: wdqs: remove $HOST variable from reloadDCAT script [puppet] - 10https://gerrit.wikimedia.org/r/523415 (owner: 10Smalyshev) [13:44:29] (03CR) 10CDanis: [C: 03+1] profile::grafana::production: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523129 (owner: 10Muehlenhoff) [13:44:43] (03PS5) 10Gehel: wdqs: remove $HOST variable from reloadDCAT script [puppet] - 10https://gerrit.wikimedia.org/r/523415 (owner: 10Smalyshev) [13:48:03] (03PS1) 10Bstorm: toolforge: put the client certs back in for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523716 (https://phabricator.wikimedia.org/T215531) [13:48:44] (03PS2) 10Muehlenhoff: Add a new buster-test d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523708 [13:49:51] (03CR) 10Gehel: [C: 03+2] wdqs: remove $HOST variable from reloadDCAT script [puppet] - 10https://gerrit.wikimedia.org/r/523415 (owner: 10Smalyshev) [13:51:22] (03PS3) 10Muehlenhoff: Add a new buster-test d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523708 [13:53:42] (03CR) 10Muehlenhoff: [C: 03+2] Add a new buster-test d-i config [puppet] - 10https://gerrit.wikimedia.org/r/523708 (owner: 10Muehlenhoff) [13:56:43] (03CR) 10CDanis: prometheus: add kafka logging 
consumer lag alerts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) (owner: 10Filippo Giunchedi) [13:57:49] (03CR) 10CDanis: [C: 03+1] "thanks for doing this and for your help with the other changes!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523675 (owner: 10Jbond) [13:58:12] am I correct that beta is not currently getting updated as patches get merged? [13:59:48] (03PS1) 10MSantos: Skipping download if PBF file exists [puppet] - 10https://gerrit.wikimedia.org/r/523718 [14:00:36] (03Abandoned) 10MSantos: Skipping download if PBF file exists [puppet] - 10https://gerrit.wikimedia.org/r/464441 (https://phabricator.wikimedia.org/T205462) (owner: 10MSantos) [14:00:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10Ottomata) Hm, all for it! Although, do you think it would be worth exploring the built in TLS support in the services where they support i... [14:01:37] 10Operations, 10observability, 10Goal, 10User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Quarter is over, resolving. Migrating statsd off pops work is wrapping up [14:01:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) >>! In T227860#5337057, @Ottomata wrote: > Hm, all for it! Although, do you think it would be worth exploring the built in TLS sup... 
[14:03:04] (03CR) 10Lars Wirzenius: [C: 03+2] Group0 to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523701 (owner: 10Lars Wirzenius) [14:03:12] (03PS2) 10Andrew Bogott: striker: Update package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/523335 (owner: 10BryanDavis) [14:03:50] (03PS1) 10Fsero: helmfile,k8s: Add a coredns deployment for DNS in-cluster service [deployment-charts] - 10https://gerrit.wikimedia.org/r/523722 (https://phabricator.wikimedia.org/T226516) [14:04:17] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523701 (owner: 10Lars Wirzenius) [14:04:45] (03PS2) 10Fsero: helmfile,k8s: Add a coredns deployment for DNS in-cluster service [deployment-charts] - 10https://gerrit.wikimedia.org/r/523722 (https://phabricator.wikimedia.org/T226516) [14:05:06] matthiasmullie: I don't know too much about beta, but I see that the jenkins job beta-code-update-eqiad does seem to be running successfully on commits to master [14:05:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10elukey) Refactored the puppet code into a separate module called amd_rocm and updated the documentation. We'll need to follow up wi... [14:06:29] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:06:50] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 to php-1.34.0-wmf.14 [14:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] !log group0 to 1.34.0-wmf.14 [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] (03PS5) 10Andrew Bogott: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [14:10:47] (03PS1) 10Bstorm: toolforge-etcd: enable client cert checking [puppet] - 10https://gerrit.wikimedia.org/r/523723 (https://phabricator.wikimedia.org/T215531) [14:12:53] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (10elukey) [14:13:34] !log Disabling puppet on thumbor*codfw.wmnet - T224572 [14:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:50] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [14:13:57] (03PS3) 10Jbond: monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 [14:14:48] (03CR) 10Jbond: monitoring::build_notes_url: add spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/523675 (owner: 10Jbond) [14:16:50] !log Depool thumbor2001 and pool back - T224572 [14:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:29] 10Operations, 10Traffic: Setup a new PKI software as an alternative to the puppet CA for managing services certificates - https://phabricator.wikimedia.org/T194031 (10Ottomata) I wouldn't call [[ https://github.com/wikimedia/cergen | cergen ]] a proper PKI management software, and probably is too painful to us... 
[14:17:48] (03CR) 10CDanis: [C: 03+1] monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 (owner: 10Jbond) [14:17:49] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:19:24] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (10elukey) Analytics side: if possible I'd need some heads up to force a failover for an-master1001. Memcached side: we have 5 mc10XX shards in the same rack, loosing all of them could be a big problem with... [14:19:40] (03CR) 10Filippo Giunchedi: prometheus: add kafka logging consumer lag alerts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) (owner: 10Filippo Giunchedi) [14:19:55] (03CR) 10Jbond: [C: 03+2] monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 (owner: 10Jbond) [14:20:03] (03PS4) 10Jbond: monitoring::build_notes_url: add spec tests [puppet] - 10https://gerrit.wikimedia.org/r/523675 [14:20:12] (03PS3) 10Filippo Giunchedi: prometheus: add kafka logging consumer lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) [14:20:18] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (10elukey) [14:21:25] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (10elukey) Some heads up could be good for me to gracefully stop daemons on an-coord1001. For kafka-jumbo1003 it is fine if it doesn't risk to loose power together with other kafka-jumbo nodes (2 down are to... 
[14:22:08] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10elukey) [14:26:17] (03Abandoned) 10Ottomata: profile::eventschemas::service - allow server_alias to be configured via hiera [puppet] - 10https://gerrit.wikimedia.org/r/514372 (owner: 10Ottomata) [14:26:36] (03CR) 10Effie Mouzeli: [C: 03+2] Switch pool counters for Thumbor in codfw to poolcounter2003 [puppet] - 10https://gerrit.wikimedia.org/r/521283 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [14:26:40] (03PS9) 10CDanis: nrpe::check_service: add dashboard_links; add one for disk_space [puppet] - 10https://gerrit.wikimedia.org/r/523248 [14:26:48] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) Adding some info about racking: https://netbox.wikimedia.org/search/?q=mc10&obj_type= *... [14:26:51] (03PS2) 10Effie Mouzeli: Switch pool counters for Thumbor in codfw to poolcounter2003 [puppet] - 10https://gerrit.wikimedia.org/r/521283 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [14:28:18] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10wiki_willy) a:05Cmjohnson→03RobH [14:29:17] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10wiki_willy) @RobH - looks like Chris finished the racking part of this install. Can you finish up the rest of the install for these 5 Kafka hosts? Thanks, W... 
[14:29:43] RECOVERY - Check systemd state on ms-be1032 is OK: OK - running: The system is fully operational [14:29:59] !log Enable puppet and rolling restart thumbor* in codfw - T224572 [14:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:07] T224572: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 [14:32:44] (03CR) 10CDanis: [C: 03+2] "Diffs look reasonable: https://puppet-compiler.wmflabs.org/compiler1001/17416/" [puppet] - 10https://gerrit.wikimedia.org/r/523248 (owner: 10CDanis) [14:32:58] (03PS10) 10CDanis: nrpe::check_service: add dashboard_links; add one for disk_space [puppet] - 10https://gerrit.wikimedia.org/r/523248 [14:33:08] (03CR) 10Jbond: [C: 03+2] hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [14:33:18] (03PS8) 10Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) [14:34:06] (03CR) 10Andrew Bogott: [C: 03+2] systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [14:34:13] (03PS6) 10Andrew Bogott: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [14:34:56] (03PS9) 10Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) [14:36:14] (03CR) 10jerkins-bot: [V: 04-1] hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [14:37:05] (03PS1) 10Bstorm: toolforge: Switch up using etcd 
client certs in k8s a little [puppet] - 10https://gerrit.wikimedia.org/r/523726 (https://phabricator.wikimedia.org/T215531) [14:38:04] (03CR) 10Jbond: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [14:38:42] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10elukey) @wiki_willy @RobH hi! Don't mean to jump the queue, but I am wondering if this task and its codfw one could be prioritized over the next week... [14:40:51] !log disable puppet accross the fleat to make a change to the hiera [14:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] (03CR) 10CDanis: [C: 03+1] prometheus: add kafka logging consumer lag alerts [puppet] - 10https://gerrit.wikimedia.org/r/523667 (https://phabricator.wikimedia.org/T228145) (owner: 10Filippo Giunchedi) [14:41:45] (03PS7) 10Andrew Bogott: systemd::timer::job: Add optional $max_runtime_seconds [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [14:43:11] jbond42: you might want to syncup with jijiki (see previous SAL) ;) [14:43:14] !log jbond@cumin1001 conftool action : set/pooled=no; selector: name=dns1001.wikimedia.org [14:43:15] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:43:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:10] (03PS1) 10Muehlenhoff: Switch Thumbor pool counters in eqiad to poolcounter1004 [puppet] - 10https://gerrit.wikimedia.org/r/523728 (https://phabricator.wikimedia.org/T224572) [14:44:23] ahh yes i will do, 
i basicly pushed and then realised this would cause a rrestart of the puppet m,aster which is why there is a lack of notice [14:48:59] volans: :D [14:49:07] (03PS2) 10Muehlenhoff: profile::grafana::production: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523129 [14:50:38] !log jbond@cumin1001 conftool action : set/pooled=yes; selector: name=dns1001.wikimedia.org [14:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:10] !log enable puppet accross the fleat [14:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:47] 10Operations, 10ops-eqiad, 10Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10Cmjohnson) I received the disk on-site but I cannot tell which disk is failed, they all have green LEDs. @elukey could you please let me know which disk slot or let's coordinate to make the di... [14:52:49] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) @godog I did get the new disk but since it's not failed...I am not sure which disk is actually bad on my end. Do you know which slot the disk is in... [14:52:56] !log will restart redis on oresdb at 16:00 UTC - T228045 [14:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:10] !log stop eventlogging mysql consumers on eventlog1002 and eventlogging_sync on db1108 to allow db1107 maintenance [14:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:18] akosiaris halfak fyi this starts in ~5 mins [14:53:48] !log stop mariadb on db1107 to allow maintenance [14:53:49] Thanks, jbond42. 
[14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:33] (03CR) 10Muehlenhoff: [C: 03+2] profile::grafana::production: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523129 (owner: 10Muehlenhoff) [14:57:18] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10wiki_willy) @elukey @RobH - I've marked it as accelerate on the procurement doc. Rob, can you work on getting these two servers included on this pro... [14:57:49] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:58:13] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:58:55] (03CR) 10jenkins-bot: GrowthExperiments: Remove reference to non-existent feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523233 (owner: 10Kosta Harlan) [14:59:50] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523701 (owner: 10Lars Wirzenius) [14:59:55] (03CR) 10jenkins-bot: Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 (owner: 10Jforrester) [15:00:28] halfak, akosiaris i have finished [15:01:22] (03CR) 10jenkins-bot: Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 (owner: 10Jforrester) [15:01:29] (03CR) 10jenkins-bot: tests/Defines.php: Re-synchronise from MW core master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523308 (owner: 10Jforrester) [15:01:32] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 
(10Cmjohnson) one last paste of the idrac log Record: 84 Date/Time: 04/29/2019 09:35:48 Source: system Severity: Non... [15:01:46] jbond42, great. I don't see any evidence of downtime. [15:01:51] On ORES that is [15:02:04] thats good :) [15:02:08] Oh wait. I think we returned a single 500 :) [15:02:18] Which is about as minor of a blip as we might expect. [15:02:18] still not bad :D [15:02:21] (03CR) 10jenkins-bot: Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 (owner: 10Jforrester) [15:02:51] It could even be that that 500 was unrelated. [15:03:01] ACKNOWLEDGEMENT - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo not in production, but pointing master maintenance https://wikitech.wikimedia.org/wiki/HAProxy [15:03:01] ACKNOWLEDGEMENT - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo not in production, but pointing master maintenance https://wikitech.wikimedia.org/wiki/HAProxy [15:04:43] (03PS2) 10Arturo Borrero Gonzalez: toolforge: put the client certs back in for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523716 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:05:33] halfak: cool, i have but fairly acurate times of the restarts in the ticket incase any correlation needs to be done at later date. cheers [15:06:21] PROBLEM - Host db1107.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:06:45] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Cmjohnson) Swapped DIMM A3 with DIMM B3, now we have to powrer the server back on and let it go for a few days to see if the error... [15:07:21] elukey: ^^ forgot down time? [15:07:55] jbond42: nope I am pretty sure I did downtime [15:09:04] elukey: the mgmt interface as well? 
:) [15:09:11] ah no :D [15:09:22] didn't notice it was the mgmt [15:09:41] i didn;t at first my self :) [15:10:01] (03PS1) 10Muehlenhoff: profile::mediawiki::deployment::server: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523735 [15:11:55] RECOVERY - Host db1107.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [15:12:44] !log start mariadb on db1107 and re-enable mysql consumers on eventlog1002 and replication on db1108 [15:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:09] (03PS2) 10Arturo Borrero Gonzalez: toolforge: Switch up using etcd client certs in k8s a little [puppet] - 10https://gerrit.wikimedia.org/r/523726 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:16:13] (03PS1) 10Muehlenhoff: noc: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523738 [15:16:34] (03PS1) 10Jbond: varnishmtail: use -logs /dev/stdin instead of -logfds 0 [puppet] - 10https://gerrit.wikimedia.org/r/523739 (https://phabricator.wikimedia.org/T225604) [15:20:42] (03CR) 10Ayounsi: [C: 03+1] noc: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/523738 (owner: 10Muehlenhoff) [15:22:11] (03PS1) 10Jforrester: [Beta] Fix references to undefined variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523744 [15:22:43] PROBLEM - Host mw1239.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:23:03] liw: OK if I do a Beta-Cluster-only config push? [15:25:28] (03CR) 10Jforrester: [C: 03+2] [Beta] Fix references to undefined variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523744 (owner: 10Jforrester) [15:26:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) Last log paste before clearing the log Record: 4 Date/Time: 11/08/2018 00:18:01 Source: system Severity: Non-Critical Description: Correctable memory error rate... 
[15:26:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) I swapped all the DIMM from side A to side B cleared the log and powered back up. Please put the server back in service and let's see if the reseating worked. [15:27:09] (03Merged) 10jenkins-bot: [Beta] Fix references to undefined variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523744 (owner: 10Jforrester) [15:27:13] (03CR) 10jenkins-bot: [Beta] Fix references to undefined variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/523744 (owner: 10Jforrester) [15:27:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) 05Open→03Resolved I am resolving this ticket, please re-open and ping me if the problem returns. [15:27:30] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019. - https://phabricator.wikimedia.org/T217359 (10RobH) [15:27:57] RECOVERY - Host mw1239.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [15:29:39] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10fgiunchedi) @Cmjohnson should be blinking now ` root@ms-be1043:~# ls -la /dev/disk/by-path/ | grep -i sdk$ lrwxrwxrwx 1 root root 9 May 30 11:06 pci-0000:02:... 
[15:30:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Switch up using etcd client certs in k8s a little [puppet] - 10https://gerrit.wikimedia.org/r/523726 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:32:03] James_F, I don't know [15:33:33] (03PS3) 10Andrew Bogott: openstack: resume VM state on host reboot [puppet] - 10https://gerrit.wikimedia.org/r/522548 (https://phabricator.wikimedia.org/T216040) (owner: 10Jhedden) [15:34:19] (03CR) 10Andrew Bogott: [C: 03+2] openstack: resume VM state on host reboot [puppet] - 10https://gerrit.wikimedia.org/r/522548 (https://phabricator.wikimedia.org/T216040) (owner: 10Jhedden) [15:35:50] (03PS1) 10Bstorm: toolforge-etcd: tell etcd to check client certs [puppet] - 10https://gerrit.wikimedia.org/r/523746 (https://phabricator.wikimedia.org/T215531) [15:36:12] (03Abandoned) 10Bstorm: toolforge-etcd: enable client cert checking [puppet] - 10https://gerrit.wikimedia.org/r/523723 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:36:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I’m not from the termbox team, but I can at least verify that the image exists, so let’s go ahead :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 (owner: 10Tarrow) [15:37:00] !log reboot analytics1072 as attempt to force the raid controller to set a drive failed - T226467 [15:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:08] T226467: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 [15:38:26] ^ mw1239.mgmt is scheduled, I forgot to downtime mgmt [15:43:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [15:43:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is 
CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [15:44:11] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10Andrew) [15:46:47] (03CR) 10Tarrow: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 (owner: 10Tarrow) [15:47:02] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) @RStallman-legalteam does the NDA HR sent you meet the needs here? Thanks! [15:47:45] (03CR) 10Tarrow: [V: 03+2] "Looks like it also needs to be manually marked verified" [deployment-charts] - 10https://gerrit.wikimedia.org/r/523124 (owner: 10Tarrow) [15:48:10] (03PS1) 10Jbond: hiera: use hiera v1 backend in labs [puppet] - 10https://gerrit.wikimedia.org/r/523752 (https://phabricator.wikimedia.org/T228174) [15:48:14] 10Operations, 10ops-eqiad, 10Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10elukey) Seems to have worked: ` elukey@analytics1072:~$ sudo megacli -PDList -aALL | grep "Firmware state" Firmware state: Unconfigured(good), Spun Up Firmware state: Online, Spun Up Firmware... 
[15:49:23] (03CR) 10Jbond: [C: 03+2] hiera: use hiera v1 backend in labs [puppet] - 10https://gerrit.wikimedia.org/r/523752 (https://phabricator.wikimedia.org/T228174) (owner: 10Jbond) [15:49:38] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5 [15:49:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5 [15:52:00] RECOVERY - MegaRAID on analytics1072 is OK: OK: optimal, 12 logical, 13 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:52:21] liw: No worries, I pushed it. [15:52:23] (03PS1) 10RobH: setting kafka-main100[1-5] prod dns [dns] - 10https://gerrit.wikimedia.org/r/523753 (https://phabricator.wikimedia.org/T226274) [15:52:45] (03CR) 10jerkins-bot: [V: 04-1] setting kafka-main100[1-5] prod dns [dns] - 10https://gerrit.wikimedia.org/r/523753 (https://phabricator.wikimedia.org/T226274) (owner: 10RobH) [15:54:19] :) [15:57:54] !log tarrow@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . 
[15:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:44] (03Abandoned) 10Bstorm: toolforge: put the client certs back in for etcd [puppet] - 10https://gerrit.wikimedia.org/r/523716 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:58:48] (03PS2) 10RobH: setting kafka-main100[1-5] prod dns [dns] - 10https://gerrit.wikimedia.org/r/523753 (https://phabricator.wikimedia.org/T226274) [15:59:03] (03PS2) 10Bstorm: toolforge-etcd: tell etcd to check client certs [puppet] - 10https://gerrit.wikimedia.org/r/523746 (https://phabricator.wikimedia.org/T215531) [15:59:15] (03CR) 10jerkins-bot: [V: 04-1] setting kafka-main100[1-5] prod dns [dns] - 10https://gerrit.wikimedia.org/r/523753 (https://phabricator.wikimedia.org/T226274) (owner: 10RobH) [15:59:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, please merge." [puppet] - 10https://gerrit.wikimedia.org/r/523746 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:59:49] (03PS1) 10Jbond: hiera backend: version 1 cant handle the throw [puppet] - 10https://gerrit.wikimedia.org/r/523756 [15:59:54] (03PS3) 10RobH: setting kafka-main100[1-5] prod dns [dns] - 10https://gerrit.wikimedia.org/r/523753 (https://phabricator.wikimedia.org/T226274) [16:00:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "For the record, tested with:" [puppet] - 10https://gerrit.wikimedia.org/r/523746 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [16:00:04] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190716T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:30] (03CR) 10Bstorm: [C: 03+2] toolforge-etcd: tell etcd to check client certs [puppet] - 10https://gerrit.wikimedia.org/r/523746 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [16:00:38] (03CR) 10jerkins-bot: [V: 04-1] hiera backend: version 1 cant handle the throw [puppet] - 10https://gerrit.wikimedia.org/r/523756 (owner: 10Jbond) [16:01:10] (03CR) 10RobH: [C: 03+2] setting kafka-main100[1-5] prod dns [dns] - 10https://gerrit.wikimedia.org/r/523753 (https://phabricator.wikimedia.org/T226274) (owner: 10RobH) [16:02:03] (03CR) 10Effie Mouzeli: [C: 04-1] "I think it would be better if we would make connect_timeout configurable via hiera, than changing the default" [puppet] - 10https://gerrit.wikimedia.org/r/523194 (https://phabricator.wikimedia.org/T228063) (owner: 10EBernhardson) [16:05:46] 10Operations, 10ops-eqiad: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10RobH) p:05Triage→03Normal [16:12:48] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF) [16:16:38] (03PS1) 10RobH: kafka-main100[1-5].eqiad.wmnet install parameters [puppet] - 10https://gerrit.wikimedia.org/r/523757 (https://phabricator.wikimedia.org/T226274) [16:17:23] (03CR) 10RobH: [C: 03+2] kafka-main100[1-5].eqiad.wmnet install parameters [puppet] - 10https://gerrit.wikimedia.org/r/523757 (https://phabricator.wikimedia.org/T226274) (owner: 10RobH) [16:18:25] (03PS2) 10Jbond: hiera backend: version 1 cant handle the throw [puppet] - 10https://gerrit.wikimedia.org/r/523756 [16:18:54] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 4 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10Jdforrester-WMF) [16:18:58] 10Operations, 
10Release Pipeline, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF) [16:23:49] (03PS2) 10Jforrester: Add mediawiki development chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/522584 (owner: 10Jeena Huneidi) [16:28:07] !log reindexing wikidata (elastic@eqiad) T227136 [16:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:15] T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136 [16:30:18] (03PS3) 10Jbond: hiera backend: version 1 cant handle the throw [puppet] - 10https://gerrit.wikimedia.org/r/523756 [16:31:40] (03CR) 10Jbond: [C: 03+2] hiera backend: version 1 cant handle the throw [puppet] - 10https://gerrit.wikimedia.org/r/523756 (owner: 10Jbond) [16:35:46] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@5d8128e]: Migrating videoscaling jobs to PHP7 - T219150 [16:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:53] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [16:36:37] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@5d8128e]: Migrating videoscaling jobs to PHP7 - T219150 (duration: 00m 50s) [16:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:42] 10Operations, 10ops-eqiad: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10RobH) [16:38:54] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) [16:39:31] (03PS1) 10Ema: 0.3: implement fifo-log-tailer in go [software/fifo-log-demux] - 
10https://gerrit.wikimedia.org/r/523768 (https://phabricator.wikimedia.org/T227668) [16:39:36] Hey! I'm not really sure who to ask but I'm getting "filesystem layer verification failed for digest" for all the images I can think of pulling from docker-registry.wikimedia.org [16:41:45] fsero: ^ ? [16:41:58] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link [16:43:02] (03PS2) 10Ema: 0.3: implement fifo-log-tailer in go [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/523768 (https://phabricator.wikimedia.org/T227668) [16:43:26] tarrow: can share the images please? [16:44:00] fsero: sure for example: `docker pull docker-registry.wikimedia.org/releng/quibble-jessie` [16:44:15] I noticed a lot of : "straight insufficient bytes" in logstash [16:44:16] Lemme check give me some mins [16:45:09] That is what I know worked for me before locally. I actually noticed the problem because I'm getting an ImagePullBackoff on my k8s staging attempt [16:45:26] which was for `docker-registry.wikimedia.org/wikimedia/wikibase-termbox:2019-07-12-144625-production` [16:47:00] Your k8s staging is the staging cluster or a local one? [16:47:08] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review, and 2 others: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the