[00:12:54] Operations, Traffic, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Krinkle) Not sure if this is merely a display issue, but I see fairly odd buckets on the dashboard: * 0ms - 438... [00:38:27] (PS1) Papaul: Add thanos-bw200[1234] MAC address and to role insetup [puppet] - https://gerrit.wikimedia.org/r/596529 (https://phabricator.wikimedia.org/T251634) [00:41:49] (PS2) Papaul: Add thanos-bw200[1234] MAC address and to role insetup [puppet] - https://gerrit.wikimedia.org/r/596529 (https://phabricator.wikimedia.org/T251634) [00:42:04] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [00:43:46] (CR) Papaul: [C: +2] Add thanos-bw200[1234] MAC address and to role insetup [puppet] - https://gerrit.wikimedia.org/r/596529 (https://phabricator.wikimedia.org/T251634) (owner: Papaul) [00:59:51] Operations, Android-app-Bugs, Traffic, Wikipedia-Android-App-Backlog, and 4 others: When fetching siteinfo from the MediaWiki API, set the `maxage` and `smaxage` parameters. - https://phabricator.wikimedia.org/T245033 (JMinor) Open→Resolved [01:01:25] Operations, Traffic, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (dpifke) That's because I forgot to change query format to "heatmap" in the panel settings. :) Fixed. 
[01:03:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:05:52] Operations, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (AntiCompositeNumber) [01:07:02] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:09:50] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [01:21:02] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:22:12] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:57:03] (CR) Andrew Bogott: [C: +2] cloud: Whitelist testlabs-dns-manager for access from cloud subnets [puppet] - https://gerrit.wikimedia.org/r/596528 (https://phabricator.wikimedia.org/T252732) (owner: Alex Monk) [02:43:38] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:49:18] PROBLEM - OSPF status on 
cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:53:54] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [02:54:06] PROBLEM - traffic-pool service on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:54:06] PROBLEM - Logs skipped by trafficserver-tls on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/ATS [02:54:06] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [02:54:06] PROBLEM - Ensure traffic_manager is running for instance backend on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:54:06] PROBLEM - Ensure traffic_manager is running for instance tls on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:54:07] PROBLEM - Confd vcl based reload on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/Varnish [02:54:07] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp5006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:54:08] PROBLEM - Default ATS Lua configuration file on cp5006 is CRITICAL: CHECK_NRPE: Error - 
Could not connect to 10.132.0.106: Connection reset by peer https://wikitech.wikimedia.org/wiki/ATS [02:54:34] PROBLEM - TLS Lua configuration file on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ATS [02:54:34] PROBLEM - purged service on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:54:38] PROBLEM - Check systemd state on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:54:38] PROBLEM - Ensure traffic_server is running for instance tls on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:54:42] PROBLEM - check_trafficserver_backend_config_status on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:54:42] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:54:58] PROBLEM - Webrequests Varnishkafka log producer on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [02:55:12] PROBLEM - confd service on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:55:26] PROBLEM - Ensure traffic_server is running for instance backend on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:55:26] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS acme-chief- on cp5006 
is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [02:55:42] PROBLEM - dhclient process on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [02:55:54] PROBLEM - Logs skipped by trafficserver on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ATS [02:57:32] PROBLEM - puppet last run on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:57:44] PROBLEM - check_trafficserver_log_fifo_analytics_tls on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [02:58:56] PROBLEM - MD RAID on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [02:59:12] PROBLEM - check_trafficserver_log_fifo_notpurge_backend on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [03:01:22] PROBLEM - Check the NTP synchronisation status of timesyncd on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [03:01:28] PROBLEM - Check the last execution of trafficserver_tls_stek_job on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [03:02:30] PROBLEM - DPKG on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [03:04:37] Operations, Traffic: Maxmind data update 
issues for DNS (and others?) - https://phabricator.wikimedia.org/T252577 (wkandek) Just FYI: my machine is being served from `esams` again. [03:10:14] PROBLEM - Disk space on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5006&var-datasource=eqsin+prometheus/ops [03:14:42] PROBLEM - configured eth on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [03:21:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:21:04] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:33:46] PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: connect to address 10.132.0.106 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [03:36:34] (PS1) Tim Starling: Explicitly set SwiftFileBackend timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/596538 (https://phabricator.wikimedia.org/T245170) [03:39:16] (CR) Tim Starling: "Needs +1 with approval for self-merge and deploy." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/596538 (https://phabricator.wikimedia.org/T245170) (owner: Tim Starling) [04:28:45] !log depool and reboot cp5006 [04:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:36] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [04:32:44] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 231.28 ms [04:32:48] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS- on cp5006 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [04:32:56] RECOVERY - Logs skipped by trafficserver on cp5006 is OK: OK: no matches found in journal for unit trafficserver https://wikitech.wikimedia.org/wiki/ATS [04:33:02] RECOVERY - traffic-pool service on cp5006 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:33:04] RECOVERY - Default ATS Lua configuration file on cp5006 is OK: OK https://wikitech.wikimedia.org/wiki/ATS [04:33:04] RECOVERY - Ensure traffic_manager is running for instance tls on cp5006 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:33:06] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp5006 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [04:33:06] RECOVERY - Ensure traffic_manager is running for instance backend on cp5006 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:33:06] RECOVERY - Logs skipped by trafficserver-tls on cp5006 is OK: OK: no matches found in journal for unit trafficserver-tls https://wikitech.wikimedia.org/wiki/ATS [04:33:06] RECOVERY - Confd vcl based reload on cp5006 is OK: reload-vcl has not been executed yet. 
https://wikitech.wikimedia.org/wiki/Varnish [04:33:32] RECOVERY - TLS Lua configuration file on cp5006 is OK: OK https://wikitech.wikimedia.org/wiki/ATS [04:33:36] RECOVERY - Ensure traffic_server is running for instance tls on cp5006 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:33:38] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp5006 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:33:40] RECOVERY - check_trafficserver_backend_config_status on cp5006 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:33:44] RECOVERY - Disk space on cp5006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp5006&var-datasource=eqsin+prometheus/ops [04:34:00] RECOVERY - Check the NTP synchronisation status of timesyncd on cp5006 is OK: OK: synced at Fri 2020-05-15 04:33:58 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [04:34:08] RECOVERY - confd service on cp5006 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:34:22] RECOVERY - Freshness of OCSP Stapling files -ATS-TLS acme-chief- on cp5006 is OK: OK https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [04:34:22] RECOVERY - Ensure traffic_server is running for instance backend on cp5006 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:35:06] RECOVERY - DPKG on cp5006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [04:35:32] RECOVERY - IPMI Sensor Status on cp5006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [04:36:34] RECOVERY - MD RAID on cp5006 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [04:37:08] RECOVERY - puppet last run on cp5006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [04:37:20] PROBLEM - Check systemd state on cp5006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:14] RECOVERY - Check the last execution of trafficserver_tls_stek_job on cp5006 is OK: OK: Status of the systemd unit trafficserver_tls_stek_job https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:42:23] !log repool cp5006 [04:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:43:12] RECOVERY - check_trafficserver_log_fifo_analytics_tls on cp5006 is OK: OK: read 8 bytes as expected https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:43:16] Operations, MediaWiki-General: Segmentation fault creating thumbnail - https://phabricator.wikimedia.org/T159242 (AntiCompositeNumber) Open→Declined That's not a bug in the SVG rendering though, that's a problem with the SVG code itself. The SVG is being generated correctly on the modern thumbnai... [04:44:52] RECOVERY - check_trafficserver_log_fifo_notpurge_backend on cp5006 is OK: OK: read 8 bytes as expected https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [04:47:16] RECOVERY - configured eth on cp5006 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [04:48:12] PROBLEM - HP RAID on ms-be2016 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [04:48:14] ACKNOWLEDGEMENT - HP RAID on ms-be2016 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T252851 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Run [04:48:14] aid_Information_Gathering [04:50:28] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T252851 (ops-monitoring-bot) [04:52:50] !log volker-e@deploy1001 Started deploy [design/style-guide@dc956a3]: Deploy design/style-guide: [04:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:53:00] !log volker-e@deploy1001 Finished deploy [design/style-guide@dc956a3]: Deploy design/style-guide: (duration: 00m 10s) [04:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:12] RECOVERY - dhclient process on cp5006 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [05:34:06] Operations, Wikimedia-Mailing-lists, Malayalam-Sites: Wikiml-l mail archives are empty after August 2019 (moderation enabled but nobody moderates, hence no emails get delivered) - https://phabricator.wikimedia.org/T251554 (Praveenp) Please remove inactive admins and appoint [[ https://ml.wikipedia.or... [05:35:44] !log stop replication on pc2009, pc2010 for benchmarking T252761 [05:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:48] T252761: Degraded performance on parsercache with buster/mariadb upgrade - https://phabricator.wikimedia.org/T252761 [06:24:14] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:50:20] Operations, Traffic, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (Gilles) Open→Resolved I think that I've fixed the display further, the format of the heatmap needed to... 
[06:50:24] Operations, Traffic, Performance-Team (Radar): Depooling single text caching server in esams had a disproportionate performance impact - https://phabricator.wikimedia.org/T238085 (Gilles) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200515T0700) [07:09:17] Operations, Wikimedia-Mailing-lists, Malayalam-Sites: Wikiml-l mail archives are empty after August 2019 (moderation enabled but nobody moderates, hence no emails get delivered) - https://phabricator.wikimedia.org/T251554 (Adithyak1997) > Let us know the new email addresses to add as admins My email... [07:14:39] (CR) Giuseppe Lavagetto: [C: +2] Bump rt-test clients from 12 to 20 [puppet] - https://gerrit.wikimedia.org/r/596496 (owner: Subramanya Sastry) [07:15:29] (PS3) Ema: 5.1.3-1wm15: add 0037-force-discard.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) [07:18:52] Operations, Traffic, Performance-Team (Radar), Sustainability (Incident Prevention): Edge cache response time per server should be monitored - https://phabricator.wikimedia.org/T238086 (ema) >>! In T238086#6139615, @Gilles wrote: > @ema @Vgutierrez you can now use [[ https://grafana.wikimedia.org... 
[07:28:48] (PS1) Ayounsi: BGP: standardize fixed part of IX4/IX6 groups [homer/public] - https://gerrit.wikimedia.org/r/596597 [07:31:26] (PS2) Ayounsi: BGP: standardize fixed part of IX4/IX6 groups [homer/public] - https://gerrit.wikimedia.org/r/596597 [07:31:34] (CR) jerkins-bot: [V: -1] 5.1.3-1wm15: add 0037-force-discard.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) (owner: Ema) [07:32:57] (PS3) Ayounsi: BGP: standardize fixed part of IX4/IX6 groups [homer/public] - https://gerrit.wikimedia.org/r/596597 [07:36:11] !log bumps prefix limit for AS16735 in eqiad [07:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:46] RECOVERY - Check systemd state on cp5006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:05] (CR) Giuseppe Lavagetto: [C: +1] maintenance: Migrate initsitestats to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593772 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [07:44:58] (CR) Filippo Giunchedi: [C: -1] "See inline, change itself LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) (owner: Cwhite) [07:45:05] (CR) Giuseppe Lavagetto: [C: +1] maintenance: Migrate startupregistrystats to periodic_job [puppet] - https://gerrit.wikimedia.org/r/593774 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [07:49:25] (CR) Giuseppe Lavagetto: [C: +1] "While this change seems correct, I think the code would be more readable if, nowadays, we just used a loop over all db sections and define" [puppet] - https://gerrit.wikimedia.org/r/593797 (https://phabricator.wikimedia.org/T211250) (owner: RLazarus) [07:50:24] (PS1) Elukey: Introduce java::java_8 [puppet] - https://gerrit.wikimedia.org/r/596601 [07:51:06] 
PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (7) node(s) change every puppet run: cloudservices2003-dev.wikimedia.org, logstash1010.eqiad.wmnet, logstash1011.eqiad.wmnet, logstash2003.codfw.wmnet, logstash2001.codfw.wmnet, logstash2002.codfw.wmnet, logstash1012.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [07:51:29] (CR) jerkins-bot: [V: -1] Introduce java::java_8 [puppet] - https://gerrit.wikimedia.org/r/596601 (owner: Elukey) [07:52:40] yes yes you are right [07:53:03] (PS2) Elukey: Introduce java::java_8 [puppet] - https://gerrit.wikimedia.org/r/596601 [07:54:05] (CR) jerkins-bot: [V: -1] Introduce java::java_8 [puppet] - https://gerrit.wikimedia.org/r/596601 (owner: Elukey) [07:54:19] <_joe_> lol [07:55:18] (CR) Giuseppe Lavagetto: [C: +1] Remove mc1036/mc2036 from the Redis Nutcracker config [puppet] - https://gerrit.wikimedia.org/r/595810 (https://phabricator.wikimedia.org/T252391) (owner: Elukey) [07:55:52] (CR) Giuseppe Lavagetto: [C: +1] "Please check none of those servers is used for the "lock redis" functionality." [puppet] - https://gerrit.wikimedia.org/r/595810 (https://phabricator.wikimedia.org/T252391) (owner: Elukey) [07:58:32] (Abandoned) Elukey: Introduce java::java_8 [puppet] - https://gerrit.wikimedia.org/r/596601 (owner: Elukey) [08:03:26] Operations, ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T252851 (fgiunchedi) Looks like the battery is unhappy indeed: ` Cache Board Present: True Cache Status: Temporarily Disabled Cache Status Details: Cable Error Cache Ratio: 10% Read / 90% Write Drive Wri... 
[08:03:56] Operations, ops-codfw: BBU faulty on ms-be2016 - https://phabricator.wikimedia.org/T252851 (fgiunchedi) [08:04:58] Operations, ops-codfw: BBU faulty on ms-be2016 - https://phabricator.wikimedia.org/T252851 (fgiunchedi) @Papaul the host is technically fine to be taken down at any time, prior to a `poweroff` from the operating system. I'm guessing a battery replacement is needed here ? [08:06:46] (PS1) Ema: varnish: add abuse_networks to cloud yaml [puppet] - https://gerrit.wikimedia.org/r/596602 (https://phabricator.wikimedia.org/T233945) [08:11:17] (CR) Ema: [C: +2] varnish: add abuse_networks to cloud yaml [puppet] - https://gerrit.wikimedia.org/r/596602 (https://phabricator.wikimedia.org/T233945) (owner: Ema) [08:21:49] (CR) Ema: [C: +2] ATS: cap TTL for cacheable 404 responses to 10 minutes [puppet] - https://gerrit.wikimedia.org/r/595877 (https://phabricator.wikimedia.org/T251537) (owner: Ema) [08:27:36] (PS1) Elukey: Resolve duplicate declaration of openjdk-8 on logstash nodes [puppet] - https://gerrit.wikimedia.org/r/596604 [08:30:26] Operations, Traffic, Patch-For-Review: ATS: Add the ability to check if origin server responses can be cached and their lifetime to the Lua plugin - https://phabricator.wikimedia.org/T251537 (ema) Open→Resolved a:ema Done, 404 TTL capping now in place: ` root@cp3050:~# timeout 1 atslog-b... 
[08:33:08] (CR) Giuseppe Lavagetto: [C: +1] termbox: deploy up to date chart [deployment-charts] - https://gerrit.wikimedia.org/r/596227 (https://phabricator.wikimedia.org/T235411) (owner: JMeybohm) [08:34:34] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22543/" [puppet] - https://gerrit.wikimedia.org/r/596604 (owner: Elukey) [08:41:04] (PS1) Filippo Giunchedi: configmaster: add thanos-query to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/596607 [08:43:13] Operations, Traffic, Patch-For-Review: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (ema) It might be worth experimenting with **enabling** request coalescing for large files. That could help reducing pressure on transient I think, worth giving it a... [08:50:48] (CR) Ema: [V: +2 C: +2] "Builds fine and passes all tests in pbuilder on deneb, ignore CI." [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/596237 (https://phabricator.wikimedia.org/T236754) (owner: Ema) [08:51:18] !log cp2029: try out varnish 5.1.3-1wm15 T236754 [08:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:22] T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754 [08:51:51] (CR) Elukey: [C: +2] role::archiva: move to profile::java::analytics [puppet] - https://gerrit.wikimedia.org/r/596425 (https://phabricator.wikimedia.org/T252767) (owner: Elukey) [08:52:46] volans: thanks for taking care of the iegreview issue. yea, my bad. 
those files were meant to be temp and i just deleted them all, also on cumin1001 [08:53:08] RECOVERY - Disk space on miscweb1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=miscweb1002&var-datasource=eqiad+prometheus/ops [08:53:26] (CR) Elukey: "Also no-op for https://puppet-compiler.wmflabs.org/compiler1001/22544/" [puppet] - https://gerrit.wikimedia.org/r/596604 (owner: Elukey) [08:53:35] that also resolved this alert, back to 70% [08:55:33] (CR) Jbond: Add debian/ directory to the build overlay (WIP) (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/594718 (https://phabricator.wikimedia.org/T233947) (owner: Muehlenhoff) [08:55:35] (CR) Giuseppe Lavagetto: [C: +1] configmaster: add thanos-query to disc_desired_state.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/596607 (owner: Filippo Giunchedi) [08:56:20] (CR) Filippo Giunchedi: [C: +1] "Haven't looked at it in depth but LGTM" [puppet] - https://gerrit.wikimedia.org/r/596604 (owner: Elukey) [08:57:40] (PS1) Vgutierrez: Release 8.0.7-1wm7 [debs/trafficserver] - https://gerrit.wikimedia.org/r/596609 [08:57:42] (PS2) Filippo Giunchedi: configmaster: add thanos-query to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/596607 [08:58:00] (CR) Filippo Giunchedi: configmaster: add thanos-query to disc_desired_state.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/596607 (owner: Filippo Giunchedi) [08:58:41] (CR) Filippo Giunchedi: [C: +2] configmaster: add thanos-query to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/596607 (owner: Filippo Giunchedi) [09:04:25] Operations, Wikimedia-Mailing-lists, Malayalam-Sites: Wikiml-l mail archives are empty after August 2019 (moderation enabled but nobody moderates, hence no emails get delivered) - 
https://phabricator.wikimedia.org/T251554 (Dzahn) Open→Resolved a:Dzahn @Praveenp @Adithyak1997 Thanks for th... [09:05:14] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:14] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:18] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:22] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:30] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:46] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:06:45] hello aqs [09:06:48] checking [09:09:11] Operations, MediaWiki-extensions-CodeReview: Set up static-codereview.wikimedia.org to host static HTML dump of CodeReview - https://phabricator.wikimedia.org/T243056 (Dzahn) Open→Resolved Boldly calling it resolved. Unless I'm missing anything @Legoktm [09:09:33] !log restart druid brokers on druid100[4-6] - locked up due to datasources dropped - T226035 [09:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:37] T226035: Dropping data from druid takes down aqs hosts - https://phabricator.wikimedia.org/T226035 [09:10:04] (PS3) Ayounsi: Add sre.network.prepare-upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/596444 [09:10:38] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:10:40] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:10:44] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:10:48] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:10:58] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:11:04] (CR) Ayounsi: "Thanks for the quick review!" 
(8 comments) [cookbooks] - https://gerrit.wikimedia.org/r/596444 (owner: Ayounsi) [09:11:12] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:21:06] !log cp2029: attempt forced discard of stuck VCL T236754 [09:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:10] T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests - https://phabricator.wikimedia.org/T236754 [09:23:11] Operations, Core Platform Team, MediaWiki-extensions-CentralAuth, TimedMediaHandler, and 5 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (jcrespo) ping @bblack to know if you prefer to make tempor... [09:24:55] (PS2) Jcrespo: bacula: Fix small typos on pool class documentation [puppet] - https://gerrit.wikimedia.org/r/558383 [09:25:40] (CR) Jcrespo: [C: +2] bacula: Fix small typos on pool class documentation [puppet] - https://gerrit.wikimedia.org/r/558383 (owner: Jcrespo) [09:26:00] (PS1) Giuseppe Lavagetto: Make kafka metrics also report the topic [software/purged] - https://gerrit.wikimedia.org/r/596614 [09:30:12] PROBLEM - Varnish frontend child restarted on cp2029 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2029&var-datasource=codfw+prometheus/ops [09:30:34] (Abandoned) Jcrespo: dbproxy: add prometheus node monitoring [puppet] - https://gerrit.wikimedia.org/r/306937 (https://phabricator.wikimedia.org/T126757) (owner: Jcrespo) [09:33:12] Operations, ops-eqord, netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (ayounsi) From Telia after asking them the light levels they're getting. 
> Looks like we are still at times seeing low light and errors in Chicago and transmitting those to San Francis... [09:33:29] (CR) Ema: [C: +1] Release 8.0.7-1wm7 [debs/trafficserver] - https://gerrit.wikimedia.org/r/596609 (owner: Vgutierrez) [09:33:36] (PS1) Jcrespo: mariadb: Remove redundant include of prometheus node_exporter [puppet] - https://gerrit.wikimedia.org/r/596615 (https://phabricator.wikimedia.org/T143896) [09:34:29] (PS2) Jcrespo: mariadb: Remove redundant include of prometheus node_exporter [puppet] - https://gerrit.wikimedia.org/r/596615 (https://phabricator.wikimedia.org/T143896) [09:37:16] (CR) Jcrespo: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22545/" [puppet] - https://gerrit.wikimedia.org/r/596615 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo) [09:37:18] (CR) Ema: [C: +1] "LGTM, please add an entry to d/changelog mentioning this change (added benefit: we get CI tests)" [software/purged] - https://gerrit.wikimedia.org/r/596614 (owner: Giuseppe Lavagetto) [09:38:38] (PS1) Filippo Giunchedi: swift: migrate off swift::params [puppet] - https://gerrit.wikimedia.org/r/596617 (https://phabricator.wikimedia.org/T252537) [09:38:44] (CR) Vgutierrez: [C: +2] Release 8.0.7-1wm7 [debs/trafficserver] - https://gerrit.wikimedia.org/r/596609 (owner: Vgutierrez) [09:39:37] (PS2) Jcrespo: mariadb: Increase core memory usage to 80% of physical memory [puppet] - https://gerrit.wikimedia.org/r/455769 [09:41:10] (CR) Jcrespo: "Let's undig this."
[puppet] - 10https://gerrit.wikimedia.org/r/455769 (owner: 10Jcrespo) [09:42:22] (03CR) 10Filippo Giunchedi: "PCC noop effectively, as expected https://puppet-compiler.wmflabs.org/compiler1002/22546/" [puppet] - 10https://gerrit.wikimedia.org/r/596617 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [09:43:37] (03PS5) 10Jcrespo: mariadb: Remove conditional for system user [puppet] - 10https://gerrit.wikimedia.org/r/461035 (https://phabricator.wikimedia.org/T100501) [09:45:23] (03CR) 10Jcrespo: [C: 03+2] mariadb: Remove conditional for system user [puppet] - 10https://gerrit.wikimedia.org/r/461035 (https://phabricator.wikimedia.org/T100501) (owner: 10Jcrespo) [09:47:17] 10Operations, 10DBA, 10Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (10jcrespo) 05Stalled→03Resolved a:03jcrespo All mysql users are system users. [09:47:21] 10Operations, 10DBA, 10Patch-For-Review: Prepare mysql hosts for stretch - https://phabricator.wikimedia.org/T168356 (10jcrespo) [09:49:21] (03PS2) 10Giuseppe Lavagetto: Make kafka metrics also report the topic [software/purged] - 10https://gerrit.wikimedia.org/r/596614 [09:54:42] (03CR) 10Ema: [C: 03+1] Make kafka metrics also report the topic [software/purged] - 10https://gerrit.wikimedia.org/r/596614 (owner: 10Giuseppe Lavagetto) [09:57:00] !log upload trafficserver 8.0.7-1wm7 to apt.wm.o (buster) [09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:38] (03PS1) 10Zoranzoki21: RESTRouter: Add awawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/596619 (https://phabricator.wikimedia.org/T252865) [10:16:00] (03Abandoned) 10Jcrespo: [WIP] Puppetize netboot installer creation [puppet] - 10https://gerrit.wikimedia.org/r/292906 (owner: 10Jcrespo) [10:17:27] (03CR) 10Jcrespo: "Dzhan do you happen to know what is the status of dependencies?" 
[puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [10:18:41] (03CR) 10Jcrespo: "Nevermind, I can see it here: https://phabricator.wikimedia.org/T162070#6118277" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [10:20:14] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10jcrespo) [10:20:58] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10jcrespo) As per T162070#4942720. [10:21:51] (03CR) 10Dzahn: "@Jcrespo Very close now, just recently there was another surge of activity and i was able to remove it from 2 projects after talking to th" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [10:22:14] RECOVERY - Varnish frontend child restarted on cp2029 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2029&var-datasource=codfw+prometheus/ops [10:22:57] (03CR) 10Jcrespo: "> If these don't answer me soon i will just remove it myself i think." 
[puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [10:24:08] (03CR) 10Volans: [C: 03+2] tests: relax Bandit dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/596077 (owner: 10Volans) [10:25:04] (03CR) 10Volans: [C: 03+2] actions: new module to track cookbook actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/596078 (owner: 10Volans) [10:30:18] (03Merged) 10jenkins-bot: tests: relax Bandit dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/596077 (owner: 10Volans) [10:31:06] (03Merged) 10jenkins-bot: actions: new module to track cookbook actions [software/spicerack] - 10https://gerrit.wikimedia.org/r/596078 (owner: 10Volans) [10:43:40] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10jcrespo) 05Resolved→03Open @Papaul see the FAILED above for db2140, as well as the ` db2140 missing physical device in PuppetDB: state Staged in Net... [10:44:44] 10Operations, 10ops-eqiad, 10DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (10Dzahn) >>! In T218751#6134304, @Cmjohnson wrote: > there are more because of the mw's that need to be decom'd. I did not see a decommission task for them. I'll take care of this soon and link to a decom task. 
[10:46:56] (03PS2) 10Jcrespo: backups: Add backup1002 as the eqiad host for ES db backups [puppet] - 10https://gerrit.wikimedia.org/r/596255 (https://phabricator.wikimedia.org/T79922) [10:49:30] (03PS1) 10Dzahn: icinga: add qchris to contactgroup for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/596624 (https://phabricator.wikimedia.org/T200739) [10:53:32] (03PS2) 10Volans: wmf-auto-reimage: fix autodetected rename MGMT [puppet] - 10https://gerrit.wikimedia.org/r/595931 (https://phabricator.wikimedia.org/T214314) [10:53:34] (03CR) 10Dzahn: [C: 03+2] icinga: add qchris to contactgroup for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/596624 (https://phabricator.wikimedia.org/T200739) (owner: 10Dzahn) [10:54:35] (03CR) 10Elukey: [C: 03+1] "Limited understanding of the code, had a chat with Riccardo and it looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/595931 (https://phabricator.wikimedia.org/T214314) (owner: 10Volans) [11:00:02] (03PS3) 10Jcrespo: backups: Add backup1002 as the eqiad host for ES db backups [puppet] - 10https://gerrit.wikimedia.org/r/596255 (https://phabricator.wikimedia.org/T79922) [11:00:48] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10Dzahn) [11:01:10] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10Dzahn) [11:01:25] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10Dzahn) p:05Triage→03Medium [11:01:56] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10Dzahn) [11:02:34] (03CR) 10Jcrespo: "Let me know what you think of the server selection:" [puppet] - 10https://gerrit.wikimedia.org/r/596255 
(https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [11:03:58] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10Dzahn) @KFrancis Could you please contact @QChris to start the NDA process? Or confirm if one already exists (maybe from the past)? Thanks! [11:05:28] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10Dzahn) @QChris Katie will need some details from you. Sorry if this already happened. The background here is to get you Icinga access as we have talked about. [11:07:56] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) I have since gotten replies from 2 project owners and was able to remove the role from their instances. 2 are still missing b... [11:10:10] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10ayounsi) If they're dead: * Either we need them (eg. short on ports), and in that case we need to replace the switch. Which is a heavy operations. * Or we mark the ports... 
[11:11:08] (03CR) 10Dzahn: "also https://phabricator.wikimedia.org/T252875" [puppet] - 10https://gerrit.wikimedia.org/r/596624 (https://phabricator.wikimedia.org/T200739) (owner: 10Dzahn) [11:21:15] (03PS5) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [11:21:39] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [11:26:35] (03PS6) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [11:38:10] (03PS1) 10Ema: 5.1.3-1wm15: don't set temperature to cold [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/596626 (https://phabricator.wikimedia.org/T236754) [11:54:24] (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm15: don't set temperature to cold [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/596626 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema) [12:15:40] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [12:16:15] 10Operations, 10ops-eqiad, 10netops: Three ports on asw2-d-eqiad are not working as expected - https://phabricator.wikimedia.org/T247881 (10faidon) If three ports are permanently failed, I'm not sure how we could ever trust that switch again. Perhaps it's better to do a painful but //planned// replacement ra... 
[12:16:34] (03PS1) 10Elukey: role::druid::analytics::worker: move config to /srv [puppet] - 10https://gerrit.wikimedia.org/r/596633 (https://phabricator.wikimedia.org/T252771) [12:20:38] PROBLEM - Varnish frontend child restarted on cp3050 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3050&var-datasource=esams+prometheus/ops [12:21:19] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/22547/druid1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596633 (https://phabricator.wikimedia.org/T252771) (owner: 10Elukey) [12:28:26] PROBLEM - Varnish frontend child restarted on cp2029 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2029&var-datasource=codfw+prometheus/ops [12:32:27] (03CR) 10Elukey: [C: 03+2] role::druid::analytics::worker: move config to /srv [puppet] - 10https://gerrit.wikimedia.org/r/596633 (https://phabricator.wikimedia.org/T252771) (owner: 10Elukey) [12:54:20] (03PS1) 10Kormat: Rename 'mysql' module to 'mysql_legacy' [software/spicerack] - 10https://gerrit.wikimedia.org/r/596638 [12:55:32] (03PS1) 10Kormat: Update cookbooks for 'mysql' -> 'mysql_legacy' rename. [cookbooks] - 10https://gerrit.wikimedia.org/r/596639 [12:57:11] (03CR) 10jerkins-bot: [V: 04-1] Update cookbooks for 'mysql' -> 'mysql_legacy' rename. [cookbooks] - 10https://gerrit.wikimedia.org/r/596639 (owner: 10Kormat) [13:01:58] !log increasing sysctl net.ipv4.udp_mem on netflow3001 [13:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:12] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [13:05:22] duh? 
[13:05:25] * vgutierrez checking [13:05:36] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:29] May 15 13:00:04 acmechief1001 acme-chief-backend[27544]: requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: http://ocsp.int-x3.letsencrypt.org/ [13:06:30] hmmm [13:08:27] https://letsencrypt.status.io/ according to their status page, the OCSP responder seems to be up [13:10:28] (03PS1) 10Jbond: puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 [13:10:59] 10Operations, 10Acme-chief, 10Traffic: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (10Vgutierrez) [13:11:02] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 (owner: 10Jbond) [13:11:57] (03PS2) 10Jbond: puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 [13:12:31] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 (owner: 10Jbond) [13:13:42] !log increase samplicator recvbuf on netflow3001 & restart samplicator [13:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:07] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [13:17:35] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:46] (03PS3) 10Jbond: puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 [13:25:22] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/22549/" [puppet] - 
10https://gerrit.wikimedia.org/r/596640 (owner: 10Jbond) [13:33:18] (03CR) 10Jcrespo: [C: 03+2] Firewall.py: Store target_host as an instance property of Firewall object [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/596464 (https://phabricator.wikimedia.org/T252172) (owner: 10Privacybatm) [13:34:18] (03PS1) 10CDanis: samplicator: embiggen SO_RCVBUF to prevent drops [puppet] - 10https://gerrit.wikimedia.org/r/596642 [13:35:28] (03CR) 10BBlack: [C: 03+1] samplicator: embiggen SO_RCVBUF to prevent drops [puppet] - 10https://gerrit.wikimedia.org/r/596642 (owner: 10CDanis) [13:35:49] (03PS4) 10Jbond: puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 [13:37:35] !log rsyncing gerrit git data from gerrit1001 to gerrit1002 (T200739) [13:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:38] T200739: Upgrade to Gerrit 2.16.13 - https://phabricator.wikimedia.org/T200739 [13:38:04] (03CR) 10CDanis: [C: 03+2] "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1002/22550/netflow3001.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596642 (owner: 10CDanis) [13:38:30] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/596640 (owner: 10Jbond) [13:40:35] (03PS1) 10CDanis: samplicator: add service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/596643 [13:42:19] !log upgrade ats to version 8.0.7-1wm8 on cp4032 [13:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:43] (03CR) 10CDanis: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1002/22552/netflow3001.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596643 (owner: 10CDanis) [13:42:55] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) @jcrespo the server first boot was set to NIC1 and not Hard drive 1 so when it 
completed the OS install the first time re rebooted to PXE again it... [13:44:18] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2140.codfw.wmnet ` The log can be found in `/var... [13:44:22] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2140.codfw.wmnet'] ` Of which those **FAILED**: ` ['db2140.codfw.wmnet'] ` [13:44:31] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` db2140.codfw.wmnet ` The log can be found in `/var... 
[13:45:29] 10Operations, 10Acme-chief, 10Traffic: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (10Vgutierrez) p:05Triage→03Medium [13:47:12] !log downgrade ats to version 8.0.7-1wm7 on cp4032 [13:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:33] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:47:49] (03PS7) 10Privacybatm: transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) [13:47:57] !log cp2029, cp3050: varnish-fe-restart to clear 'child restarted' alerts [13:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:01] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp4032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:48:13] (03CR) 10jerkins-bot: [V: 04-1] transfer.py: Add the ability to auto-detect free port for netcat to listen [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [13:48:37] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:49:17] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp4032 is OK: HTTP OK: HTTP/1.0 200 OK - 22733 bytes in 0.246 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:49:17] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4032 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:49:47] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp4032 is OK: HTTP OK: HTTP/1.1 200 Ok - 31785 bytes in 0.402 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:49:49] RECOVERY - Varnish frontend child restarted on cp3050 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3050&var-datasource=esams+prometheus/ops [13:50:05] (03CR) 10Ppchelko: [C: 03+2] "This is not used and we gotta remove it. But for now for consistency - good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/596619 (https://phabricator.wikimedia.org/T252865) (owner: 10Zoranzoki21) [13:50:15] RECOVERY - Varnish frontend child restarted on cp2029 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp2029&var-datasource=codfw+prometheus/ops [13:50:23] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4032 is OK: HTTP OK: HTTP/1.1 200 Ok - 34620 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:50:24] (03Merged) 10jenkins-bot: RESTRouter: Add awawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/596619 (https://phabricator.wikimedia.org/T252865) (owner: 10Zoranzoki21) [13:51:03] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp4032 is OK: HTTP OK: HTTP/1.0 200 OK - 25173 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:51:05] (03CR) 10Jcrespo: "I like the port=0, it looks much nicer and retains compatibility." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [13:53:42] 10Operations, 10fundraising-tech-ops: set up offhost_backups for fundraising frnetmon role - https://phabricator.wikimedia.org/T252882 (10Jgreen) [13:54:41] (03CR) 10Privacybatm: "> Regarding tests, as this is a non-trivial change, I wonder if we should have some kind of integration tests (not unit, and not part of a" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [13:55:33] (03CR) 10Jcrespo: "Let's start commenting new methods from now on 0:-D" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [13:55:52] (03PS5) 10Jbond: puppetmaster: add type checking [puppet] - 10https://gerrit.wikimedia.org/r/596640 [13:56:36] 10Operations, 10fundraising-tech-ops: set up offhost_backups for fundraising frnetmon role - https://phabricator.wikimedia.org/T252882 (10Jgreen) 05Open→03Resolved a:05Dwisehaupt→03Jgreen [13:56:38] 10Operations, 10fundraising-tech-ops: rack/setup/install frnetmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T232137 (10Jgreen) [13:58:10] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/596640 (owner: 10Jbond) [14:01:34] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [14:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:42] (03PS1) 10DCausse: [wdqs] do not transfer the aliases when reload wikidata.jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/596645 [14:01:53] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:04:00] (03PS2) 10DCausse: [wdqs] do not transfer the aliases when reloading wikidata.jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/596645 [14:04:04] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:25] PROBLEM - Samplicator process on netflow4001 is CRITICAL: PROCS CRITICAL: 0 processes with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:04:31] PROBLEM - Samplicator process on netflow1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:05:11] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:37] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:12] cdanis: ^ [14:06:13] PROBLEM - Samplicator process on netflow2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:06:15] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:22] argh [14:06:33] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:41] PROBLEM - Samplicator process on netflow5001 is CRITICAL: PROCS CRITICAL: 0 processes with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:06:41] (03PS1) 10CDanis: Revert "samplicator: embiggen SO_RCVBUF to prevent drops" [puppet] - 10https://gerrit.wikimedia.org/r/596646 [14:06:42] ty bblack [14:06:51] (03CR) 10CDanis: [V: 03+2 C: 03+2] Revert "samplicator: embiggen SO_RCVBUF to prevent drops" [puppet] - 10https://gerrit.wikimedia.org/r/596646 (owner: 10CDanis) [14:06:59] (03CR) 10Ema: [C: 03+1] prometheus: export NIC firmware versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [14:07:19] hmmm /usr/bin/samplicate: invalid option -- 'B' [14:07:29] (03PS1) 10Vgutierrez: Release 8.0.7-1wm8 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596647 [14:07:31] (03CR) 10Privacybatm: "> Patch Set 7:" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/595516 (https://phabricator.wikimedia.org/T252171) (owner: 10Privacybatm) [14:07:34] lowercase apparently heh [14:07:37] it's `-b` yeah [14:07:49] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:08:07] RECOVERY - Samplicator process on netflow2001 is OK: PROCS OK: 1 process with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:08:07] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:13] RECOVERY - Samplicator process on netflow4001 is OK: PROCS OK: 1 process with command name samplicate 
https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:08:17] RECOVERY - Samplicator process on netflow1001 is OK: PROCS OK: 1 process with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:08:27] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:28] sigh :( [14:08:35] RECOVERY - Samplicator process on netflow5001 is OK: PROCS OK: 1 process with command name samplicate https://wikitech.wikimedia.org/wiki/Netflow%23Process [14:08:57] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:25] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:30] (03PS1) 10CDanis: samplicator: *properly* embiggen SO_RCVBUF [puppet] - 10https://gerrit.wikimedia.org/r/596648 [14:09:53] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2140.codfw.wmnet'] ` and were **ALL** successful. 
[14:10:04] (03PS2) 10CDanis: samplicator: *properly* embiggen SO_RCVBUF [puppet] - 10https://gerrit.wikimedia.org/r/596648 [14:10:39] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10jcrespo) 05Open→03Resolved Thanks, @Papaul [14:12:38] (03PS2) 10Vgutierrez: Release 8.0.7-1wm8 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596647 [14:14:07] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 31st May) rack/setup/install db213[6-9] and db2140 - https://phabricator.wikimedia.org/T251639 (10Papaul) @jcrespo all good sorry about the problem [14:14:20] (03CR) 10CDanis: [C: 03+2] samplicator: *properly* embiggen SO_RCVBUF [puppet] - 10https://gerrit.wikimedia.org/r/596648 (owner: 10CDanis) [14:14:26] !log disable puppet on netflow* [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:51] (03PS1) 10Jbond: puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) [14:18:04] okay, no diffs from hand config on netflow3001, and worked fine on netflow5001 as well [14:18:08] !log re-enable puppet on netflow* [14:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:34] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [14:19:55] !log reverting sysctl net.ipv4.udp_mem to original on netflow3001 [14:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:51] (03PS4) 10Jcrespo: backups: Add backup1002 as the eqiad host for ES db backups [puppet] - 10https://gerrit.wikimedia.org/r/596255 (https://phabricator.wikimedia.org/T79922) [14:22:53] (03PS1) 10Jcrespo: install: Disable reimage of db114[1-9], db213[6-9] and db2140 
[puppet] - 10https://gerrit.wikimedia.org/r/596650 (https://phabricator.wikimedia.org/T252512) [14:23:17] (03PS2) 10Jcrespo: install: Disable reimage of db114[1-9], db213[6-9] and db2140 [puppet] - 10https://gerrit.wikimedia.org/r/596650 (https://phabricator.wikimedia.org/T252512) [14:25:02] (03PS1) 10Giuseppe Lavagetto: cache::text: enable consuming from kafka everywhere [puppet] - 10https://gerrit.wikimedia.org/r/596651 (https://phabricator.wikimedia.org/T133821) [14:26:58] (03CR) 10Gehel: [C: 03+1] "LGTM. As discussed, a cleaner option would be to not duplicate the declaration of the JDK but rely on a common profile. This can come late" [puppet] - 10https://gerrit.wikimedia.org/r/596604 (owner: 10Elukey) [14:29:02] (03CR) 10Gehel: [C: 03+2] [wdqs] do not transfer the aliases when reloading wikidata.jnl [cookbooks] - 10https://gerrit.wikimedia.org/r/596645 (owner: 10DCausse) [14:30:00] (03CR) 10Ppchelko: [C: 03+1] "Exciting!" [puppet] - 10https://gerrit.wikimedia.org/r/596651 (https://phabricator.wikimedia.org/T133821) (owner: 10Giuseppe Lavagetto) [14:30:46] (03PS3) 10Vgutierrez: Release 8.0.7-1wm8 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596647 [14:35:01] (03CR) 10Ema: [C: 03+1] Release 8.0.7-1wm8 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/596647 (owner: 10Vgutierrez) [14:38:55] (03PS1) 10DCausse: [wdqs] fix DCAT-AP reload and load it to the categories endpoint [puppet] - 10https://gerrit.wikimedia.org/r/596655 [14:39:05] (03PS2) 10Filippo Giunchedi: swift: migrate off swift::params [puppet] - 10https://gerrit.wikimedia.org/r/596617 (https://phabricator.wikimedia.org/T252537) [14:39:07] (03PS1) 10Filippo Giunchedi: hieradata: move swift drives variables to common [puppet] - 10https://gerrit.wikimedia.org/r/596656 (https://phabricator.wikimedia.org/T252537) [14:39:09] (03PS1) 10Filippo Giunchedi: swift: read hash_path_suffix with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/596657 (https://phabricator.wikimedia.org/T252537) 
[14:39:11] (03PS1) 10Filippo Giunchedi: swift: enable s3api [puppet] - 10https://gerrit.wikimedia.org/r/596658 (https://phabricator.wikimedia.org/T252186) [14:39:21] (03CR) 10jerkins-bot: [V: 04-1] [wdqs] fix DCAT-AP reload and load it to the categories endpoint [puppet] - 10https://gerrit.wikimedia.org/r/596655 (owner: 10DCausse) [14:40:25] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:42:05] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:42:09] (03PS2) 10CDanis: samplicator: add service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/596643 [14:42:11] (03PS1) 10CDanis: fastnetmon: fix comment mispaste [puppet] - 10https://gerrit.wikimedia.org/r/596659 [14:42:13] (03PS1) 10CDanis: pmacct: embiggen rcvbuf [puppet] - 10https://gerrit.wikimedia.org/r/596660 [14:43:32] (03CR) 10jerkins-bot: [V: 04-1] pmacct: embiggen rcvbuf [puppet] - 10https://gerrit.wikimedia.org/r/596660 (owner: 10CDanis) [14:44:39] (03PS2) 10CDanis: pmacct: embiggen rcvbuf [puppet] - 10https://gerrit.wikimedia.org/r/596660 [14:44:48] (03CR) 10DCausse: "according to an alias file it seems to have break around 2019-10-31" [puppet] - 10https://gerrit.wikimedia.org/r/596655 (owner: 10DCausse) [14:47:18] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Carly Bogen - https://phabricator.wikimedia.org/T252887 (10CBogen) [14:51:59] (03PS3) 10Cwhite: profile: add anchor to mailman monitoring section [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) [14:52:15] (03CR) 10CDanis: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1001/22554/netflow3001.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/596660 (owner: 10CDanis) [14:52:27] (03CR) 10Cwhite: profile: add 
anchor to mailman monitoring section (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [14:53:04] (03CR) 10Elukey: [C: 03+2] Resolve duplicate declaration of openjdk-8 on logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/596604 (owner: 10Elukey) [14:56:15] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:59:02] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add anchor to mailman monitoring section [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [15:02:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:03:59] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [15:04:07] FFS :) [15:04:13] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:21] (03PS1) 10Elukey: role::druid::analytics::worker: set java.io.tmpdir=/srv/druid/tmp [puppet] - 10https://gerrit.wikimedia.org/r/596662 (https://phabricator.wikimedia.org/T252771) [15:07:28] (03CR) 10Elukey: [C: 03+2] role::druid::analytics::worker: set java.io.tmpdir=/srv/druid/tmp [puppet] - 10https://gerrit.wikimedia.org/r/596662 (https://phabricator.wikimedia.org/T252771) (owner: 10Elukey) [15:20:52] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WQDS Data Reload - https://phabricator.wikimedia.org/T252068 (10Gehel) Note: there was a bug (now [[ https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/596645 | fixed ]]) in the cookbook that transferred categories... [15:22:13] 10Operations, 10ORES, 10Scoring-platform-team: [Discuss] ORES without celery - https://phabricator.wikimedia.org/T216838 (10Halfak) I don't know if we discuss de-duplication in here. If not, we should. 
[15:24:11] (03PS2) 10Jbond: puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) [15:27:28] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:02] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [15:31:08] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:04] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:22] 10Operations, 10Acme-chief, 10Traffic: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (10Vgutierrez) OCSP responder issues reported to LE in https://community.letsencrypt.org/t/ocsp-responder-returning-503-errors/122846 [15:36:28] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:13] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] (03PS5) 10Ppchelko: Add tool and configuration for generating beta configuration from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:44:39] 10Operations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10CDanis) [15:44:52] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:44:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:32] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:11] (03CR) 10Ppchelko: [C: 03+1] "Hugh has verified this works, so LGTM. I'll +1 it for now to let Alex chip in on where would we put this. Maybe we need a separate 'tools'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/596209 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan) [15:49:06] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:13] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:34] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [15:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:26] !log gehel@cumin2001 START - Cookbook sre.wdqs.data-transfer [15:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:06] (03CR) 10Cwhite: [C: 03+2] profile: add anchor to mailman monitoring section [puppet] - 10https://gerrit.wikimedia.org/r/596517 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [15:59:19] hey jayme o/, is this tmux session still needed? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=Long+running+screen%2Ftmux [16:00:57] 10Operations, 10observability, 10Patch-For-Review: Monitor mailman outbound mail queue - https://phabricator.wikimedia.org/T236505 (10colewhite) 05Open→03Resolved Monitoring deployed and updated some docs as well. 
[16:02:54] !log gehel@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:25] 10Operations: Add annotations from ops vendor maintenance calendar to Grafana - https://phabricator.wikimedia.org/T223934 (10colewhite) [16:03:27] 10Operations, 10observability: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) [16:08:57] 10Operations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10CDanis) Most transit providers don't participate in RIPE Atlas. Here's the ones who do, in order of CAIDA AS rank: * NTT [[ https://atlas.ripe.net/probes/6066/ | us-atl-as291... [16:10:53] (03CR) 10Volans: [C: 03+1] "LGTM, as agreed on IRC. I can make a new release on Monday" [software/spicerack] - 10https://gerrit.wikimedia.org/r/596638 (owner: 10Kormat) [16:10:59] (03PS1) 10Elukey: Add role::druid::analytics::worker to an-druid100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/596678 (https://phabricator.wikimedia.org/T252771) [16:11:16] 10Operations, 10observability: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) Loki looks like a feasible option to try given the resource constraints on the Grafana VM. It appears there is headroom on the host long as we keep events reasonably l... 
[16:11:26] 10Operations, 10observability: Leverage Grafana annotations to show events in graphs - https://phabricator.wikimedia.org/T222826 (10colewhite) a:03colewhite [16:12:27] (03PS2) 10Elukey: Add role::druid::analytics::worker to an-druid100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/596678 (https://phabricator.wikimedia.org/T252771) [16:12:37] (03CR) 10Kormat: [C: 03+2] Rename 'mysql' module to 'mysql_legacy' [software/spicerack] - 10https://gerrit.wikimedia.org/r/596638 (owner: 10Kormat) [16:13:28] (03PS3) 10Elukey: Add role::druid::analytics::worker to an-druid100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/596678 (https://phabricator.wikimedia.org/T252771) [16:13:30] (03PS1) 10Vgutierrez: acme_chief: Handle OCSP Request issues [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596679 (https://phabricator.wikimedia.org/T252881) [16:14:42] (03CR) 10Volans: [C: 04-1] "LGTM, +1. As it depends on Ie7785f9b66edb069df1debd730c1b5caae7300aa being actually depoyed voting -1 for now as a blocker." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/596639 (owner: 10Kormat) [16:16:26] (03CR) 10jerkins-bot: [V: 04-1] acme_chief: Handle OCSP Request issues [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596679 (https://phabricator.wikimedia.org/T252881) (owner: 10Vgutierrez) [16:17:02] herron: thanks for the ping. Wasn't aware of this being a problem/alerted on. Just closed the session [16:17:17] jayme: great thank you! [16:17:40] (03CR) 10Ayounsi: [C: 03+1] "nit: add it to the Parameters comment list." 
[puppet] - 10https://gerrit.wikimedia.org/r/596660 (owner: 10CDanis) [16:18:03] (03CR) 10Ayounsi: [C: 03+1] fastnetmon: fix comment mispaste [puppet] - 10https://gerrit.wikimedia.org/r/596659 (owner: 10CDanis) [16:18:27] (03CR) 10Ayounsi: [C: 03+1] samplicator: add service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/596643 (owner: 10CDanis) [16:18:56] (03Merged) 10jenkins-bot: Rename 'mysql' module to 'mysql_legacy' [software/spicerack] - 10https://gerrit.wikimedia.org/r/596638 (owner: 10Kormat) [16:23:40] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Adithyak1997) I have received around 100 mails within 30 minutes today [16:24:48] (03CR) 10Volans: Update cookbooks for 'mysql' -> 'mysql_legacy' rename. (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/596639 (owner: 10Kormat) [16:24:50] (03PS2) 10Vgutierrez: acme_chief: Handle OCSP Request issues [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596679 (https://phabricator.wikimedia.org/T252881) [16:24:53] (03PS1) 10Vgutierrez: tests: use unittest.mock instead of the 3rd party mock module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596680 [16:26:50] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Adithyak1997) >>! In T232417#6140790, @Adithyak1997 wrote: > I have received around 100 mails within 30 minutes today. Does this have to do anything with T61731? [16:28:51] ACKNOWLEDGEMENT - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Vgutierrez T252881 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:51] ACKNOWLEDGEMENT - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 0 processes with args acme-chief-backend Vgutierrez T252881 https://wikitech.wikimedia.org/wiki/Acme-chief [16:30:11] (03CR) 10Vgutierrez: [C: 03+2] tests: use unittest.mock instead of the 3rd party mock module [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596680 (owner: 10Vgutierrez) [16:32:11] (03PS1) 10Papaul: Partman: Add thanos-be200[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/596681 (https://phabricator.wikimedia.org/T251634) [16:32:14] (03PS1) 10Dzahn: delete role::simplelamp [puppet] - 10https://gerrit.wikimedia.org/r/596682 (https://phabricator.wikimedia.org/T252190) [16:32:30] (03CR) 10Cwhite: "This is awesome! Thank you! Some questions and suggestions inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [16:33:01] (03CR) 10Papaul: [C: 03+2] Partman: Add thanos-be200[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/596681 (https://phabricator.wikimedia.org/T251634) (owner: 10Papaul) [16:37:14] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:23] (03CR) 10Andrew Bogott: [C: 03+1] delete role::simplelamp [puppet] - 10https://gerrit.wikimedia.org/r/596682 (https://phabricator.wikimedia.org/T252190) (owner: 10Dzahn) [16:39:49] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-be2001.codfw.wmnet ` The log can be fou... 
[16:43:33] 10Puppet, 10Cloud-VPS, 10serviceops, 10Patch-For-Review, and 2 others: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (10Dzahn) a:03Dzahn [16:44:28] 10Puppet, 10Cloud-VPS, 10serviceops, 10Patch-For-Review, and 2 others: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (10Dzahn) This is basically done after we merge the above. The role has been removed or replaced with role::sim... [16:45:04] (03PS1) 10Dzahn: ci::worker_localhost: replace apache with httpd module [puppet] - 10https://gerrit.wikimedia.org/r/596687 (https://phabricator.wikimedia.org/T252190) [16:45:24] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:32] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-be2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['thanos-be2001.codfw.wmnet'] ` [16:47:26] (03CR) 10Dzahn: "This is included in role/manifests/ci/slave/labs.pp, role/manifests/ci/slave/labs/docker.pp, role/manifests/ci/slave/labs/pipelinebuilder." [puppet] - 10https://gerrit.wikimedia.org/r/596687 (https://phabricator.wikimedia.org/T252190) (owner: 10Dzahn) [16:49:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:17] (03PS1) 10Dzahn: deployment::server: replace apache module with httpd module [puppet] - 10https://gerrit.wikimedia.org/r/596692 (https://phabricator.wikimedia.org/T252190) [16:52:02] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:44] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [16:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:52] (03PS1) 10Dzahn: delete the apache module, replaced by httpd [puppet] - 10https://gerrit.wikimedia.org/r/596694 (https://phabricator.wikimedia.org/T252190) [16:58:18] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [16:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:16] (03PS1) 10Krinkle: Enable $wgResourceLoaderUseObjectCacheForDeps for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596696 (https://phabricator.wikimedia.org/T113916) [17:00:20] (03CR) 10Dzahn: "running puppet compiler on "C:apache" confirms only deploy1001 uses it: https://puppet-compiler.wmflabs.org/compiler1002/22555/" [puppet] - 10https://gerrit.wikimedia.org/r/596694 (https://phabricator.wikimedia.org/T252190) (owner: 10Dzahn) [17:00:52] !log depooled wqds1007 in preparation for impending wdqs data xfer [17:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:47] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:00] PROBLEM - netbox HTTPS on netbox1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Netbox [17:02:11] hrm [17:02:52] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [17:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:57] (03CR) 10Krinkle: [C: 03+2] Enable 
$wgResourceLoaderUseObjectCacheForDeps for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596696 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [17:03:55] (03Merged) 10jenkins-bot: Enable $wgResourceLoaderUseObjectCacheForDeps for Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/596696 (https://phabricator.wikimedia.org/T113916) (owner: 10Krinkle) [17:04:48] RECOVERY - netbox HTTPS on netbox1001 is OK: HTTP OK: HTTP/1.1 302 Found - 379 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Netbox [17:05:08] !noop-log Pulled beta change to deploy1001 [17:08:45] (03CR) 10Jbond: "thanks for the quick review updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [17:08:58] (03PS3) 10Jbond: puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) [17:09:37] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: 10Jbond) [17:13:09] (03PS4) 10Jbond: puppetmaster::gitclone: add pre-commit to private repo [puppet] - 10https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) [17:17:20] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [17:18:00] (03CR) 10CDanis: "Thanks for the review! 
Will merge Monday" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549683 (https://phabricator.wikimedia.org/T236744) (owner: 10CDanis) [17:20:05] 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642 (10Dzahn) @EBernhardson You can now use `role::simplelamp2` which replaced this and uses mariadb classes instead of mysql. The "use_apparmor" part only appears in the m... [17:20:25] 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642 (10Dzahn) a:03Dzahn [17:21:06] 10Puppet, 10Cloud-VPS, 10serviceops, 10Patch-For-Review, and 2 others: upgrade simplelamp class (apache -> httpd and mysql -> mariadb) or deprecate it - https://phabricator.wikimedia.org/T215662 (10Dzahn) [17:22:37] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Handle OCSP Request issues [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596679 (https://phabricator.wikimedia.org/T252881) (owner: 10Vgutierrez) [17:22:48] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): convert cloud VPS projects from apache to httpd module - https://phabricator.wikimedia.org/T202574 (10Dzahn) a:03Dzahn [17:23:30] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): convert cloud VPS projects from apache to httpd module - https://phabricator.wikimedia.org/T202574 (10Dzahn) [17:26:08] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): convert cloud VPS projects from apache to httpd module - https://phabricator.wikimedia.org/T202574 (10Dzahn) As of today, nothing uses role::simplelamp anymore in cloud VPS per openstack-browser: The last 2 projects it has been... 
[17:26:36] 10Operations, 10ops-eqord, 10netops: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 (10ayounsi) From Telia: > Your service was affected by an outage along the transmission path, but the Loss of Signal we saw in Chicago happened after that outage had already started so i... [17:26:51] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (10Aklapper) >>! In T232417#6140796, @Adithyak1997 wrote: > I have received around 100 mails within 30 minutes today. And that is totally fine if it's about bounces for invalid addre... [17:27:48] !log renumber cr2-eqord:xe-0/1/1 to xe-0/1/3 - T221259 [17:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:53] T221259: eqord - ulsfo Telia link down - IC-313592 - https://phabricator.wikimedia.org/T221259 [17:28:26] (03PS1) 10Dzahn: delete role/profile hmmp, duplicate of simplelamp2 [puppet] - 10https://gerrit.wikimedia.org/r/596704 [17:30:21] (03PS1) 10Vgutierrez: Release 0.25 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596705 (https://phabricator.wikimedia.org/T252881) [17:31:58] 10Puppet, 10Cloud-VPS, 10Patch-For-Review: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642 (10Dzahn) 05Open→03Resolved I am closing this as resolved since nothing uses role::simplelamp anymore and there is a replacement for it. 
[17:32:16] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:17] 10Puppet, 10Cloud-VPS: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642 (10Dzahn) [17:32:26] 10Operations, 10Puppet, 10Cloud-VPS: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642 (10Dzahn) [17:34:33] 10Operations, 10DBA, 10Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (10Dzahn) Removed simplelamp from the last 2 remaining projects listed in openstack-browser. I think it's good to go now. Commented on... [17:34:35] hmmm the CI is working? [17:35:07] (03Abandoned) 10Dzahn: wmcs/simplelamp: remove mysql puppetization [puppet] - 10https://gerrit.wikimedia.org/r/495864 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [17:35:27] vgutierrez: yea, works for me [17:35:48] vgutierrez: a bit slow maybe? [17:35:49] ack... I'll wait a little bit more... :) [17:35:55] (03PS1) 10Ayounsi: Move ulsfo-eqord port to cr2-eqord:xe-0/1/3 [homer/public] - 10https://gerrit.wikimedia.org/r/596710 (https://phabricator.wikimedia.org/T221259) [17:36:02] hauskatze: yup.. 
that's probably it :) [17:37:46] 10Operations, 10SRE-Access-Requests: Request for srv/phab/phabricator/bin/bulk make-silent --id * command via SSH for moving tasks quarterly - https://phabricator.wikimedia.org/T251349 (10MBinder_WMF) {F31820004} Public key (generated today) attached :) [17:39:41] (03CR) 10Vgutierrez: [C: 03+2] Release 0.25 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/596705 (https://phabricator.wikimedia.org/T252881) (owner: 10Vgutierrez) [17:40:16] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:43:56] (03PS1) 10Vgutierrez: tests: use unittest.mock instead of the 3rd party mock module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596712 [17:43:58] (03PS1) 10Vgutierrez: acme_chief: Handle OCSP Request issues [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596713 (https://phabricator.wikimedia.org/T252881) [17:44:00] (03PS1) 10Vgutierrez: Release 0.25 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596714 (https://phabricator.wikimedia.org/T252881) [17:44:02] (03PS1) 10Vgutierrez: debian: Add release 0.25 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596715 (https://phabricator.wikimedia.org/T252881) [17:45:23] (03CR) 10Vgutierrez: [C: 03+2] tests: use unittest.mock instead of the 3rd party mock module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596712 (owner: 10Vgutierrez) [17:45:31] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Handle OCSP Request issues [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596713 (https://phabricator.wikimedia.org/T252881) (owner: 10Vgutierrez) [17:45:36] (03CR) 10Vgutierrez: [C: 03+2] Release 0.25 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596714 (https://phabricator.wikimedia.org/T252881) (owner: 
10Vgutierrez) [17:50:47] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.25 to the changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/596715 (https://phabricator.wikimedia.org/T252881) (owner: 10Vgutierrez) [17:55:44] !log upload acme-chief 0.25 to apt.wm.o (buster) - T252881 [17:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:48] T252881: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 [17:57:22] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:26] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1001 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [17:57:33] 10Operations, 10Acme-chief, 10Traffic, 10Patch-For-Review: acme-chief crashes upon OCSP responder errors - https://phabricator.wikimedia.org/T252881 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [18:03:13] (03PS5) 10Krinkle: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [18:03:20] 10Operations, 10Acme-chief, 10Traffic: Let's Encrypt OCSP responders are showing 503 errors - https://phabricator.wikimedia.org/T252901 (10Vgutierrez) [18:03:58] 10Operations, 10Acme-chief, 10Traffic: Let's Encrypt OCSP responders are showing 503 errors - https://phabricator.wikimedia.org/T252901 (10Vgutierrez) p:05Triage→03Medium [18:13:44] (03CR) 10CDanis: [C: 03+2] samplicator: add service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/596643 (owner: 10CDanis) [18:13:53] (03CR) 10CDanis: [C: 03+2] fastnetmon: fix comment mispaste [puppet] - 10https://gerrit.wikimedia.org/r/596659 (owner: 10CDanis) [18:16:17] (03PS3) 10CDanis: pmacct: embiggen 
rcvbuf [puppet] - 10https://gerrit.wikimedia.org/r/596660 [18:18:47] (03CR) 10CDanis: [C: 03+2] pmacct: embiggen rcvbuf [puppet] - 10https://gerrit.wikimedia.org/r/596660 (owner: 10CDanis) [18:23:21] (03CR) 10CDanis: [C: 03+1] swift: migrate off swift::params [puppet] - 10https://gerrit.wikimedia.org/r/596617 (https://phabricator.wikimedia.org/T252537) (owner: 10Filippo Giunchedi) [18:23:51] (03CR) 10Krinkle: [C: 03+2] Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [18:24:35] (03Merged) 10jenkins-bot: Enable $wgResourceLoaderUseObjectCacheForDeps for testwiki/test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591388 (https://phabricator.wikimedia.org/T113916) (owner: 10Aaron Schulz) [18:27:47] * Krinkle testing on mwdebug1001 [18:32:13] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [18:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] !log depooled wdqs2003 while lag catches up [18:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:49] 10Operations, 10LDAP-Access-Requests: LDAP access request - add Christian Aistleitner to "nda" (or "wmf") - https://phabricator.wikimedia.org/T252875 (10KFrancis) @Dzahn, @QChris has an existing NDA on file. Thanks! 
[18:53:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[18:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:02] (PS1) Jforrester: SpecialVersionVersionUrl: Don't use confusing local variable name [mediawiki-config] - https://gerrit.wikimedia.org/r/596726
[19:17:19] (PS1) Papaul: DHCP: Fix MAC address for thanos-be2001 and thanos-be2002 [puppet] - https://gerrit.wikimedia.org/r/596729 (https://phabricator.wikimedia.org/T251634)
[19:28:40] (CR) Papaul: [C: +2] DHCP: Fix MAC address for thanos-be2001 and thanos-be2002 [puppet] - https://gerrit.wikimedia.org/r/596729 (https://phabricator.wikimedia.org/T251634) (owner: Papaul)
[19:38:30] Operations, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` thanos-be2001.codfw.wmnet ` The log can be fou...
[19:46:37] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: If0fd1b51 (duration: 01m 08s)
[19:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:52:03] Operations, Acme-chief, Traffic: Let's Encrypt OCSP responders are showing 503 errors - https://phabricator.wikimedia.org/T252901 (Vgutierrez) Open→Resolved a:Vgutierrez `May 15 19:43:27 acmechief1001 acme-chief-backend[30417]: Refreshing live OCSP response for certificate non-canonical-r...
[19:59:20] Operations, observability: Better manage java updates for ELK7 - https://phabricator.wikimedia.org/T252913 (herron)
[19:59:27] Operations, ops-codfw, DC-Ops, Patch-For-Review: (Need By: TBD) rack/setup/install thanos-be200[1-4] - https://phabricator.wikimedia.org/T251634 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thanos-be2001.codfw.wmnet'] ` Of which those **FAILED**: ` ['thanos-be2001.codfw.wmnet'] `
[20:33:45] !log pooled wdqs2003 and wdqs1007 following successful query tests
[20:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:58] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[20:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:22] https://usercontent.irccloud-cdn.com/file/mLrGtlxb/1589576240.JPG
[20:59:22] (PS1) Krinkle: Move wgResourceLoaderUseObjectCacheForDeps to IS (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/596750
[20:59:24] (PS1) Krinkle: Move wgResourceLoaderUseObjectCacheForDeps to IS (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/596751
[21:06:25] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/596640 (owner: Jbond)
[21:11:13] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/596617 (https://phabricator.wikimedia.org/T252537) (owner: Filippo Giunchedi)
[21:24:53] (CR) Cwhite: [C: +1] puppetmaster::gitclone: add pre-commit to private repo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/596649 (https://phabricator.wikimedia.org/T251247) (owner: Jbond)
[21:27:19] (PS1) Andrew Bogott: nova-compute: set ceph nodes to use CPU features available on all cloudvirts [puppet] - https://gerrit.wikimedia.org/r/596762 (https://phabricator.wikimedia.org/T225320)
[21:38:26] (CR) CDanis: [C: +2] Move ulsfo-eqord port to cr2-eqord:xe-0/1/3 [homer/public] - https://gerrit.wikimedia.org/r/596710
(https://phabricator.wikimedia.org/T221259) (owner: Ayounsi)
[21:38:45] (Merged) jenkins-bot: Move ulsfo-eqord port to cr2-eqord:xe-0/1/3 [homer/public] - https://gerrit.wikimedia.org/r/596710 (https://phabricator.wikimedia.org/T221259) (owner: Ayounsi)
[21:40:14] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[21:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:55] !log depooled wdqs2007 while it catches up on lag
[21:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:46] (CR) Bstorm: [C: +1] "Oooh, that sucks." [puppet] - https://gerrit.wikimedia.org/r/596762 (https://phabricator.wikimedia.org/T225320) (owner: Andrew Bogott)
[21:49:46] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 144 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:55:40] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:26:55] (PS1) Jbond: delete old repo [debs/mcrouter] - https://gerrit.wikimedia.org/r/596776
[22:32:20] (PS2) Jbond: docker build: update the build process to us docker [debs/mcrouter] - https://gerrit.wikimedia.org/r/596776
[22:34:41] (Abandoned) Jbond: docker build: update the build process to us docker [debs/mcrouter] - https://gerrit.wikimedia.org/r/596776 (owner: Jbond)
[22:36:36] (PS1) Jbond: clean out old repo [debs/mcrouter] - https://gerrit.wikimedia.org/r/596778
[22:36:38] (PS1) Jbond: docker build: update the build process to us docker [debs/mcrouter] -
https://gerrit.wikimedia.org/r/596779
[22:38:19] (PS2) Jbond: docker build: update the build process to us docker [debs/mcrouter] - https://gerrit.wikimedia.org/r/596779
[22:38:36] (PS3) Jbond: docker build: update the build process to us docker [debs/mcrouter] - https://gerrit.wikimedia.org/r/596779
[22:41:00] (CR) Krinkle: [C: +2] Move wgResourceLoaderUseObjectCacheForDeps to IS (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/596750 (owner: Krinkle)
[22:42:09] (Merged) jenkins-bot: Move wgResourceLoaderUseObjectCacheForDeps to IS (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/596750 (owner: Krinkle)
[22:43:54] * Krinkle staging on mwdebug1001
[22:50:53] (CR) Krinkle: [C: +2] Move wgResourceLoaderUseObjectCacheForDeps to IS (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/596751 (owner: Krinkle)
[22:51:40] (Merged) jenkins-bot: Move wgResourceLoaderUseObjectCacheForDeps to IS (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/596751 (owner: Krinkle)
[22:51:48] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Iaa240eb8cf9 (duration: 01m 06s)
[22:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:11] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: I1b1578a57ef5 (duration: 01m 07s)
[22:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:18] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:04:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:33:40] !og Pooled wdqs2007 following successful query tests (all data transfers are done now)
[23:34:38] ryankemper: you forgot a l in that log
[23:34:59] lol I would have never noticed that
[23:35:01] thanks
[23:35:07] !log Pooled wdqs2007 following successful query tests (all data transfers are done now)
[23:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:04] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[23:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:06] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:43:06] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[23:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:06] Operations, Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (Reedy)
[23:46:40] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:46:40] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[23:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:41] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[23:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:57] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[23:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:37] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[23:50:38] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:24] (PS1) CRusnov: hiera backends: Comment out special Netbox case [puppet] - https://gerrit.wikimedia.org/r/596787
[23:55:51] (CR) CRusnov: "Hello! Thanks to Cole we seem to have tracked down the culprit for the little brain teaser we were looking at. This CR comments the specia" [puppet] - https://gerrit.wikimedia.org/r/596787 (owner: CRusnov)