[00:12:09] (PS1) Cwhite: prometheus: disable shipped node-exporter ipmitool and smartmon timers [puppet] - https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708)
[00:35:24] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Watching / External): Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (thcipriani) Resolved→Open Title of task was a bit ambiguous. To release the new version of scap, we still need to merge t...
[00:35:39] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Watching / External): Deploy scap 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (thcipriani)
[01:02:50] Operations, Proton, Core Platform Team Backlog (Watching / External), Reading-Infrastructure-Team-Backlog (Kanban), Services (watching): Proton fails with Chromium 72.0.3626.96 - https://phabricator.wikimedia.org/T216493 (Tgr) There was some discussion on this in {T213366}. Yeah we could pin...
[01:15:35] Operations, serviceops, MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), Patch-For-Review, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Joe) >>! In T216676#4976730, @Krinkle wrote: > @Joe I understand the choice between VCL in Varnish and client-side...
[01:24:35] Operations, serviceops, MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), Patch-For-Review, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Krinkle) >>! In T216676#4977454, @Joe wrote: > This would be correct if we didn't separate the cache, but we're se...
[01:46:58] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:13:04] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[02:27:28] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37457272 and 1 seconds
[02:29:56] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6728 and 45 seconds
[03:52:40] PROBLEM - HHVM rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:53:44] RECOVERY - HHVM rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 74802 bytes in 0.306 second response time
[04:28:09] (Abandoned) Gergő Tisza: Make nutcracker's auto_eject_hosts setting configurable [puppet] - https://gerrit.wikimedia.org/r/249222 (https://phabricator.wikimedia.org/T109173) (owner: Gergő Tisza)
[04:48:20] (PS1) Zoranzoki21: Remove enwikivoyage from mobilemainpagelegacy.dblist per T216863 [mediawiki-config] - https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863)
[04:57:57] (PS2) Zoranzoki21: Disable MFSpecialCaseMainPage for enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863)
[04:59:09] (PS1) Zoranzoki21: Disable MFSpecialCaseMainPage for srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/492432 (https://phabricator.wikimedia.org/T216865)
[06:17:17] Operations, User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (Aklapper) >>! In T197624#4917948, @Dzahn wrote: >>>! In T197624#4916933, @Aklapper wrote: >> @Dzahn: Which exact type of notifications are you referring to? If it's from Phab itself: `@Ph...
[06:28:12] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:12] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.158 second response time
[06:36:48] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.659 second response time
[06:37:04] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:47:38] (PS1) Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765)
[06:48:45] (PS1) Zoranzoki21: Added and subdomains of mehrnews.com to wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961)
[06:49:50] (PS2) Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765)
[07:54:50] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:55:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:18:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:32] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:04] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:42] (PS3) Daimona Eaytoy: Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039)
[09:19:49] (PS12) Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931)
[09:24:27] (PS13) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772
[09:24:48] (CR) jerkins-bot: [V: -1] Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy)
[09:24:50] (CR) jerkins-bot: [V: -1] Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy)
[09:38:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:58:02] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:30:54] (PS9) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[10:31:34] (CR) jerkins-bot: [V: -1] cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[11:02:59] (PS10) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[11:20:50] Operations, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (JoKalliauer)
[14:12:19] (PS3) Mathew.onipe: tlsproxy: add prometheus option [puppet] - https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681)
[15:20:46] PROBLEM - grafana-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time
[15:26:50] RECOVERY - grafana-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 14007 bytes in 0.105 second response time
[15:37:08] PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[15:37:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:39:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:40:50] RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:51:06] PROBLEM - SSH on labvirt1008 is CRITICAL: connect to address 10.64.20.17 and port 22: Connection refused
[17:51:23] PROBLEM - nova-compute proc maximum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:51:35] PROBLEM - kvm ssl cert on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:51:36] PROBLEM - nova-compute proc minimum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:02] PROBLEM - dhclient process on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:02] PROBLEM - configured eth on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:08] PROBLEM - Disk space on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:20] PROBLEM - DPKG on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:23] PROBLEM - ensure kvm processes are running on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:23] PROBLEM - puppet last run on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:37] wuh oh
[17:52:50] oh it's just nrpe dying
[17:52:58] dead server?
[17:52:59] ah
[17:53:14] PROBLEM - NTP on labvirt1008 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:48] oh perhaps not
[17:53:49] It's ok, that host is about to be rebuilt anyway. I put it in downtime
[17:53:59] it's actually down
[17:54:01] okay
[17:54:03] cool :)
[17:54:03] I suspect that it's switched off and an existing downtime expired just now
[17:54:07] andrewbogott: Maybe worth checking HW logs anyways, just in case
[17:54:51] If andrewbogott is on it I will go back to my Saturday :-)
[17:55:04] go go
[17:55:10] yeah, the host is definitely idle anyway but I'll see what I can do to shut things up
[17:55:38] ACKNOWLEDGEMENT - DPKG on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - Disk space on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - HP RAID on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - IPMI Sensor Status on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - Long running screen/tmux on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - NTP on labvirt1008 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - SSH on labvirt1008 is CRITICAL: connect to address 10.64.20.17 and port 22: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] ACKNOWLEDGEMENT - configured eth on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:45] ACKNOWLEDGEMENT - dhclient process on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:46] ACKNOWLEDGEMENT - ensure kvm processes are running on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:47] ACKNOWLEDGEMENT - kvm ssl cert on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:48] ACKNOWLEDGEMENT - nova-compute proc maximum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:49] ACKNOWLEDGEMENT - nova-compute proc minimum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:49] ACKNOWLEDGEMENT - puppet last run on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:53] hahahha
[17:55:55] bye guys :)
[17:56:01] ah, shit, that ack is going to send another 12 pages :(
[17:56:32] only 1 I think
[17:56:42] anyways o/
[17:56:57] * volans|off late to the party
[17:58:01] * volans|off likes acknowledgments, let me know the problem is taken care of
[17:58:46] * volans|off back off, ping/page/call if needed
[18:06:34] PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:39] oh, good job icinga! What do you think 'downtime for five years' is supposed to mean?
[18:11:43] RECOVERY - kvm ssl cert on labvirt1008 is OK: Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 30 days.
[18:11:44] RECOVERY - nova-compute proc minimum on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[18:11:54] RECOVERY - Host labvirt1008 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms
[18:12:21] RECOVERY - Disk space on labvirt1008 is OK: DISK OK
[18:12:22] RECOVERY - dhclient process on labvirt1008 is OK: PROCS OK: 0 processes with command name dhclient
[18:12:22] RECOVERY - configured eth on labvirt1008 is OK: OK - interfaces up
[18:12:36] RECOVERY - DPKG on labvirt1008 is OK: All packages OK
[18:12:42] RECOVERY - SSH on labvirt1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[18:12:59] RECOVERY - nova-compute proc maximum on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[18:13:26] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[18:23:20] RECOVERY - NTP on labvirt1008 is OK: NTP OK: Offset -0.0004088878632 secs
[19:09:09] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[19:14:09] RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[19:24:36] andrewbogott: why not just disable notifications?
[19:24:48] so it doesn't matter if the downtime expires?
[19:25:04] We normally do that with hosts that are under (long) maintenance
[19:50:03] (PS7) MarcoAurelio: Initial configuration for hyw.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597)
[23:59:41] (PS1) Paladox: wmflib: Migrate ini.rb, ordered_yaml.rb and php_ini.rb to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518