[00:12:09] <wikibugs>	 (PS1) Cwhite: prometheus: disable shipped node-exporter ipmitool and smartmon timers [puppet] - https://gerrit.wikimedia.org/r/492408 (https://phabricator.wikimedia.org/T213708)
[00:35:24] <wikibugs>	 Operations, Scap, Patch-For-Review, Release-Engineering-Team (Watching / External): Upgrade scap debian package to 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (thcipriani) Resolved→Open Title of task was a bit ambiguous. To release the new version of scap, we still need to merge t...
[00:35:39] <wikibugs>	 Operations, Scap, Patch-For-Review, Release-Engineering-Team (Watching / External): Deploy scap 3.9.0-1 - https://phabricator.wikimedia.org/T216666 (thcipriani)
[01:02:50] <wikibugs>	 Operations, Proton, Core Platform Team Backlog (Watching / External), Reading-Infrastructure-Team-Backlog (Kanban), Services (watching): Proton fails with Chromium 72.0.3626.96 - https://phabricator.wikimedia.org/T216493 (Tgr) There was some discussion on this in {T213366}.  Yeah we could pin...
[01:15:35] <wikibugs>	 Operations, serviceops, MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), Patch-For-Review, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Joe) >>! In T216676#4976730, @Krinkle wrote: > @Joe I understand the choice between VCL in Varnish and client-side...
[01:24:35] <wikibugs>	 Operations, serviceops, MW-1.33-notes (1.33.0-wmf.19; 2019-02-26), Patch-For-Review, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Krinkle) >>! In T216676#4977454, @Joe wrote: > This would be correct if we didn't separate the cache, but we're se...
[01:46:58] <icinga-wm>	 PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:13:04] <icinga-wm>	 RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[02:27:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 37457272 and 1 seconds
[02:29:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6728 and 45 seconds
[03:52:40] <icinga-wm>	 PROBLEM - HHVM rendering on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:53:44] <icinga-wm>	 RECOVERY - HHVM rendering on mw1344 is OK: HTTP OK: HTTP/1.1 200 OK - 74802 bytes in 0.306 second response time
[04:28:09] <wikibugs>	 (Abandoned) Gergő Tisza: Make nutcracker's auto_eject_hosts setting configurable [puppet] - https://gerrit.wikimedia.org/r/249222 (https://phabricator.wikimedia.org/T109173) (owner: Gergő Tisza)
[04:48:20] <wikibugs>	 (PS1) Zoranzoki21: Remove enwikivoyage from mobilemainpagelegacy.dblist per T216863 [mediawiki-config] - https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863)
[04:57:57] <wikibugs>	 (PS2) Zoranzoki21: Disable MFSpecialCaseMainPage for enwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/492430 (https://phabricator.wikimedia.org/T216863)
[04:59:09] <wikibugs>	 (PS1) Zoranzoki21: Disable MFSpecialCaseMainPage for srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/492432 (https://phabricator.wikimedia.org/T216865)
[06:17:17] <wikibugs>	 Operations, User-herron: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624 (Aklapper) >>! In T197624#4917948, @Dzahn wrote: >>>! In T197624#4916933, @Aklapper wrote: >> @Dzahn: Which exact type of notifications are you referring to? If it's from Phab itself: `@Ph...
[06:28:12] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:12] <icinga-wm>	 PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.158 second response time
[06:36:48] <icinga-wm>	 RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.659 second response time
[06:37:04] <icinga-wm>	 RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:47:38] <wikibugs>	 (PS1) Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765)
[06:48:45] <wikibugs>	 (PS1) Zoranzoki21: Added and subdomains of mehrnews.com to wgCopyUploadDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/492448 (https://phabricator.wikimedia.org/T213961)
[06:49:50] <wikibugs>	 (PS2) Sau226: Restore bureaucrat rights on hi.wiktionary to default [mediawiki-config] - https://gerrit.wikimedia.org/r/492447 (https://phabricator.wikimedia.org/T214765)
[07:54:50] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:55:28] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:32] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:18:50] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:19:32] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:04] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:19:42] <wikibugs>	 (PS3) Daimona Eaytoy: Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039)
[09:19:49] <wikibugs>	 (PS12) Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931)
[09:24:27] <wikibugs>	 (PS13) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772
[09:24:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039) (owner: Daimona Eaytoy)
[09:24:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy)
[09:38:34] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[09:58:02] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:30:54] <wikibugs>	 (PS9) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[10:31:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: Mathew.onipe)
[11:02:59] <wikibugs>	 (PS10) Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921)
[11:20:50] <wikibugs>	 Operations, Thumbor, Wikimedia-SVG-rendering: Incorrect text positioning in SVG rasterization (scale/transform; font-size; kerning) - https://phabricator.wikimedia.org/T36947 (JoKalliauer)
[14:12:19] <wikibugs>	 (PS3) Mathew.onipe: tlsproxy: add prometheus option [puppet] - https://gerrit.wikimedia.org/r/491972 (https://phabricator.wikimedia.org/T216681)
[15:20:46] <icinga-wm>	 PROBLEM - grafana-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.073 second response time
[15:26:50] <icinga-wm>	 RECOVERY - grafana-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 14007 bytes in 0.105 second response time
[15:37:08] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga2001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[15:37:12] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:39:40] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:40:50] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[17:51:06] <icinga-wm>	 PROBLEM - SSH on labvirt1008 is CRITICAL: connect to address 10.64.20.17 and port 22: Connection refused
[17:51:23] <icinga-wm>	 PROBLEM - nova-compute proc maximum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:51:35] <icinga-wm>	 PROBLEM - kvm ssl cert on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:51:36] <icinga-wm>	 PROBLEM - nova-compute proc minimum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:02] <icinga-wm>	 PROBLEM - dhclient process on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:02] <icinga-wm>	 PROBLEM - configured eth on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:08] <icinga-wm>	 PROBLEM - Disk space on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:20] <icinga-wm>	 PROBLEM - DPKG on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:23] <icinga-wm>	 PROBLEM - ensure kvm processes are running on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:23] <icinga-wm>	 PROBLEM - puppet last run on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused
[17:52:37] <chaomodus>	 wuh oh
[17:52:50] <chaomodus>	 oh it's just nrpe dying
[17:52:58] <marostegui>	 dead server?
[17:52:59] <marostegui>	 ah
[17:53:14] <icinga-wm>	 PROBLEM - NTP on labvirt1008 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:53:48] <chaomodus>	 oh perhaps not
[17:53:49] <andrewbogott>	 It's ok, that host is about to be rebuilt anyway.  I put it in downtime
[17:53:59] <chaomodus>	 it's actually down
[17:54:01] <chaomodus>	 okay
[17:54:03] <chaomodus>	 cool :)
[17:54:03] <andrewbogott>	 I suspect that it's switched off and an existing downtime expired just now
[17:54:07] <marostegui>	 andrewbogott: Maybe worth checking HW logs anyways, just in case
[17:54:51] <marostegui>	 If andrewbogott is on it I will go back to my Saturday :-)
[17:55:04] <jijiki>	 go go
[17:55:10] <andrewbogott>	 yeah — the host is definitely idle anyway but I'll see what I can do to shut things up
[17:55:38] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - Long running screen/tmux on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - NTP on labvirt1008 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on labvirt1008 is CRITICAL: connect to address 10.64.20.17 and port 22: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:44] <icinga-wm>	 ACKNOWLEDGEMENT - configured eth on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:45] <icinga-wm>	 ACKNOWLEDGEMENT - dhclient process on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:46] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:47] <icinga-wm>	 ACKNOWLEDGEMENT - kvm ssl cert on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:48] <icinga-wm>	 ACKNOWLEDGEMENT - nova-compute proc maximum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:49] <icinga-wm>	 ACKNOWLEDGEMENT - nova-compute proc minimum on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:49] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on labvirt1008 is CRITICAL: connect to address 10.64.20.17 port 5666: Connection refused andrew bogott This is an idle host, no need for immediate response. T216661
[17:55:53] <jijiki>	 hahahha
[17:55:55] <marostegui>	 bye guys :)
[17:56:01] <andrewbogott>	 ah, shit, that ack is going to send another 12 pages :(
[17:56:32] <marostegui>	 only 1 I think
[17:56:42] <marostegui>	 anyways o/
[17:56:57] * volans|off late to the party
[17:58:01] * volans|off likes acknowledgments, they let me know the problem is taken care of
[17:58:46] * volans|off back off, ping/page/call if needed
[18:06:34] <icinga-wm>	 PROBLEM - Host labvirt1008 is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:39] <andrewbogott>	 oh, good job icinga!  What do you think 'downtime for five years' is supposed to mean?
[18:11:43] <icinga-wm>	 RECOVERY - kvm ssl cert on labvirt1008 is OK: Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 30 days.
[18:11:44] <icinga-wm>	 RECOVERY - nova-compute proc minimum on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[18:11:54] <icinga-wm>	 RECOVERY - Host labvirt1008 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms
[18:12:21] <icinga-wm>	 RECOVERY - Disk space on labvirt1008 is OK: DISK OK
[18:12:22] <icinga-wm>	 RECOVERY - dhclient process on labvirt1008 is OK: PROCS OK: 0 processes with command name dhclient
[18:12:22] <icinga-wm>	 RECOVERY - configured eth on labvirt1008 is OK: OK - interfaces up
[18:12:36] <icinga-wm>	 RECOVERY - DPKG on labvirt1008 is OK: All packages OK
[18:12:42] <icinga-wm>	 RECOVERY - SSH on labvirt1008 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10 (protocol 2.0)
[18:12:59] <icinga-wm>	 RECOVERY - nova-compute proc maximum on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute
[18:13:26] <icinga-wm>	 RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[18:23:20] <icinga-wm>	 RECOVERY - NTP on labvirt1008 is OK: NTP OK: Offset -0.0004088878632 secs
[19:09:09] <icinga-wm>	 PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[19:14:09] <icinga-wm>	 RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[19:24:36] <marostegui>	 andrewbogott: why not just disable notifications?
[19:24:48] <marostegui>	 so it doesn't matter if the downtime expires?
[19:25:04] <marostegui>	 We normally do that with hosts that are under (long) maintenance
[19:50:03] <wikibugs>	 (PS7) MarcoAurelio: Initial configuration for hyw.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597)
[23:59:41] <wikibugs>	 (PS1) Paladox: wmflib: Migrate ini.rb, ordered_yaml.rb and php_ini.rb to modern puppet custom functions [puppet] - https://gerrit.wikimedia.org/r/492518