[00:22:05] Operations, MediaWiki-Cache, Performance-Team, Patch-For-Review, User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (aaron) a:aaron→None
[00:53:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:15:59] (PS1) CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - https://gerrit.wikimedia.org/r/493774
[01:16:36] (CR) jerkins-bot: [V: -1] Add system timer for running ganeti->netbox sync. [puppet] - https://gerrit.wikimedia.org/r/493774 (owner: CRusnov)
[01:20:56] (PS2) CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229)
[01:22:24] (CR) jerkins-bot: [V: -1] Add system timer for running ganeti->netbox sync. [puppet] - https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229) (owner: CRusnov)
[01:27:40] (PS3) CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229)
[01:38:57] (PS4) CRusnov: Add system timer for running ganeti->netbox sync. [puppet] - https://gerrit.wikimedia.org/r/493774 (https://phabricator.wikimedia.org/T215229)
[04:54:40] Hi, gerrit is slow
[04:55:35] Any oppers around? Load looks higher than normal (which happened the last time gerrit slowed down)
[05:09:24] Intermittent slowness
[05:18:45] I’ll file a task as I have to go
[05:33:28] Operations, Gerrit: Intermittent slowness on grrrit - https://phabricator.wikimedia.org/T217457 (Paladox)
[06:08:25] Operations, ops-eqiad: Update several hosts status in Netbox - https://phabricator.wikimedia.org/T217429 (Marostegui)
[06:12:50] Operations, Gerrit: Intermittent slowness on grrrit - https://phabricator.wikimedia.org/T217457 (Marostegui) Something changed the 28th at around 20:30: https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&orgId=1&var-server=cobalt&var-datasource=eqiad%20prometheus%2Fops&var-cluster=misc&pa...
[06:13:07] Operations, Gerrit: Intermittent slowness on gerrit - https://phabricator.wikimedia.org/T217457 (Marostegui)
[06:13:49] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:14:57] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.8-7-g780f1f2b91 (SSHD-CORE-1.6.0) (protocol 2.0)
[06:28:08] Operations, Gerrit: Intermittent slowness on gerrit - https://phabricator.wikimedia.org/T217457 (Marostegui) ` PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 30290 gerrit2 20 0 31.152g 0.022t 256348 S 1058 70.4 14169:44 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx...
[06:31:37] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf]
[06:57:33] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:25:21] PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:13] RECOVERY - Host labstore1006 is UP: PING WARNING - Packet loss = 28%, RTA = 36.35 ms
[09:31:51] PROBLEM - NFS on labstore1006 is CRITICAL: connect to address 208.80.154.7 and port 2049: Connection refused
[09:43:55] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.086 second response time
[11:04:27] PROBLEM - DPKG on labvirt1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:05:39] RECOVERY - DPKG on labvirt1001 is OK: All packages OK
[11:56:47] RECOVERY - NFS on labstore1006 is OK: TCP OK - 0.036 second response time on 208.80.154.7 port 2049
[12:07:49] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.079 second response time
[12:12:18] !log labstore1006 started nfsd T217473
[12:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:23] T217473: labstore1006 spontaneous reboot - https://phabricator.wikimedia.org/T217473
[13:07:23] (PS13) Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931)
[13:07:31] (PS4) Daimona Eaytoy: Remove $wgAbuseFilterRuntimeProfile [mediawiki-config] - https://gerrit.wikimedia.org/r/486470 (https://phabricator.wikimedia.org/T191039)
[13:07:38] (PS14) Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772
[13:08:06] (CR) jerkins-bot: [V: -1] Update AbuseFilter config to keep the status quo [mediawiki-config] - https://gerrit.wikimedia.org/r/475772 (owner: Daimona Eaytoy)
[13:25:57] (PS1) GTirloni: openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040)
[13:26:41] (CR) jerkins-bot: [V: -1] openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: GTirloni)
[13:32:23] (PS2) GTirloni: openstack: Automatically start/stop VMs on hypervisor boot/shutdown [puppet] - https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040)
[13:34:19] (CR) GTirloni: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler1002/14948/" [puppet] - https://gerrit.wikimedia.org/r/493807 (https://phabricator.wikimedia.org/T216040) (owner: GTirloni)
[13:48:31] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:14:27] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:50:34] Operations, Gerrit: Intermittent slowness on gerrit - https://phabricator.wikimedia.org/T217457 (Paladox) Thank you @Marostegui! The load/cpu has gone down now.
[15:21:07] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:21:21] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:21:27] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:21:35] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:21:51] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:21:51] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:23:33] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[15:24:55] RECOVERY - Disk space on notebook1003 is OK: DISK OK
[15:24:58] restarted nagios etc..
[15:25:03] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[15:25:09] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[15:25:27] RECOVERY - DPKG on notebook1003 is OK: All packages OK
[15:25:27] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[15:25:55] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[15:28:47] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures
[17:07:33] Operations, Domains, Traffic: Redirecting incoming queries to non-existent subpages (due to Godaddy behavior on some external WikiJournal sites) - https://phabricator.wikimedia.org/T212914 (Aklapper)
[17:13:56] Operations, Packaging: Upgrade php5-json .deb to at least 1.3.8 - https://phabricator.wikimedia.org/T160101 (Aklapper) @EBernhardson: As we don't use PHP 5.x anymore, can this be closed? Or not?
[17:14:56] Operations, Phabricator, serviceops, Patch-For-Review, Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (Aklapper)
[18:11:39] PROBLEM - configured eth on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:11:49] PROBLEM - Disk space on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:12:03] PROBLEM - puppet last run on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:12:07] PROBLEM - dhclient process on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:12:11] PROBLEM - Check systemd state on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:12:31] PROBLEM - MD RAID on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:12:31] PROBLEM - DPKG on notebook1003 is CRITICAL: connect to address 10.64.21.109 port 5666: Connection refused
[18:16:29] RECOVERY - configured eth on notebook1003 is OK: OK - interfaces up
[18:16:57] RECOVERY - dhclient process on notebook1003 is OK: PROCS OK: 0 processes with command name dhclient
[18:17:01] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational
[18:17:15] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[18:17:21] RECOVERY - MD RAID on notebook1003 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[18:17:21] RECOVERY - DPKG on notebook1003 is OK: All packages OK
[18:18:43] unexpected restart or nrpe dying?
[18:35:14] sigh [Sat Mar 2 18:33:11 2019] Out of memory: Kill process 23303 (python3) score 155 or sacrifice child
[18:35:31] next week I'll dedicate some time to tune the cgroups
[18:35:37] it is getting really annoying
[18:35:49] in theory the notebooks are meant to avoid this kind of churning
[18:36:07] they should be used only as a tool for Hadoop computations
[18:38:03] PROBLEM - puppet last run on an-worker1095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:48:02] roger at least it's not like anything critical
[18:48:46] nono stat/notebooks are hosts for users, if they break it is usually somebody doing heavy computations
[18:49:16] I have to play with systemd slices and see if I can find a good model
[18:49:51] (people can use as much resource as needed for crunching data, but others are able to log in and possibly do lightweight things like read files, launch spark jobs on hadoop, etc..)
[18:50:07] (and of course no OOM killer party :)
[18:56:23] RECOVERY - Disk space on notebook1003 is OK: DISK OK
[19:09:11] RECOVERY - puppet last run on an-worker1095 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:25:18] Operations, Packaging: Upgrade php5-json .deb to at least 1.3.8 - https://phabricator.wikimedia.org/T160101 (EBernhardson) Open→Declined
[19:25:50] party time at oom killer's house
[19:25:58] that's the name of my linux themed slasher movie
[19:27:39] lol
[19:37:49] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:47:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:53:49] Operations, Mail, Phabricator: DomainKeys Identified Mail (DKIM) for phabricator.wikimedia.org - https://phabricator.wikimedia.org/T116805 (Aklapper) >>! In T116805#4965322, @Niedzielski wrote: > This is only a single datapoint but I noticed a Phab comment email notification from gerritbot was mistak...
[20:55:48] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (akosiaris) >>! In T213976#4971554, @Ladsgroup wrote: > In general it would be great if the storage would be decoupl...
[21:02:56] (PS2) Alexandros Kosiaris: Add citoid specific statsd mappings [deployment-charts] - https://gerrit.wikimedia.org/r/493669 (https://phabricator.wikimedia.org/T213194)
[21:02:58] (PS2) Alexandros Kosiaris: Publish citoid 0.0.2 version [deployment-charts] - https://gerrit.wikimedia.org/r/493670 (https://phabricator.wikimedia.org/T213194)
[21:41:32] Operations, Analytics, Discovery, Research: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ladsgroup) >>! In T213976#4995838, @akosiaris wrote: > The idea is fine (I've been having the same for over 1 year...
[22:36:32] Operations, Gerrit, Release-Engineering-Team: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497 (Paladox)
[22:36:39] Operations, Gerrit, Release-Engineering-Team: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497 (Paladox) p:Triage→High
[22:38:01] (PS1) Paladox: gerrit: Disable jgit gc [puppet] - https://gerrit.wikimedia.org/r/493963
[22:39:34] (PS2) Paladox: gerrit: Disable jgit gc [puppet] - https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497)
[22:43:18] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Disable jgit gc on gerrit - https://phabricator.wikimedia.org/T217497 (Paladox)
[22:44:01] (PS3) Krinkle: Re-instate "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/480055 (owner: Reedy)
[22:44:53] (CR) Krinkle: [C: +1] Re-instate "Turn wikimedia.org docroot into symlink to standard-docroot" [mediawiki-config] - https://gerrit.wikimedia.org/r/480055 (owner: Reedy)
[22:45:26] (PS3) Paladox: gerrit: Disable jgit gc [puppet] - https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497)
[22:45:31] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497) (owner: Paladox)
[22:45:48] (CR) Krinkle: [C: +1] "Per 0d0bbbe22a6f8df, I guess we need to look on some servers to make sure the submodule that used to be there is no longer existent locall" [mediawiki-config] - https://gerrit.wikimedia.org/r/480055 (owner: Reedy)
[22:52:23] (PS4) Paladox: gerrit: Disable jgit gc [puppet] - https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497)
[22:52:28] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/493963 (https://phabricator.wikimedia.org/T217497) (owner: Paladox)