[00:00:07] (03PS1) 10Dzahn: mysql_wmf, protoproxy: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211326 [00:00:34] (03PS2) 10Dzahn: mysql_wmf, protoproxy: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211326 [00:01:43] (03CR) 10Dzahn: [C: 032] puppet,puppet_compiler: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211322 (owner: 10Dzahn) [00:02:26] (03CR) 10Dzahn: [C: 032] kibana, labs_vmbuilder: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211323 (owner: 10Dzahn) [00:03:11] (03CR) 10Dzahn: [C: 032] logstash: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211324 (owner: 10Dzahn) [00:04:07] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [00:05:07] PROBLEM - SSH on labvirt1003 is CRITICAL - Socket timeout after 10 seconds [00:05:16] PROBLEM - RAID on labvirt1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:17] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:38] PROBLEM - configured eth on labvirt1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:08:09] uhm.. libvirt on labvirt being really busy ^ [00:21:12] (03PS1) 10Dzahn: salt, spamassassin: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211329 [00:25:39] (03Abandoned) 10Dzahn: lots of indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/204696 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [00:28:23] (03PS1) 10Dzahn: ganglia: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211333 [00:31:25] (03PS1) 10Dzahn: geoip: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211334 [00:33:02] (03PS13) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [00:33:26] PROBLEM - NTP on labvirt1003 is CRITICAL: NTP CRITICAL: No response from NTP server [00:37:09] (03PS1) 10Dzahn: mediawiki_singlenode: rename defined type [puppet] - 10https://gerrit.wikimedia.org/r/211335 [00:44:49] (03PS1) 10Dzahn: contint: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211337 [00:48:40] (03PS1) 10Dzahn: monitoring, RT: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211339 [00:52:50] (03PS1) 10Dzahn: osm, rsync: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211341 [00:58:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.39% of data above the critical threshold [1000.0] [01:03:44] (03PS1) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [01:04:31] (03CR) 10jenkins-bot: [V: 04-1] lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 (owner: 10Dzahn) [01:08:58] (03PS2) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [01:09:54] (03CR) 10jenkins-bot: [V: 04-1] lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 (owner: 10Dzahn) [01:11:38] (03PS1) 10Dzahn: mirrors: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211345 [01:15:13] (03PS3) 10Dzahn: lvs: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211343 [01:20:07] (03PS2) 10Dzahn: mirrors: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211345 [01:20:21] (03PS1) 10Dzahn: labs_lvm: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211346 [01:22:41] (03PS1) 10Dzahn: dynamicproxy: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211347 [01:29:33] (03PS1) 10Dzahn: snapshot: 
lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211348 [01:32:04] (03PS2) 10Dzahn: snapshot: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211348 [01:34:58] (03PS1) 10Dzahn: ocg: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211349 [01:38:47] (03PS1) 10Dzahn: jenkins,package_builder,labs_bootstrapvz: lint [puppet] - 10https://gerrit.wikimedia.org/r/211350 [01:44:13] (03PS1) 10Dzahn: statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 [01:48:42] (03PS2) 10Dzahn: statistics: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211351 [01:58:46] (03PS1) 10Dzahn: varnish: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211352 [02:00:42] (03PS1) 10Dzahn: gitblit: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211353 [02:04:47] (03PS1) 10Dzahn: quarry: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211354 [02:20:41] (03PS1) 10Dzahn: nrpe: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211355 [02:20:43] (03PS1) 10Dzahn: openstack: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211356 [02:22:02] (03PS1) 10Dzahn: mariadb: indentation fixes [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/211357 [02:22:59] !og es-tool restart-fast on elastic1029 [02:25:05] (03PS1) 10Dzahn: mysql: indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/211358 [02:25:07] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 55s) [02:25:23] Logged the message, Master [02:28:37] (03PS1) 10Dzahn: phabricator: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211359 [02:29:41] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-16 02:28:37+00:00 [02:29:48] Logged the message, Master [02:29:57] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (91005s 90000s) [02:33:31] (03PS1) 10Dzahn: git: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/211360 [02:35:07] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [02:39:37] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [02:43:19] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 04m 55s) [02:43:29] Logged the message, Master [02:47:11] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-16 02:46:08+00:00 [02:47:24] Logged the message, Master [02:47:37] PROBLEM - puppet last run on mw2108 is CRITICAL Puppet has 1 failures [02:55:06] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1330 bytes in 0.254 second response time [02:58:43] !og es-tool restart-fast on elastic1030 [03:04:07] RECOVERY - puppet last run on mw2108 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [03:25:37] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1329 bytes in 0.244 second response time [03:53:46] mutante: still there? 
[04:11:47] !log restarting sshd and generally poking around on labvirt1003 [04:11:53] Logged the message, Master [04:13:28] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [04:17:07] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [04:18:22] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290156 (10Andrew) 3NEW a:3Cmjohnson [04:19:40] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290168 (10Andrew) There's a fair amount of other ugliness in dmesg, e.g. [1843134.114144] INFO: task gmond:61831 blocked for more than 120 seconds. [1843134.145729] Not tainted 3.13.0-49-generic #83-Ub... [04:20:34] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290169 (10Andrew) I'm going to leave the system up for now, since we might as well minimize the labs outage. I can't imagine this isn't going to require a dc visit though :( [04:20:53] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290170 (10Andrew) p:5Triage>3Unbreak! [04:25:07] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0] [04:30:07] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [04:33:37] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [04:36:51] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290172 (10Andrew) Oh, btw, sshd and ganglia-monitor are comatose on that system for reasons that are unclear to me. The mgmt console is working fine. 
[04:41:38] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [04:52:37] PROBLEM - puppet last run on mw2125 is CRITICAL puppet fail [05:02:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat May 16 05:01:02 UTC 2015 (duration 1m 1s) [05:02:12] Logged the message, Master [05:04:46] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [05:08:26] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (18051 90000s) [05:12:18] RECOVERY - puppet last run on mw2125 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:26:38] PROBLEM - puppet last run on db1073 is CRITICAL Puppet has 1 failures [05:39:17] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [05:44:56] RECOVERY - puppet last run on db1073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:50:57] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [06:01:07] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: Connection refused by host [06:01:07] PROBLEM - DPKG on labvirt1003 is CRITICAL: Connection refused by host [06:01:36] PROBLEM - salt-minion processes on labvirt1003 is CRITICAL: Connection refused by host [06:02:07] PROBLEM - Disk space on labvirt1003 is CRITICAL: Connection refused by host [06:02:17] PROBLEM - dhclient process on labvirt1003 is CRITICAL: Connection refused by host [06:03:23] <_joe_> !log killed nrpe on labvirt1003 - see T99341 [06:03:29] Logged the message, Master [06:05:13] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290209 (10Joe) @andrew why leaving this up would have "minimized the labs outage" is not clear to me. You've basically left a completely broken system (and an UBN!) ticket open to be consumed over the weekend... [06:06:12] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290210 (10Joe) The only reason why I'm not rebooting this machine is that Andrew implied it would mean having downtime for labs, but I don't really see an alternative to an hard powercycle for now. [06:09:48] moring _joe_ [06:10:45] *morning [06:13:14] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290226 (10mark) According to the ILO sensors both the fans and the temp sensors indicate OK/good health, so I doubt it's actually a matter of overheating. [06:22:33] so does anyone have shell access to labvirt1003? 
[06:23:43] serial console doesn't do much for me [06:24:26] mark: salt 'labvirt1003*' cmd.run 'some command' works [06:24:28] on palladium [06:25:21] <_joe_> mark: yeah the serial console is my fault, I tried to run ip link show with strace, and it got in D state immediately [06:25:32] pretty fucked [06:25:35] <_joe_> I am grepping the logs via salt [06:25:36] but I don't think it's really overheating [06:25:38] ok [06:25:40] <_joe_> it's not [06:26:04] <_joe_> we have mce signalling errors, but I was trying to figure out why mcelog --client gives nothing back [06:26:20] we've seen such things on dells with some bios settings [06:26:25] but not sure how that relates to these HPs [06:26:25] <_joe_> then we have a ton of processes stuck trying to reach the network [06:26:41] <_joe_> in close_wait state, to be precise [06:27:05] the cpu alarms show up on a number of hosts and don't appear to be at all related [06:27:15] <_joe_> and lockups in the kernel, related to sending network traffic [06:27:30] <_joe_> ori: the cpu alarms are a well known false positive [06:27:43] <_joe_> everyone at the WMF has alarmed for those at least once :) [06:27:50] yeah [06:28:32] <_joe_> so, my best bet would be a hard reboot, but since the instances on it are running fine, I'd wait for it to be a problem for users in fact, or monday [06:28:36] <_joe_> whatever comes first [06:29:10] <_joe_> I'm 99% sure it's a software issue, but still, mce logs (that I'm trying to understand how to read) [06:29:46] i'm still curious about why multiple labvirt* hosts starting showing an "error while receiving frame on vnetXX: Network is down" within a few minutes of each other at around 18:00 [06:29:56] PROBLEM - puppet last run on mw2092 is CRITICAL puppet fail [06:29:57] texting coren [06:30:13] <_joe_> ori: maybe related to some changes andrew was attempting [06:30:22] <_joe_> right, coren is in the right TZ now :) [06:30:44] <_joe_> mark: I'm getting off for a few minutes, my daughter wants breakfast :) [06:30:48] ok [06:31:57] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:32:27] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 1 failures [06:32:28] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures [06:33:16] PROBLEM - puppet last run on mw1042 is CRITICAL Puppet has 2 failures [06:33:17] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:34:17] PROBLEM - puppet last run on mw1166 is CRITICAL Puppet has 2 failures [06:45:08] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:45:17] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [06:45:47] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:48:36] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:02:41] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290273 (10ArielGlenn) The labs instances on the box seem to be working fine fwiw. 
[07:05:46] RECOVERY - puppet last run on mw1166 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:07:57] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:07:57] RECOVERY - puppet last run on mw1042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:07] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:48] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:23:27] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [07:25:06] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:39:46] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [07:40:09] PROBLEM - High load average on labstore1001 is CRITICAL 77.78% of data above the critical threshold [24.0] [07:46:47] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [07:49:01] !log restart hhvm on mw1234, still pushing xhprof metrics [07:49:07] Logged the message, Master [08:00:36] akosiaris, hi, do you know if i have been granted graphoid cluster access? I see strange patterns of restarts on the graphoid [08:00:41] http://grafana.wikimedia.org/#/dashboard/db/graphoid [08:00:59] gwicke, ^ [08:21:01] 6operations, 10ops-esams: Implement CWDM between knams and esams - https://phabricator.wikimedia.org/T98971#1290280 (10mark) 5Open>3Resolved a:3mark Both CWDM systems are now in use, with one channel on each fiber. We can add channels (e.g. management from now on), but that doesn't need to block this tic... [08:25:27] PROBLEM - High load average on labstore1001 is CRITICAL 87.50% of data above the critical threshold [24.0] [08:59:57] RECOVERY - DPKG on labvirt1003 is OK: All packages OK [08:59:57] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [09:00:27] RECOVERY - RAID on labvirt1003 is OK no RAID installed [09:00:37] RECOVERY - salt-minion processes on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:00:56] RECOVERY - configured eth on labvirt1003 is OK - interfaces up [09:00:57] RECOVERY - SSH on labvirt1003 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [09:01:07] RECOVERY - Disk space on labvirt1003 is OK: DISK OK [09:01:27] RECOVERY - dhclient process on labvirt1003 is OK: PROCS OK: 0 processes with command name dhclient [09:01:48] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [09:10:00] !log bounce hhvm on mw1141 [09:10:07] Logged the message, Master [09:12:07] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [09:15:43] !log bounce hhvm on mw1196 [09:15:50] Logged the message, Master [09:16:15] godog, around? could you send me the logs from graphoid please? [09:16:22] its failing for some reason [09:17:05] yurik: sure, which hosts in particular? [09:17:56] godog, they are running on sca1... 
and sca2..., no idea what host is actually causing the restarts [09:18:09] i am guessing that sca1 is the only one active atm [09:18:18] in eqiad [09:19:36] godog, if you want, put them on stat1002 into my homedir [09:22:00] yurik: should be in your home [09:22:06] thanks! [09:22:17] yw [09:30:07] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 745 [09:35:07] RECOVERY - check_mysql on db1008 is OK: Uptime: 2580575 Threads: 1 Questions: 7918388 Slow queries: 17140 Opens: 42915 Flush tables: 2 Open tables: 64 Queries per second avg: 3.068 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:01:48] (03PS1) 10KartikMistry: Beta: Enable all languages in source [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [10:06:07] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 62.50% of data above the critical threshold [35.0] [10:38:26] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [11:21:46] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [11:37:08] (03PS1) 10Faidon Liambotis: Revert "depool ulsfo due to traffic issues" [dns] - 10https://gerrit.wikimedia.org/r/211389 [11:37:15] (03PS2) 10Faidon Liambotis: Revert "depool ulsfo due to traffic issues" [dns] - 10https://gerrit.wikimedia.org/r/211389 [11:37:29] (03CR) 10Faidon Liambotis: [C: 032] Revert "depool ulsfo due to traffic issues" [dns] - 10https://gerrit.wikimedia.org/r/211389 (owner: 10Faidon Liambotis) [12:10:57] PROBLEM - puppet last run on mw2023 is CRITICAL Puppet has 1 failures [12:17:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [12:27:17] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:46:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [12:57:56] (03PS1) 10KartikMistry: Typo: Fix typo in cxserver module [puppet] - 10https://gerrit.wikimedia.org/r/211392 [12:59:33] (03PS1) 10Chmarkine: transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) [13:05:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [13:18:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [13:25:12] !log es-tool restart-fast on elastic1031 [13:25:23] Logged the message, Master [13:26:58] !log that was the last server in the elasticsearch rolling restart. all done. now we have new versions of the plugins. Lets try not to do that again. [13:27:05] Logged the message, Master [13:30:43] 6operations, 10ops-eqiad, 6Labs: labvirt1003 overheating - https://phabricator.wikimedia.org/T99341#1290542 (10Andrew) 5Open>3Resolved Detailed report is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150515-LabsOutage [14:21:06] 6operations, 10Wikimedia-DNS, 10Wikimedia-Video: Please set up a CNAME for videoserver.wikimedia.org to Video Editing Server - https://phabricator.wikimedia.org/T99216#1290616 (1080686) good point about the privacy policy. My suggestion is that we point at the WMF privacy policy, as this is a service set up... 
[14:30:37] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [14:52:36] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [15:00:39] (03PS2) 10KartikMistry: Beta: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [15:01:45] (03PS3) 10KartikMistry: Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) [15:09:00] (03CR) 10Santhosh: [C: 031] Beta: CX: Enable all languages in source and target [puppet] - 10https://gerrit.wikimedia.org/r/211371 (https://phabricator.wikimedia.org/T98946) (owner: 10KartikMistry) [15:16:57] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:20:17] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:25:43] (03CR) 10Nikerabbit: [C: 031] CX: Enable 'cxstats' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211116 (owner: 10KartikMistry) [15:26:31] (03PS1) 10Dereckson: Fixed whitespace issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211407 [15:26:33] (03PS1) 10Dereckson: Site name configuration on ast.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211408 (https://phabricator.wikimedia.org/T99315) [15:33:05] (03CR) 10JanZerebecki: [C: 031] transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [15:36:28] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:41:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [15:50:08] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [16:05:38] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [16:21:57] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [16:30:20] (03CR) 10Bartosz DziewoƄski: [C: 04-1] "This is a step backwards. See my comment on the task for more." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [16:50:08] PROBLEM - puppet last run on mw1205 is CRITICAL Puppet has 1 failures [17:06:17] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:08:37] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [17:37:12] I have fixed the issue that caused constant graphoid restarts. Will do deployment now, should be low impact. CC: greg-g godog [17:38:20] https://gerrit.wikimedia.org/r/#/c/211431/ [17:39:18] now? you know it's a saturday evening / morning, if something breaks who will be around to fix it? 
[17:39:23] yurik: [17:39:39] apergos, i will be here for as long as it takes to fix it [17:40:07] please do babysit it after the deploy then for awhile [17:40:17] i mean, i could wait for monday, but its crashing graphoid nonstop [17:40:34] apergos, http://grafana.wikimedia.org/#/dashboard/db/graphoid [17:40:37] take a look at the top line [17:40:55] the red indicates crashes [17:41:45] apergos, come to think of it, you are right, i shouldn't mess with it for such low numbers. [17:41:52] on weekend [17:41:54] will wait [17:42:02] whew [17:42:06] :D [17:42:11] just because... nothing ever takes 5 minutes :-D [17:42:19] see you back here on monday then [17:42:30] apergos, its not about time - i can totally babysit it until it gets working again [17:42:36] I know but. [17:55:26] (03PS1) 10ArielGlenn: nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 [17:56:15] (03CR) 10jenkins-bot: [V: 04-1] nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [17:56:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [17:58:45] (03PS2) 10ArielGlenn: nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 [17:59:38] yurik: you around? [17:59:44] gwicke, yep [17:59:48] hi [17:59:50] hey [18:00:04] the restart issues are not surprising [18:00:41] yeah - but its not good - because if the same service is handling something else at the same time, it will drop processing it [18:00:48] the deploy repo is ready to be synced [18:00:58] but apergos convinced me to wait until monday :) [18:01:12] I did! [18:01:43] I'm taking today to get a few patches in too, but no merges. [18:01:47] what do you mean with 'if the service is handling something else at the same time'? [18:01:54] you mean the restart is not graceful? [18:02:10] gwicke, on the grafena graph, what's the stepping of the graph? e.g. if it says 2 requests, whats the time unit? [18:02:22] re restarts - it crashes with an exception [18:02:29] yurik: rates are per second [18:02:30] the exception is unhandled [18:02:38] hmm, fairly hi [18:02:40] high [18:03:05] gwicke, https://phabricator.wikimedia.org/T99349 [18:03:25] for some things 2/s can maybe be considered high, but I'm not sure if this is one of them ;) [18:03:44] hehe, current is 4 [18:03:45] http://grafana.wikimedia.org/#/dashboard/db/graphoid [18:03:59] 4 restarts per second is a bit high [18:04:30] gwicke, what do you think, should i sync up the graphoid deploy repo today? [18:05:00] the good thing - its fairly isolated from the rest of the platform [18:05:16] if this is so common, how come it didn't show up in tests? [18:05:32] gwicke, because i suspect its one graph thats causing it [18:05:44] can you verify in logstash? [18:05:55] logstash is not showing anything related to graphoid [18:06:12] the issue is that the external data that some graph is ussing is not properly formed [18:06:15] thus causing an exception [18:07:05] is logging not set up properly? [18:07:25] gwicke, i don't know - godog has sent me the logs from the sca1001 machine [18:07:36] that's how i figured it out [18:08:10] we have generally progressed past local logging [18:08:22] would be good to fix that [18:09:26] gwicke, https://phabricator.wikimedia.org/T97615 [18:09:37] 2 weeks ago :) [18:09:39] yurik: what is the rate of restarts relative to the total # of requests? 
[18:10:06] gwicke, per that graph above (the blue line) - considerable [18:10:10] 1:1 [18:10:14] or 1:2 [18:10:55] hmm [18:11:13] so you are saying it's basically completely broken [18:12:00] so maybe i should deploy. Could you take a look at the patch https://gerrit.wikimedia.org/r/#/c/211431/ [18:12:27] hmm, there doesn't seem to be any logging config in config.yaml [18:12:28] gwicke, i think its not - because the errors would not show as successes [18:12:33] did you just forget to set that up? [18:12:44] gwicke, config is generated on the fyl [18:12:44] fly [18:12:48] is this a critical service/is it an emergency? that's what I would ask before a weekend deployment [18:13:10] critical - well, not much in WP is really critical, its an info service :D [18:13:22] apergos: it's not going to break the site [18:14:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:14:36] gwicke, so i am not exactly sure what config puppets generate on the actual service - will wait for the sca1001 access to look [18:14:45] its a template from puppet [18:14:59] yurik: https://github.com/wikimedia/operations-puppet/blob/production/modules/graphoid/templates/config.yaml.erb is missing a logging stanza [18:15:20] example: https://github.com/wikimedia/operations-puppet/blob/production/modules/restbase/templates/config.yaml.erb#L11-L18 [18:15:47] gwicke, i would speak with mobrovac first- maybe he did it for a reason? [18:16:21] him and akosiaris worked on setting it up, would want to ping them first [18:17:42] gwicke, btw, strange - if the logging is not setup, how could there be a log file on sca1001 [18:19:01] default is to log to stdout [18:19:16] stdout is captured? [18:19:35] and afaik alex wanted to redirect that to a local log file, despite the risk of taking the service down on full disk [18:20:14] a lesson we learned with Parsoid [18:20:49] exactly why i don't want to touch it without consulting them first [18:21:03] gwicke, anyway, you thoughts, should i deploy now? [18:21:33] yurik: my main concern with deploying now is that the testing for graphoid doesn't seem to be very comprehensive [18:22:12] on the surface it looks like it's very broken already, so any change is likely to improve things, but it worries me [18:22:14] gwicke, that is true - but testing in this case is fairly complex - its not like we have to test if the service is working, we have to test the entire Vega grammar [18:22:47] e.g. what if vega does some extra processing async, and throws up? [18:23:15] with async processing, it is not possible to just wrap the call in try/catch and handle it [18:23:54] unless js has some global unhandled async handler, which i could simply log and continue... not the best solution [18:24:06] do you have a link to your changes? [18:24:19] you can catch exceptions globally [18:24:24] ^^^^ [18:24:27] sec, will repost [18:24:37] and promises let you propagate exceptions asynchronously [18:24:39] https://gerrit.wikimedia.org/r/#/c/211431/ [18:25:01] gwicke, that's only if the lib uses promises :) [18:25:08] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.72% of data above the critical threshold [1000.0] [18:25:08] which in this case it doesn't :( [18:26:04] i will try to speak with their devs, see if i can convince them to start working in the open, instead of simply publishing their internal work once in a while [18:26:15] did you actually verify that it helps, and that it doesn't break other graphs? 
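A minimal sketch of the process-level hooks being discussed just above ("you can catch exceptions globally", "promises let you propagate exceptions asynchronously"). This is not graphoid's actual code: the toy HTTP server, the port, and the console logging are illustrative assumptions. The point is that an exception thrown inside an async callback (for example during vega rendering) bypasses any try/catch wrapped around the request handler, so the only remaining place to log it before the worker dies is a process-level handler; promise-based code instead propagates such errors to a .catch() or the unhandledRejection event.

    'use strict';
    var http = require('http');

    // Synchronous throws inside async callbacks (timers, I/O, rendering
    // callbacks) cannot be caught by a try/catch around the request handler;
    // they surface here instead. Log, then exit: state may be corrupt, so let
    // the supervisor (upstart/systemd/service runner) restart the worker.
    process.on('uncaughtException', function (err) {
        console.error('uncaught exception, worker exiting:', err.stack || err);
        process.exit(1);
    });

    // Errors that propagate through promises but are never .catch()ed end up
    // here (supported on io.js / newer Node; older 0.10-era Node lacks it).
    process.on('unhandledRejection', function (reason) {
        console.error('unhandled promise rejection:', reason);
    });

    // Toy server standing in for the service: one process handles many
    // concurrent requests, which is why a single malformed graph definition
    // can crash requests that have nothing to do with it.
    http.createServer(function (req, res) {
        setTimeout(function () {
            if (req.url.indexOf('bad') !== -1) {
                // async throw: only the uncaughtException hook above sees this
                throw new Error('malformed external data for this graph');
            }
            res.end('ok\n');
        }, 10);
    }).listen(8080);

Exiting on uncaughtException and letting the supervisor restart the process is the conventional choice; swallowing the error and continuing risks serving requests from corrupted state, which is part of why the specific failing graph still needs to be identified and fixed.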
[18:26:32] partially - i don't know which graph is causing this issue [18:26:44] i suspect i understand what causes it, but don't know for suer [18:26:45] sure [18:26:52] i am ok with waiting until monday [18:28:28] gwicke, question for you: does a single service instance handle multiple requests? e.g. it handles one at a time, but when it does something async, it handles another request in the mean time [18:28:32] I can have a look at the varnish log to see if I can identify the graph [18:28:39] sounds good [18:29:11] thanks [18:29:18] could i do it too/ [18:30:08] not sure if you have shell on cp1045 and cp1058 [18:30:40] nope ( [18:33:44] the paths will all have 'png' in them, right? [18:34:47] gwicke, yep [18:35:13] hmm, i just realized i could have checked in hive [18:35:47] varnishncsca doesn't seem to show any requests for pngs [18:38:43] gwicke, i will check via hive [18:38:54] (03PS1) 10ArielGlenn: deployment server init should configure repo every time [puppet] - 10https://gerrit.wikimedia.org/r/211435 [18:39:29] yurik: my recommendation is to find the failing graph, create a test for it & verify that it's fixed [18:39:36] (03CR) 10jenkins-bot: [V: 04-1] deployment server init should configure repo every time [puppet] - 10https://gerrit.wikimedia.org/r/211435 (owner: 10ArielGlenn) [18:39:49] gwicke, will try ) [18:39:49] to find the graph, you might have to improve logging [18:40:56] apparently hue is no longer letting me in, i will try to do it via direct querying [18:42:03] (03PS2) 10ArielGlenn: deployment server init should configure repo every time [puppet] - 10https://gerrit.wikimedia.org/r/211435 [18:51:42] gwicke, is it possible that varnish is not logging any %graphoid% access to hadoop? [18:52:12] this returns no results: select * FROM wmf.webrequest WHERE year=2015 AND month=5 AND day=16 AND hour=16 AND uri_host like '%graphoid%' limit 100; [18:54:33] it looks likely [18:54:57] great [18:56:58] the good news is that RB logs backend errors [18:58:40] RB? [18:58:44] restbase [19:20:08] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [19:24:10] (03CR) 10ArielGlenn: "I'd like to see that second approach implemented; updating the source repo from itself is just wrong." 
[puppet] - 10https://gerrit.wikimedia.org/r/201344 (https://phabricator.wikimedia.org/T94754) (owner: 10BryanDavis) [19:46:25] (03PS1) 10Yurik: Removed localhost access by graphoid [puppet] - 10https://gerrit.wikimedia.org/r/211450 [19:48:27] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [20:06:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:09:37] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [20:36:31] yurik: https://github.com/wikimedia/restbase/pull/247 [21:09:16] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [22:08:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [22:20:16] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:33:08] PROBLEM - puppet last run on ms-be2001 is CRITICAL puppet fail [22:35:27] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [22:50:56] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:58:46] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [23:06:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:11:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [23:30:17] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [23:31:38] kart_: where and how can i use contenttranslation?