[00:01:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:07:20] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:44:41] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[01:04:49] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail
[01:11:49] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[01:31:49] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:37:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 601 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6228166 keys - replication_delay is 601
[01:38:49] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:45:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6223765 keys - replication_delay is 0
[01:56:00] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: puppet fail
[02:20:00] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6226079 keys - replication_delay is 610
[02:21:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6222531 keys - replication_delay is 0
[02:24:50] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[02:24:54] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 09m 04s)
[02:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 30 02:30:46 UTC 2016 (duration 5m 52s)
[02:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:21] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-06-01 02:30:53.
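A note on the Redis alert format above: "replication_delay is 601 600" appears to list the measured slave delay first and the critical threshold second (601 seconds against a 600-second limit), a reading consistent with every such alert in this log (610, 618, 626 and 682, each paired with 600). A minimal Ruby sketch of that comparison, inferred from the alert text alone rather than taken from the actual Icinga plugin:

```ruby
# Hypothetical reconstruction of the threshold logic behind the alerts above.
# Format inferred from the log: "replication_delay is <measured> <threshold>".
def replication_delay_status(measured, critical = 600)
  if measured > critical
    "CRITICAL: replication_delay is #{measured} #{critical}"
  else
    "OK: replication_delay is #{measured}"
  end
end

puts replication_delay_status(601)  # => CRITICAL: replication_delay is 601 600
puts replication_delay_status(0)    # => OK: replication_delay is 0
```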
[03:48:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 618 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6229343 keys - replication_delay is 618
[03:56:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6219247 keys - replication_delay is 0
[03:58:30] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[04:04:20] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[04:29:29] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:42:45] (PS1) Ori.livneh: Drop dependency on wikimedia/cdb [mediawiki-config] - https://gerrit.wikimedia.org/r/291681
[04:54:31] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:57:49] bd808 tried to undo the breakage by reinstalling with composer 1.0.x, but he did not revert erik's patch, so this left the busted ./composer/autoload_static.php in place
[04:58:15] i ran composer update with 1.1 and thought i'd be ok if i don't commit anything related to the composer update
[04:58:29] this added a line to autoload_static.php which caused it to be linted
[04:59:05] * bd808 feels a disturbance in the force
[04:59:30] do we have broken vendor for php 5.6+ again?
[04:59:48] not really, but things are a bit wonky
[05:00:10] when you reinstalled with 1.0.x, you did not remove autoload_static.php
[05:00:26] Really? that was wrong
[05:00:32] T135161 has the gory details
[05:00:32] 1.0.x does not generate a file by that name, so your reinstall left it as an orphan
[05:00:32] T135161: Composer v1.1.0 generated vendor dirs will fail lint by PHP <5.6 - https://phabricator.wikimedia.org/T135161
[05:00:47] ah crap. we need to kill it
[05:48:14] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:58:58] (CR) Mobrovac: "Oh, I see. Cool. But let's also remove tilerator/deploy from hieradata/common/role/deployment.yaml" [puppet] - https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: Thcipriani)
[06:12:34] (CR) Mobrovac: Partially port RESTBaseUpdateJobs to change propagation. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[06:13:15] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:25:04] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:26:23] (PS1) Giuseppe Lavagetto: mediawiki::hhvm: debian jessie compatibility [puppet] - https://gerrit.wikimedia.org/r/291687 (https://phabricator.wikimedia.org/T131749)
[06:26:53] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.309 second response time
[06:29:24] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[06:30:44] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:23] (CR) Alexandros Kosiaris: "pcc at puppet-compiler.wmflabs.org/2979 is quite happy, making jenkins happy now" [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[06:35:24] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[06:35:30] (PS2) Giuseppe Lavagetto: mediawiki::hhvm: debian jessie compatibility [puppet] - https://gerrit.wikimedia.org/r/291687 (https://phabricator.wikimedia.org/T131749)
[06:38:41] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki::hhvm: debian jessie compatibility [puppet] - https://gerrit.wikimedia.org/r/291687 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[06:40:34] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:40:38] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:40:43] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:41:54] hmm
[06:42:34] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.032 second response time
[06:42:37] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.067 second response time
[06:42:37] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.271 second response time
[06:43:14] not sure how this got fixed
[06:43:39] <_joe_> not sure why we're getting flooded by these messages
[06:43:48] that too
[06:53:23] (PS3) Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - https://gerrit.wikimedia.org/r/291263
[06:53:25] (PS25) Alexandros Kosiaris: network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[06:53:27] (PS5) Alexandros Kosiaris: network: Move into module [puppet] - https://gerrit.wikimedia.org/r/291234
[06:53:29] (PS10) Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - https://gerrit.wikimedia.org/r/291219
[06:56:54] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[06:57:34] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:53] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[07:03:04] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[07:03:55] (PS1) Giuseppe Lavagetto: nutcracker: add an additional guard on the master version [puppet] - https://gerrit.wikimedia.org/r/291688 (https://phabricator.wikimedia.org/T131749)
[07:09:11] (CR) Giuseppe Lavagetto: [C: 2] nutcracker: add an additional guard on the master version [puppet] - https://gerrit.wikimedia.org/r/291688 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[07:09:21] (CR) Giuseppe Lavagetto: [V: 2] nutcracker: add an additional guard on the master version [puppet] - https://gerrit.wikimedia.org/r/291688 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[07:09:37] <_joe_> 5 minutes and no jenkins-bot verified
[07:09:42] <_joe_> this is ridiculous
[07:12:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[07:26:43] Hello
[07:27:10] Alexz, what am I doing today?
[07:27:25] You
[07:27:47] _joe_: zuul is hung
[07:27:57] there is a job that has been running for 2 hrs
[07:27:59] https://integration.wikimedia.org/zuul/
[07:28:06] hashar: ^
[07:28:45] (CR) Alexandros Kosiaris: [C: 1] varnish: Fix PEP8 violations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291187 (owner: BryanDavis)
[07:29:01] (PS5) Alexandros Kosiaris: varnish: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291187 (owner: BryanDavis)
[07:29:08] (CR) Alexandros Kosiaris: [C: 2 V: 2] varnish: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291187 (owner: BryanDavis)
[07:30:29] (PS4) Alexandros Kosiaris: mailman: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291180 (owner: BryanDavis)
[07:30:51] (CR) Alexandros Kosiaris: [C: 2 V: 2] mailman: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291180 (owner: BryanDavis)
[07:34:39] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection timed out
[07:35:17] (CR) Alexandros Kosiaris: [C: 1] pybal: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291183 (owner: BryanDavis)
[07:35:19] PROBLEM - nutcracker port on mw1262 is CRITICAL: Timeout while attempting connection
[07:35:39] PROBLEM - nutcracker process on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:00] PROBLEM - puppet last run on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:19] PROBLEM - salt-minion processes on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:40] PROBLEM - Check size of conntrack table on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:59] PROBLEM - DPKG on mw1262 is CRITICAL: Timeout while attempting connection
[07:37:18] PROBLEM - Disk space on mw1262 is CRITICAL: Timeout while attempting connection
[07:37:48] PROBLEM - RAID on mw1262 is CRITICAL: Timeout while attempting connection
[07:38:19] PROBLEM - configured eth on mw1262 is CRITICAL: Timeout while attempting connection
[07:38:38] PROBLEM - dhclient process on mw1262 is CRITICAL: Timeout while attempting connection
[07:38:39] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group
[07:39:50] anybody working on --^ ?
[07:42:01] ah probably one of the newer app severs
[07:42:05] *servers
[07:42:18] <_joe_> elukey: yes
[07:42:24] <_joe_> it's me :)
[07:42:39] <_joe_> not in lvs, not even in the scap sync file
[07:42:56] o/
[07:43:48] RECOVERY - salt-minion processes on mw1262 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:43:58] RECOVERY - configured eth on mw1262 is OK: OK - interfaces up
[07:43:59] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.009 second response time
[07:44:09] RECOVERY - Check size of conntrack table on mw1262 is OK: OK: nf_conntrack is 0 % full
[07:44:09] RECOVERY - dhclient process on mw1262 is OK: PROCS OK: 0 processes with command name dhclient
[07:44:39] RECOVERY - nutcracker port on mw1262 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[07:44:48] RECOVERY - Disk space on mw1262 is OK: DISK OK
[07:44:59] RECOVERY - nutcracker process on mw1262 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[07:45:18] RECOVERY - RAID on mw1262 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:46:28] RECOVERY - DPKG on mw1262 is OK: All packages OK
[07:47:34] (PS2) Volans: MariaDB: use 0/1 instead of off/on for read_only [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333)
[07:55:49] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[07:56:30] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server
[07:58:00] <_joe_> is someone working on zuul?
[07:58:19] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[08:00:16] (CR) DCausse: [C: 1] Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: EBernhardson)
[08:01:42] Operations, ops-eqiad, Patch-For-Review: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2338122 (Joe) @Southparkfan apparently for some reason the same DNS record for mw1090 has been assigned to mw1305, which is still turned off for good for now. So w...
[08:02:17] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection refused
[08:02:47] <_joe_> I'll ack all alerts on mw1262
[08:04:26] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[08:04:45] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[08:04:56] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server
[08:06:54] _joe_: I was, I've been logging on #wikimedia-releng
[08:07:11] as the instructions on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues asked
[08:07:20] the alerts echo there, so i didn't notice them here
[08:07:30] <_joe_> ori: ok thanks for taking care of it :)
[08:07:45] it seems to be ok now
[08:08:31] I'm not sure why releng !logs on #wikimedia-releng, using a separate bot and a separate SAL
[08:09:37] if responsibilities were completely and hygienically separated, that'd be one thing, but this channel gets alerts for contint service failures
[08:10:16] IMO we're not so big and the main SAL is not so busy that a separate SAL is warranted
[08:10:20] we can just all log here
[08:10:23] <_joe_> ops: the kitchen sink where all tech debt gets turned on and off again
[08:10:43] <_joe_> (the inversion was intentional)
[08:10:54] (PS1) Gehel: Elasticsearch - configure bind_networks. [puppet] - https://gerrit.wikimedia.org/r/291689
[08:10:59] heh
[08:14:59] (PS2) Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201
[08:15:42] (CR) DCausse: [C: 1] Elasticsearch - configure bind_networks. [puppet] - https://gerrit.wikimedia.org/r/291689 (owner: Gehel)
[08:16:03] (CR) Volans: [C: 2] MariaDB: use 0/1 instead of off/on for read_only [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333) (owner: Volans)
[08:16:05] (CR) Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[08:17:03] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333) (owner: Volans)
[08:18:13] (CR) Gehel: [C: 2] Elasticsearch - configure bind_networks. [puppet] - https://gerrit.wikimedia.org/r/291689 (owner: Gehel)
[08:19:30] (PS3) Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201
[08:27:41] !log starting elasticsearch upgrade on codfw (T133125)
[08:27:42] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[08:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:28:59] (PS3) Volans: MariaDB: use 0/1 instead of off/on for read_only [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333)
[08:33:37] (CR) DCausse: [C: 1] Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: DCausse)
[08:34:51] (CR) Gehel: [C: 2] Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: DCausse)
[08:36:32] (CR) Gehel: [V: 2] Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: DCausse)
[08:44:34] !log Align thread_pool_max_threads to my.cnf value on 1 slave/shard in eqiad (db1065,db1076,db1078,db1040,db1026,db1061,db1039) T133333
[08:44:34] T133333: Audit MySQL configurations - https://phabricator.wikimedia.org/T133333
[08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:53:49] Operations, Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2338184 (Aklapper) For the records, the following projects were changed from yellow tags to blue components lately: #Diamond, #Elasticsearch, #Icinga, #Shinken. (#Graphite, #LDAP, #P...
[08:55:21] Operations, Project-Admins, Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2338185 (Aklapper) Proposing to decline as per last two comments.
[08:56:01] Operations, MediaWiki-General-or-Unknown, HHVM, Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2338188 (Joe) What is left to do: [] Make mediawiki::cgroup work with systemd or change the way we manage cgroups there...
[08:58:02] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[09:01:03] hashar, do you have some time for T126699 ?
[09:01:04] T126699: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699
[09:01:45] I want to merge the puppet patch, but want you for CI testing
[09:04:12] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[09:06:50] (PS1) Filippo Giunchedi: cassandra: add restbase100[89]-c to seeds [puppet] - https://gerrit.wikimedia.org/r/291692 (https://phabricator.wikimedia.org/T134016)
[09:09:02] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase100[89]-c to seeds [puppet] - https://gerrit.wikimedia.org/r/291692 (https://phabricator.wikimedia.org/T134016) (owner: Filippo Giunchedi)
[09:09:56] !log shutting down elasticsearch on codfw for upgrade (T133125)
[09:09:57] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[09:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:13:20] (CR) Mobrovac: [C: 1] Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[09:15:58] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 341 bytes in 0.181 second response time
[09:16:23] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down!
[09:16:23] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down!
[09:16:26] <_joe_> uh?
[09:16:31] <_joe_> what the fuck is up?
[09:16:39] <_joe_> gehel: any idea?
[09:16:45] gehel is updating elastic in codfw
[09:16:46] codfw being restarted (upgrade)
[09:16:58] no user impact, I assume
[09:17:01] _joe_: damn, forgot the LVS check again
[09:17:01] no
[09:17:06] <_joe_> yeah, you might want to do it a bit slower maybe?
[09:17:08] ok, good to know
[09:17:18] <_joe_> I have no idea if that would help
[09:17:29] nope, we need to take the whole cluster down at once, 1.7 and 2.3 are not compatible
[09:17:41] yes full cluster restart :/
[09:17:58] which is not very HA-friendly :-)
[09:18:02] not at all :(
[09:18:56] as long as we have 2 separate clusters...
[09:20:48] ACKNOWLEDGEMENT - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down! Gehel shutting down elasticsearch on codfw for upgrade (T133125)
[09:20:54] ACKNOWLEDGEMENT - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down! Gehel shutting down elasticsearch on codfw for upgrade (T133125)
[09:21:47] godog, if you haven't seen it, there is a degraded array email for ms-be2012
[09:22:54] jynus: thanks, yeah I filed it as https://phabricator.wikimedia.org/T136395 but will reply to the email too
[09:23:06] oh, no need, sorry for pinging you
[09:24:47] (PS1) DCausse: Elastic: update mandatory plugins for codfw [puppet] - https://gerrit.wikimedia.org/r/291694
[09:24:48] no worries at all jynus, not sure if we can stop the emails once the array is degraded
[09:26:27] (CR) Gehel: [C: 2] Elastic: update mandatory plugins for codfw [puppet] - https://gerrit.wikimedia.org/r/291694 (owner: DCausse)
[09:35:46] Operations, DBA: dbtree shows 0 lag for db1047 - https://phabricator.wikimedia.org/T109401#2338289 (Volans) a: Volans
[09:36:31] Serious stuff...
[09:36:45] Serious stuff that this channel is not +t
[09:39:00] lots of Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities hitting db1071
[09:40:45] (PS1) Volans: Exclude db1047 (multisource slave) from dbtree [software/dbtree] - https://gerrit.wikimedia.org/r/291696 (https://phabricator.wikimedia.org/T109401)
[09:43:52] (CR) Jcrespo: [C: 1] Exclude db1047 (multisource slave) from dbtree [software/dbtree] - https://gerrit.wikimedia.org/r/291696 (https://phabricator.wikimedia.org/T109401) (owner: Volans)
[09:49:50] (PS1) Elukey: Set Kafka default cleanup policy to 'delete' to avoid any compaction with 0.9 [puppet] - https://gerrit.wikimedia.org/r/291697
[09:50:19] Operations, media-storage, Tracking: refresh swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2338320 (fgiunchedi) 6x swift systems (all 3TB disks) have been ordered in T130713 and T136336, though we'll be keeping the old swift hw in place for the next 6/9 months a...
[09:50:51] Now I am the one that set the topic lol
[09:51:08] Operations, media-storage, Tracking: refresh swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2338322 (fgiunchedi)
[09:51:39] (CR) Volans: [C: 2 V: 2] Exclude db1047 (multisource slave) from dbtree [software/dbtree] - https://gerrit.wikimedia.org/r/291696 (https://phabricator.wikimedia.org/T109401) (owner: Volans)
[09:52:17] jynus: do you know if additional steps are needed to deploy dbtree code? ^^^
[09:57:28] !log installing libidn security updates on jessie systems
[09:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:45] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[09:59:01] (PS1) DCausse: Elastic: add publish_host support [puppet] - https://gerrit.wikimedia.org/r/291698
[10:02:54] (PS2) Elukey: Set Kafka default cleanup policy to 'delete' to avoid any compaction with 0.9 [puppet] - https://gerrit.wikimedia.org/r/291697
[10:04:16] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2338334 (Ladsgroup) @RobH: Thanks for the response. What I need is access to these sudo actions: ``` 'ALL=(root) NOPASSWD: /usr...
[10:04:48] (CR) Elukey: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/2983" [puppet] - https://gerrit.wikimedia.org/r/291697 (owner: Elukey)
[10:07:04] (PS1) Jcrespo: Reduce db1071 load (regular connection exhaustion from jobs) [mediawiki-config] - https://gerrit.wikimedia.org/r/291703
[10:08:31] (CR) Gehel: [C: 2 V: 2] "Jenkins not reacting, change is trivial enough, so I'll v+2" [puppet] - https://gerrit.wikimedia.org/r/291698 (owner: DCausse)
[10:09:24] gehel: jenkins is not reacting because jenkins-bot was not subscribed to the change... something is wrong
[10:10:48] volans: I have to admit that I have no idea how this integration works...
[10:11:01] (CR) Jcrespo: [C: 2 V: 2] Reduce db1071 load (regular connection exhaustion from jobs) [mediawiki-config] - https://gerrit.wikimedia.org/r/291703 (owner: Jcrespo)
[10:12:34] yeah, hudson is down
[10:12:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reduce db1071 load (duration: 00m 48s)
[10:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:20:27] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:24:44] (PS1) Giuseppe Lavagetto: base::grub: fix the ioscheduler setting [puppet] - https://gerrit.wikimedia.org/r/291706
[10:24:46] (PS1) Giuseppe Lavagetto: base::grub: actually use augeas on jessie [puppet] - https://gerrit.wikimedia.org/r/291707
[10:24:52] <_joe_> paravoid: ^^
[10:25:26] * akosiaris_ at the hospital, won't be around for a bit
[10:25:41] :-(
[10:26:47] akosiaris: gah, take care
[10:30:03] going to fix up zuul
[10:30:09] !log Zuul deadlocked :(
[10:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:30:23] bah it died :(
[10:31:30] !log upgrading hhvm on mw1017 (also picking up updated versions of icu and lcms)
[10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:29] _joe_, isn't that incorrect, too?
[10:32:51] <_joe_> jynus: I miss context
[10:32:56] <_joe_> what is incorrect?
[10:33:00] shouldn't we just do elevator=$ioscheduler where elevator=.*
[10:33:33] <_joe_> jynus: right
[10:33:42] <_joe_> jynus: although we don't want .*
[10:33:53] <_joe_> and I have no idea if regexes can be used in selectors
[10:34:07] yeah, the idea, I can help with the implementation
[10:35:36] !log Restarted Zuul.
[10:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:35:40] I am interested in this because I would like to try noop give the newest hardware
[10:36:03] *given
[10:36:21] (CR) Hashar: "check experimental" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:23] (CR) Paladox: "check experimental" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:25] (CR) Hashar: [C: 2] Make the builder script less simple [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:32] (CR) Hashar: Make the builder script less simple [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:37] Operations, DBA, Patch-For-Review: dbtree shows 0 lag for db1047 - https://phabricator.wikimedia.org/T109401#2338381 (Volans) Open>Resolved For multisource slaves the data in the tendril table `slave_status` is saved with the shard prefix (i.e. `s1.seconds_behind_master`) and is not found by...
[10:39:57] (CR) Ema: varnish: jemalloc tuning for frontend caches (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291592 (https://phabricator.wikimedia.org/T135384) (owner: BBlack)
[10:43:29] <_joe_> jynus: I'll run some tests
[10:46:18] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:51:08] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures
[10:52:57] (PS2) Ema: tlsproxy: trim indentation in localssl.erb [puppet] - https://gerrit.wikimedia.org/r/291253
[10:55:53] I was doing the same, it seems that augeas has some issues with the latest grubs
[10:57:05] I am going to ack elastic codfw errors, I cannot see a thing on icinga
[10:57:06] (CR) Ema: [C: 2 V: 2] tlsproxy: trim indentation in localssl.erb [puppet] - https://gerrit.wikimedia.org/r/291253 (owner: Ema)
[10:57:43] !log upgrading hhvm on remaining canaries (also picking up updated versions of icu and lcms)
[10:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:00:47] now we can see the important things, like etherpad
[11:05:38] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:07:35] (PS1) Filippo Giunchedi: prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710
[11:07:38] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[11:08:20] (CR) jenkins-bot: [V: -1] prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710 (owner: Filippo Giunchedi)
[11:11:10] (CR) Faidon Liambotis: [C: 1] "(wears brown paper bag)" [puppet] - https://gerrit.wikimedia.org/r/291707 (owner: Giuseppe Lavagetto)
[11:11:18] * godog shakes fist at jenkins
[11:11:20] (CR) Faidon Liambotis: [C: 1] base::grub: fix the ioscheduler setting [puppet] - https://gerrit.wikimedia.org/r/291706 (owner: Giuseppe Lavagetto)
[11:13:37] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 682 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6255588 keys - replication_delay is 682
[11:17:17] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:21:29] Operations, MediaWiki-Categories, HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2338413 (NickK) Thanks, I confirm that the problem is resolved.
[11:32:31] (PS9) Filippo Giunchedi: prometheus: add server support [puppet] - https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785)
[11:32:33] (PS3) Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785)
[11:32:35] (PS2) Filippo Giunchedi: prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710
[11:42:51] !log upgrading hhvm in codfw (also picking up updated versions of icu and lcms)
[11:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:55:33] PROBLEM - HHVM rendering on mw2149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:57:08] (CR) jenkins-bot: [V: -1] prometheus: add nginx reverse proxy [puppet] - https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: Filippo Giunchedi)
[11:57:25] RECOVERY - HHVM rendering on mw2149 is OK: HTTP OK: HTTP/1.1 200 OK - 71761 bytes in 0.371 second response time
[11:58:09] Operations, ops-eqiad, Analytics-Kanban, DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2338489 (elukey) Open>Resolved
[11:59:10] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2338493 (Ladsgroup) Also what about adding to "deploy-service" group?
[12:01:24] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:02:20] gehel: ^
[12:02:41] paravoid: thanks, having a look right now
[12:03:26] alert is on eqiad, which has mostly no traffic at the moment, so 95th percentile is most probably not representative...
[12:03:50] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.174 second response time
[12:03:58] dcausse: ^ fyi
[12:03:58] (PS1) Ladsgroup: Add ores-admins group [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406)
[12:04:07] Operations, ops-esams, DC-Ops, netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#2338497 (faidon) Open>Resolved
[12:04:20] <_joe_> gehel: eqiad has no traffic?
[12:04:26] <_joe_> or codfw?
[12:04:32] codfw I suppose
[12:04:38] <_joe_> because codfw was down for most of the morning
[12:04:52] _joe_: my bad, codfw has no traffic and alert is for codfw
[12:05:44] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[12:06:12] (CR) Ladsgroup: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291565 (owner: Ladsgroup)
[12:06:13] Operations, ops-esams, DC-Ops: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#2338512 (faidon) PEM 2 is powered, but by the same PDU. PEM 3 is not powered and is also unplugged from the chassis, which downgrades the alarm from a Major (red) to a Minor (yellow). This will pro...
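gehel's point above, that a 95th percentile is "not representative" when a cluster serves almost no traffic, is easy to see numerically: with only a handful of samples, a single slow request becomes the p95 itself. A small illustrative Ruby sketch (nearest-rank percentile over made-up latencies; not the actual graphite check):

```ruby
# Nearest-rank percentile over request latencies in milliseconds (made up).
def percentile(values, pct)
  sorted = values.sort
  sorted[(pct / 100.0 * sorted.length).ceil - 1]
end

busy  = [120, 130, 140, 150, 160] * 20 + [1200]  # 101 samples, one slow outlier
quiet = [120, 130, 140, 150, 1200]               # 5 samples, same outlier

puts percentile(busy, 95)   # => 160  (the outlier is absorbed by volume)
puts percentile(quiet, 95)  # => 1200 (the outlier IS the p95; trips a [1000.0] threshold)
```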
[12:06:14] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[12:07:08] !log nginx restarted on elasticsearch codfw cluster (T133125)
[12:07:09] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[12:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:08:44] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail
[12:11:18] Operations, ops-esams, DC-Ops: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#2338519 (faidon) The Icinga [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-esams&service=Juniper+alarms#comments | alert for the chassis alarm ]] has been acknowledged. T...
[12:11:50] job queue size is in an ascending pattern
[12:12:15] https://grafana-admin.wikimedia.org/dashboard/db/job-queue-health?from=1464523928505&to=1464610028505&var-jobType=all
[12:13:05] maybe the second derivative is descending, not sure yet
[12:13:15] (PS1) Gergő Tisza: Remove centralauth-autoaccount right [mediawiki-config] - https://gerrit.wikimedia.org/r/291720
[12:13:22] Operations, netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#2338531 (faidon)
[12:17:30] Operations, netops: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2338552 (faidon) This was down again for 48 hours with the same symptoms. I raised it again with Zayo, which got assigned the case TTN-0001073020. They dispatched a tech at both 2323 Bryan a...
[12:20:16] (PS4) Faidon Liambotis: Create raid module to hold RAID monitoring checks [puppet] - https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050)
[12:20:19] (PS9) Faidon Liambotis: raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050)
[12:20:20] (PS4) Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998)
[12:20:22] (PS4) Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - https://gerrit.wikimedia.org/r/291012
[12:20:25] (PS4) Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050)
[12:20:27] (PS4) Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - https://gerrit.wikimedia.org/r/291011
[12:20:29] (PS4) Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050)
[12:20:32] (just rebasing)
[12:23:58] (CR) Giuseppe Lavagetto: [C: -1] "this would work on the first run, then continue adding elevator=$ioscheduler on subsequent runs" [puppet] - https://gerrit.wikimedia.org/r/291706 (owner: Giuseppe Lavagetto)
[12:24:43] <_joe_> jynus: ^^ we can't apparently select based on regexes
[12:25:07] yes, I saw the issue
[12:25:13] same with the exec
[12:25:15] later
[12:25:44] unless I misread, if the config changes, it will add two values
[12:26:44] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6209474 keys - replication_delay is 0
[12:26:49] unless => "grep -q '^GRUB_CMDLINE_LINUX=.*elevator=${ioscheduler}' /etc/default/grub",
[12:28:20] (CR) Faidon Liambotis: [C: 2] Create raid module to hold RAID monitoring checks [puppet] - https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis)
[12:28:28] (CR) Faidon Liambotis: [C: 2] raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis)
[12:30:16] faidon, let me help testing that
[12:30:45] hmm, found a bug already
[12:30:46] interesting
[12:31:23] fucking puppet
[12:31:28] ?
[12:31:28] stringify facts stupidity
[12:34:05] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:13] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:13] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:34] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:34] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:44] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:54] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:56] transient, I think
[12:35:08] !log re-enabling puppet on elasticsearch codfw cluster (T133125)
[12:35:09] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[12:35:14] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:35:14] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: puppet fail
[12:35:23] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:35:24] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:35:34] faidon, puppet facts find db2018.codfw.wmnet --render-as yaml | grep raid -> raid: megaraid
[12:35:44] jynus: "facter --puppet"
[12:35:53] and I know, I'm looking at the all hosts view
[12:36:03] ok, ok
[12:36:05] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[12:36:26] ACKNOWLEDGEMENT - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] Gehel Upgrade in progress, low traffic, so 95th percentile not significant at the moment
[12:36:54] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:37:03] PROBLEM - puppet last run on mw2072 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:37:47] I see it now: raid: "[\x22hpsa\x22]"
[12:37:48] (PS1) Faidon Liambotis: raid: always stringify the raid fact [puppet] - https://gerrit.wikimedia.org/r/291726
[12:38:03] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[12:39:13] (CR) Faidon Liambotis: [C: 2] raid: always stringify the raid fact [puppet] - https://gerrit.wikimedia.org/r/291726 (owner: Faidon Liambotis)
[12:40:43] now need to wait for another half an hour :)
[12:40:47] (CR) jenkins-bot: [V: -1] raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - https://gerrit.wikimedia.org/r/291012 (owner: Faidon Liambotis)
[12:41:01] lol what the hell jenkins
[12:41:27] hashar: any idea why this change has been taking 18 minutes to be checked, consistently?
[12:41:57] now it says: "raid: hpsa"
[12:42:02] jynus: yeah
[12:42:23] the backstory is that facter 2.0.0 introduced structured facts, i.e. facts can return booleans, arrays, hashes etc.
[12:42:25] (CR) jenkins-bot: [V: -1] raid: setup multiple checks, one per each RAID found [puppet] - https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis)
[12:42:37] so leave it there for half an hour, then do a sanity check?
[12:42:48] puppet 3.7 can use that, but only if you tweak a setting
[12:42:50] for some reason...
[12:42:58] that setting is on by default in 4.0
[12:43:15] so I tried to play it smart and have my fact work with returning an array
[12:43:18] and it blew up on my face
[12:43:19] anyway
[12:43:44] jynus: yeah, leave it for half an hour, then check https://servermon.wikimedia.org/query
[12:45:25] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:53:01] !log Upgrading Zuul 1cc37f7..66c8e52 T128569
[12:53:02] T128569: Zuul deadlocks if unknown repo has activity in Gerrit - https://phabricator.wikimedia.org/T128569
[12:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:54:49] so some nodes have the mpt kernel module- not 100% they should
[12:54:52] *sure
[12:59:25] !log disabling warmers elasticsearch codfw cluster (T133125)
[12:59:26] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[12:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:34] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[13:01:03] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[13:01:14] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:23] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:33] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:34] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:45] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[13:01:53] RECOVERY - puppet last run on mw2072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:03] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:13] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:24] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:54] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[13:03:05] Operations, Analytics-Kanban, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2338675 (elukey) Created a grafana dashboard from Varnishkafka metrics: https://grafana.wikimedia.org/dashboard/db/varnishkafka
[13:20:37] http://p.defau.lt/?fKGznI_VPRMXNcM1AUon3A
[13:20:39] raid stats
[13:20:42] pretty impressive
[13:22:38] rdb1005/1006 have no RAID configured
[13:22:49] they have a /dev/sdb, which is not formatted at all
[13:22:49] /dev/sdb1 2048 976771071 976769024 465.8G 7 HPFS/NTFS/exFAT
[13:23:32] (CR) Alexandros Kosiaris: [C: 2] wikilabels: make file settings recursive [puppet] - https://gerrit.wikimedia.org/r/291572 (owner: Ladsgroup)
[13:23:37] _joe_: ^^
[13:23:55] (CR) Alexandros Kosiaris: "what other methods of deployment ?" [puppet] - https://gerrit.wikimedia.org/r/291527 (owner: Ladsgroup)
[13:24:27] (PS2) Alexandros Kosiaris: wikilabels: make file settings recursive [puppet] - https://gerrit.wikimedia.org/r/291572 (owner: Ladsgroup)
[13:24:31] back btw
[13:25:17] (CR) Alexandros Kosiaris: [V: 2] wikilabels: make file settings recursive [puppet] - https://gerrit.wikimedia.org/r/291572 (owner: Ladsgroup)
[13:25:46] ytterbium and antimony as well..
[13:26:17] and all the snapshot hosts
[13:26:46] and a few others
[13:26:57] jynus: I see a few databases/dbproxies on that list too
[13:27:02] (the "no RAID" list)
[13:27:08] which one paravoid ?
[13:27:19] I think these are old
[13:27:21] the dbs
[13:27:32] db1001, db1043, db1048, dbproxy1001, dbproxy1002
[13:29:53] paravoid: db1043 looks to have an hardware raid10
[13:30:09] oh interesting
[13:30:17] same for db1048
[13:30:30] thanks, I'll check those
[13:30:58] I'm manually triaging the list to see where my fact has missed stuff
[13:31:01] same for db1001
[13:31:12] the dbproxy I'm not familiar, let me take a quick look
[13:31:53] Warning: Could not load fact file /var/lib/puppet/lib/facter/raid.rb: ./raid.rb:37: undefined (?...) sequence: /^\s*\d+\s+(?\w+)/
[13:31:58] uh?
[13:32:02] oh god
[13:32:07] those 3 hosts have facter 1.7.5 on precise
[13:32:08] broken on precise's ruby
[13:32:09] yeah
[13:34:06] (CR) Alexandros Kosiaris: "recheck" [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[13:34:14] hashar: ping?
[13:34:17] for checking purposes after the fix dbproxy1001/2 have md, they are precise too
[13:34:24] thanks :)
[13:37:04] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:39:04] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[13:39:35] (CR) Alexandros Kosiaris: [C: -1] "Minor issue, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406) (owner: Ladsgroup)
[13:39:56] oh ffs
[13:40:03] FileTest.exist? is also broken on ruby 1.7
[13:40:12] 1.7 ?
[13:40:18] I assume typo, 1.8
[13:40:19] precise,
[13:40:22] ok
[13:40:23] nope!
[13:40:24] paravoid: looks like in ruby 1.8 you have to check the MatchData object
[13:40:41] volans: yeah, that part I fixed already..
[13:40:42] 1.7? the precise I'm looking at have 1.8.7
[13:41:17] er, right
[13:42:22] hrm, ok, that works
[13:42:29] and it should have the FileTest.exist?(filename)
[13:42:32] root@db1043:~# ruby raid.rb
[13:42:32] megaraid
[13:42:34] ok
[13:42:34] (PS2) Muehlenhoff: Enable firejail for image scaling [mediawiki-config] - https://gerrit.wikimedia.org/r/291202 (https://phabricator.wikimedia.org/T135111)
[13:42:35] yes
[13:42:40] great!
[13:44:58] (PS1) Faidon Liambotis: raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740
[13:45:01] I was going to send a proposed fix for mpt
[13:45:05] volans: want to review?
[13:45:07] jynus: what about it?
[13:45:18] sure
[13:45:51] there are hosts that have the mpt kernel module loaded (and so, some "files" are created), but no real "raid" device
[13:46:03] interesting!
[13:46:05] (CR) jenkins-bot: [V: -1] raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740 (owner: Faidon Liambotis)
[13:46:07] do you have an example?
[13:46:14] we should check /proc/scsi/mptsas/0 /proc/mpt/ioc0
[13:46:33] but not, e.g. mptctl or summary
[13:46:37] jenkins being super broken again?
[13:46:39] that are created by the module
[13:46:43] on load
[13:46:56] 2 examples
[13:47:05] db1019 and db1009
[13:47:13] they have a working megacli
[13:47:26] jenkins: Could not resolve host: gerrit.wikimedia.org
[13:47:37] but mpt-status -p fails with ioctl: No such device
[13:47:53] and Gem::RemoteFetcher::UnknownHostError: no such name (https://rubygems.org/gems/hiera-1.3.4.gem)
[13:48:05] so looks like DNS or network issues
[13:48:16] the precise hosts will disappear eventually
[13:48:35] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291740 (owner: Faidon Liambotis)
[13:48:37] but remember I have to do several failovers first
[13:48:48] some of which are blocked
[13:48:54] PROBLEM - RAID on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:04] PROBLEM - configured eth on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:04] ^mmmm
[13:49:12] crashed again?
[13:49:14] (PS2) Ladsgroup: Add ores-admins group [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406)
[13:49:20] PROBLEM - MariaDB disk space on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:20] PROBLEM - dhclient process on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:23] PROBLEM - Check size of conntrack table on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:34] no, not again
[13:49:40] PROBLEM - mysqld processes on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:46] is that a wish or a statement?
[13:49:55] PROBLEM - DPKG on es2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:50:04] PROBLEM - MariaDB Slave SQL: es3 on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:04] PROBLEM - puppet last run on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:04] PROBLEM - Disk space on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:09] so far a wish
[13:50:25] PROBLEM - salt-minion processes on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:25] PROBLEM - MariaDB Slave IO: es3 on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
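The "stringify facts stupidity" exchange at 12:31-12:43 boils down to this: facter 2.0 introduced structured facts, but puppet 3.x stringifies fact values by default, so a fact returning a Ruby array shows up mangled, the raid: "[\x22hpsa\x22]" value seen at 12:37. A minimal Ruby sketch of the "always stringify" approach behind change 291726; the detection probes here are illustrative assumptions, not the merged code:

```ruby
require 'facter'

Facter.add(:raid) do
  setcode do
    adapters = []
    # Illustrative probes only; the real fact's heuristics are not in this log.
    adapters << 'megaraid' if File.exist?('/dev/megaraid_sas_ioctl_node')
    adapters << 'md'       if File.exist?('/proc/mdstat')
    # The fix: join to a plain string. Returning the array itself is what the
    # stringified-facts default mangled into "[\x22hpsa\x22]".
    adapters.sort.join(',')
  end
end

# Standalone check, mirroring the "ruby raid.rb" run on db1043 above:
puts Facter.value(:raid) if __FILE__ == $0
```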
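On the Ruby 1.8 breakage: named capture groups like (?<name>...) are Ruby 1.9+ syntax, so on the precise hosts' Ruby 1.8.7 the regexp literal fails to parse and the whole fact file refuses to load, which is exactly the "undefined (?...) sequence" warning quoted at 13:31:53. The portable form uses positional groups and checks the returned MatchData, as noted at 13:40:24. A small sketch; the input line is made up:

```ruby
# Ruby 1.9+ named groups, a parse error on Ruby 1.8.7:
#   /^\s*\d+\s+(?<state>\w+)/
# 1.8-compatible version: positional capture, then check the MatchData object.
line = '  0 OPTIMAL'  # hypothetical controller-status line
if (md = /^\s*(\d+)\s+(\w+)/.match(line))
  puts "unit #{md[1]}: #{md[2]}"  # => unit 0: OPTIMAL
end
```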
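And a sketch of the mpt heuristic jynus outlines above: merely loading the mpt modules creates generic /proc entries (mptctl, summary) even on hosts with no controller, which is why mpt-status -p fails with "ioctl: No such device" on db1019/db1009. Detection should therefore key on a per-controller entry such as /proc/mpt/ioc0 or /proc/scsi/mptsas/0. Illustrative only, under those assumptions, and not the merged change 291743:

```ruby
# True only when a controller's IOC entry exists, not merely when the
# mpt modules are loaded (module load alone creates mptctl/summary).
def mpt_controller_present?
  File.exist?('/proc/mpt/ioc0') ||
    !Dir.glob('/proc/scsi/mptsas/[0-9]*').empty?
end

puts(mpt_controller_present? ? 'mpt' : 'no controller (module may still be loaded)')
```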
[13:50:40] still pings but no ssh (so far), checking console
[13:50:45] (CR) Muehlenhoff: [C: 2 V: 2] Enable firejail for image scaling [mediawiki-config] - https://gerrit.wikimedia.org/r/291202 (https://phabricator.wikimedia.org/T135111) (owner: Muehlenhoff)
[13:51:09] I can log in to mysql
[13:51:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures
[13:51:41] lag is growing
[13:51:52] but otherwise the host is functional
[13:52:03] (PS1) Faidon Liambotis: raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743
[13:52:12] jynus: ^
[13:52:17] at console I got the login, entered root and waiting for prompt of password...
[13:52:20] !log enable firejail on image scalers
[13:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:52:34] sorry, didn't realize the outage
[13:52:36] nevermind me
[13:52:46] [246461.498936] megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0.
[13:52:51] [246477.398591] megaraid_sas 0000:03:00.0: Init cmd success
[13:52:57] ?
[13:53:00] from console...
[13:53:04] RECOVERY - RAID on es2017 is OK: OK: optimal, 1 logical, 12 physical
[13:53:06] is that on es2017?
[13:53:12] yes on mgmt
[13:53:14] !log jmm@tin Synchronized wmf-config/CommonSettings.php: firejail security hardening for image scalers (duration: 00m 38s)
[13:53:14] RECOVERY - configured eth on es2017 is OK: OK - interfaces up
[13:53:17] (CR) jenkins-bot: [V: -1] raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743 (owner: Faidon Liambotis)
[13:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:53:23] now I can ssh
[13:53:30] RECOVERY - MariaDB disk space on es2017 is OK: DISK OK
[13:53:31] RECOVERY - dhclient process on es2017 is OK: PROCS OK: 0 processes with command name dhclient
[13:53:44] RECOVERY - Check size of conntrack table on es2017 is OK: OK: nf_conntrack is 0 % full
[13:53:51] RECOVERY - mysqld processes on es2017 is OK: PROCS OK: 1 process with command name mysqld
[13:53:58] [246235.851795] INFO: task jbd2/sda1-8:924 blocked for more than 120 seconds.
[13:54:05] RECOVERY - DPKG on es2017 is OK: All packages OK
[13:54:15] RECOVERY - MariaDB Slave SQL: es3 on es2017 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[13:54:15] RECOVERY - Disk space on es2017 is OK: DISK OK
[13:54:15] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[13:54:17] there are a bunch of call traces
[13:54:33] RECOVERY - salt-minion processes on es2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:54:34] RECOVERY - MariaDB Slave IO: es3 on es2017 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:54:34] let's get the 1) RAID log 2) ipmi
[13:55:04] looks like the controller so far
[13:55:04] [246477.456982] megaraid_sas 0000:03:00.0: 2270 (2s/0x0020/CRIT) - Controller encountered a fatal error and was reset
[13:55:10] wow
[13:55:14] (PS2) Faidon Liambotis: raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740
[13:55:16] (PS2) Faidon Liambotis: raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743
[13:55:29] gotta appreciate the irony of a RAID controller failing when we're chatting about RAID controllers
[13:55:44] paravoid, do not discard a direct causality
[13:55:46] talking about the devil? :D
[13:55:59] it doesn't point to it at all, but still
[13:56:00] (CR) jenkins-bot: [V: -1] network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[13:58:07] (PS3) Filippo Giunchedi: prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710
[13:58:32] this batch of new servers has had more issues than all of the other servers together
[13:59:46] "Correctable memory error rate exceeded for DIMM_A2." after replacing the memory
[14:00:44] "Disk 0 in Backplane 1 of Integrated RAID Controller 1 is inserted." at 2016-05-30T13:52:21-0500
[14:06:43] (CR) Faidon Liambotis: [C: 2] raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740 (owner: Faidon Liambotis)
[14:06:47] (CR) Faidon Liambotis: [C: 2] raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743 (owner: Faidon Liambotis)
[14:07:13] ok, let's wait another 30mins now :)
[14:07:31] !log rolling reboot of mc2* to Linux 4.4
[14:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:43] sorry I let you down instead of sending you that patch
[14:07:50] didn't let me down at all
[14:08:00] good catch
[14:08:32] I kept the all hosts facts output on a text file
[14:08:38] so I'll diff after these changes are in effect
[14:11:42] Operations, ops-codfw, DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2338829 (jcrespo) Resolved>Open es2017: `Correctable memory error rate exceeded for DIMM_A2.` just after booting for the first time after replacing the memory `Disk 0 in Backplane...
[14:11:51] ^I've reopened this [14:12:44] the job queue seems to be going back to normal now [14:13:53] jynus: thx I was kinda doing the same [14:15:04] (03PS1) 10Alexandros Kosiaris: fix a couple of puppetmaster failing tests [puppet] - 10https://gerrit.wikimedia.org/r/291747 [14:15:04] if you can paste more info about the RAID log or status there, you are welcome [14:15:52] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:19:31] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: Puppet has 1 failures [14:20:51] PROBLEM - IPsec on mc1001 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2001_v4 [14:21:13] PROBLEM - IPsec on mc1017 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2001_v4 [14:24:33] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:26:48] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add tools role [puppet] - 10https://gerrit.wikimedia.org/r/291710 (owner: 10Filippo Giunchedi) [14:27:23] (03CR) 10Alexandros Kosiaris: [C: 032] fix a couple of puppetmaster failing tests [puppet] - 10https://gerrit.wikimedia.org/r/291747 (owner: 10Alexandros Kosiaris) [14:30:09] (03PS4) 10Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - 10https://gerrit.wikimedia.org/r/291263 [14:30:11] (03PS26) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:30:13] (03PS6) 10Alexandros Kosiaris: network: Move into module [puppet] - 10https://gerrit.wikimedia.org/r/291234 [14:30:15] (03PS11) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [14:35:29] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:35:54] 06Operations, 10ops-codfw: Fauly RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2338860 (10MoritzMuehlenhoff) [14:36:08] 06Operations, 10ops-codfw: Faulty RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2338876 (10MoritzMuehlenhoff) [14:38:33] YuviPanda, Error: Could not retrieve catalog from remote server: Error 400 on SERVER: pick_initscript(): Wrong number of arguments given (6 for 5) at /etc/puppet/modules/base/manifests/service_unit.pp:82 on node deployment-changeprop.deployment-prep.eqiad.wmflabs [14:40:24] (03PS1) 10Alexandros Kosiaris: uwsgi: Remove uwsgi from service name [puppet] - 10https://gerrit.wikimedia.org/r/291751 [14:42:18] (03PS1) 10Ema: update-ocsp-all: write output to logfile [puppet] - 10https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835) [14:44:19] (03CR) 10Alexandros Kosiaris: "Finally pcc is happy, jenkins is happy, I am happy with this change. reviews anyone ?" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:46:40] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:16] (03CR) 10Mholloway: "Just added @Hashar since I'm not sure he ever saw this...
:)" [puppet] - 10https://gerrit.wikimedia.org/r/264303 (owner: 10Niedzielski) [14:51:39] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2338880 (10Volans) As a confirmation that I/O was stuck, rom dmesg after a bunch of call traces we got: ``` [246461.498936] megaraid_sas 0000:03:00.0: pending commands remain after waiting, wi... [14:52:14] jouncebot, next [14:52:14] In 0 hour(s) and 7 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T1500) [14:56:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6204417 keys - replication_delay is 626 [14:57:49] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:58:33] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2338894 (10fgiunchedi) I've uploaded python-statsd and pexif to jessie-backports, they should appear in the next few days [15:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T1500). [15:00:04] Urbanecm dcausse: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:39] o/ [15:00:55] I'm around. [15:02:52] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Language-setup: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2338897 (10Danny_B) [15:04:41] (03CR) 10EBernhardson: [C: 032] Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: 10EBernhardson) [15:04:55] (03PS3) 10EBernhardson: Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) [15:05:01] (03CR) 10EBernhardson: [C: 032] Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: 10EBernhardson) [15:05:03] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2267027 (10fgiunchedi) yanc uploaded too, when that is approved we can also go ahead with preggy -> pyvows -> derpconf [15:05:59] (03Merged) 10jenkins-bot: Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: 10EBernhardson) [15:06:42] dcausse: merged [15:07:05] ebernhardson: no particular order for this one? 
[15:07:30] dcausse: doesn't matter i think [15:08:02] you could probably `scap sync-dir wmf-config ...` and would be fine [15:08:23] ok [15:10:19] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [15:10:35] !log dcausse@tin Synchronized wmf-config: Send wmf.4 search and ttmserver traffic to codfw (duration: 00m 33s) [15:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:51] Urbanecm: around? [15:12:56] yep [15:13:12] I can swat for you [15:13:23] I just need ebernhardson to +2 your patch :) [15:14:55] sure [15:15:04] (03PS3) 10EBernhardson: Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:15:10] (03CR) 10EBernhardson: [C: 032] Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:15:52] (03Merged) 10jenkins-bot: Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:17:17] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: Changetags should be granted only to sysops and bots in ruwiki (duration: 00m 26s) [15:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:31] Urbanecm: please check if you can ^ [15:18:41] It seems that it's ok. Thanks. [15:18:49] Urbanecm: thanks! [15:22:16] Krenair: I see that you added a patch, can I help? [15:23:18] hey [15:23:19] yes [15:24:18] Krenair: I can deploy but unfortunately I can't +2 on wmf-config ... mind +2ing ?
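For reference, the `scap sync-dir wmf-config ...` suggested earlier in this SWAT window expands to a single command on the deployment host; the message argument is what ends up in the "Synchronized wmf-config: ..." SAL entries above. A sketch with an illustrative message:

```
# Run from tin: syncs the whole wmf-config directory to the cluster and
# records the (illustrative) message in the Server Admin Log:
scap sync-dir wmf-config 'Send wmf.4 search and ttmserver traffic to codfw'
```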
[15:24:56] (03PS52) 10Alexandros Kosiaris: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [15:26:07] oh, need to make a quick change [15:26:26] sure [15:26:30] (03PS2) 10Alex Monk: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) [15:26:58] (03CR) 10Alexandros Kosiaris: [C: 032] dynamicproxy: make invisible-unicorn.py python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/291562 (owner: 10Ladsgroup) [15:27:03] (03PS2) 10Alexandros Kosiaris: dynamicproxy: make invisible-unicorn.py python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/291562 (owner: 10Ladsgroup) [15:27:10] (03CR) 10Alexandros Kosiaris: [V: 032] dynamicproxy: make invisible-unicorn.py python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/291562 (owner: 10Ladsgroup) [15:28:37] (03CR) 10Alex Monk: [C: 032] Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [15:33:22] Krenair: you need to rebase I think [15:34:37] (03PS3) 10Alex Monk: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) [15:34:55] (03CR) 10Alex Monk: [C: 032] Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [15:35:40] (03Merged) 10jenkins-bot: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [15:36:39] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10faidon) [15:36:59] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: Make VE RB URLs domain-relative (duration: 00m 26s) [15:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:13] Krenair: check please ^ [15:38:07] hm, it doesn't seem to have had any effect [15:38:20] hmm... let me check with eval [15:39:03] it's fine with eval [15:39:05] maybe RL caching [15:39:19] should I do something? 
[15:40:32] confirmed it's RL caching [15:40:45] ok, thanks for checking [15:40:49] https://en.wikipedia.org/w/load.php?debug=false&lang=en-gb&modules=startup&only=scripts&skin=vector - old version, then you set debug=true and you get the new one [15:42:19] now it works [15:42:31] (03CR) 10jenkins-bot: [V: 04-1] ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [15:42:39] ok [15:43:04] (03PS2) 10Alexandros Kosiaris: Introduce ores.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/277725 (https://phabricator.wikimedia.org/T124202) [15:49:46] (03CR) 10Filippo Giunchedi: "two cents" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:53:03] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338977 (10faidon) [16:00:20] ACKNOWLEDGEMENT - Host mc2001 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T136558 [16:13:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6201184 keys - replication_delay is 0 [16:16:28] (03PS2) 10Giuseppe Lavagetto: base::grub: fix the ioscheduler setting [puppet] - 10https://gerrit.wikimedia.org/r/291706 [16:16:30] (03PS2) 10Giuseppe Lavagetto: base::grub: actually use augeas on jessie [puppet] - 10https://gerrit.wikimedia.org/r/291707 [16:16:45] <_joe_> jynus, paravoid I finally did it I think [16:17:28] <_joe_> I found where the true augeas docs are located :P [16:17:51] <_joe_> specifically https://github.com/hercules-team/augeas/wiki/Path-expressions [16:19:05] let me add that to wikitech [16:20:30] (03CR) 10Giuseppe Lavagetto: [C: 031] "this new version actually does the right thing with augeas, but not with grep/sed (which are untouched)." [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [16:21:29] PROBLEM - DPKG on mw1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:21:43] (03PS1) 10Muehlenhoff: Stop using package->latest in gerrit module [puppet] - 10https://gerrit.wikimedia.org/r/291762 (https://phabricator.wikimedia.org/T115348) [16:23:29] RECOVERY - DPKG on mw1020 is OK: All packages OK [16:29:04] 06Operations, 06Analytics-Kanban: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2339050 (10mforns) [16:29:40] (03PS1) 10Muehlenhoff: Stop using package->latest in ganglia monitor [puppet] - 10https://gerrit.wikimedia.org/r/291764 (https://phabricator.wikimedia.org/T115384) [16:29:55] 06Operations, 06Analytics-Kanban: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333507 (10mforns) @elukey Can you clarify what is the action to do in this task? Thanks! [16:30:58] moritzm: if you're removing all the ensure latests, feel free to skip the RAID one, cf. Ia16b7ad8ad281640fe18fe77cb781d2480af54dc [16:31:25] aka https://gerrit.wikimedia.org/r/#/c/290999/ [16:32:30] ok! 
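To make the package->latest cleanup just discussed concrete: `present` installs a package once and then leaves it alone, while `latest` silently upgrades on any later agent run where the archive offers a newer candidate, which is exactly the unreviewed-change risk the audit in T115348 is weeding out. A minimal sketch, runnable with a local puppet apply; the package name is only an example:

```
# Install if missing, then never touch it again:
puppet apply --noop -e 'package { "mpt-status": ensure => present }'

# Install *and* upgrade whenever the archive moves; a new version rolls
# out fleet-wide on the next agent run without any review:
puppet apply --noop -e 'package { "mpt-status": ensure => latest }'
```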
[16:45:19] (03PS1) 10Mobrovac: Math: Enable MathML everywhere but private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291766 (https://phabricator.wikimedia.org/T131177) [16:49:28] 06Operations, 10Datasets-General-or-Unknown: investigate rsync between dcs with encryption - https://phabricator.wikimedia.org/T123560#2339089 (10ArielGlenn) [16:49:30] 06Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#2339088 (10ArielGlenn) [16:49:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T1700). Please do the needful. [17:01:09] SMalyshev: as far as I know, you should not be here today. So no deployment of WDQS. If there is anything to push, let me know and we'll find the time [17:02:06] (03CR) 10Faidon Liambotis: [C: 04-1] base::grub: fix the ioscheduler setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [17:04:28] <_joe_> heh I forgot to push the correction to the comment :P [17:04:50] <_joe_> y'all talking in my ears got me distracted [17:06:08] (03CR) 10Filippo Giunchedi: [C: 04-1] "looks like "labs-instances" hashes are used in url_downloader too, it'll need to be renamed to use labs realm" [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [17:07:20] (03PS5) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [17:07:22] (03PS5) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [17:07:24] (03PS5) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [17:07:26] (03PS5) 10Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 [17:07:28] (03PS5) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [17:07:56] paravoid: just rebase? [17:08:06] rebase & reorder [17:08:13] i'll merge the check-raid changes first [17:11:19] ok I'll take a look now [17:11:52] if jenkins wasn't completely bonkers these days [17:13:08] (03CR) 10Faidon Liambotis: [C: 032] raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [17:13:14] (03CR) 10Volans: "See inline comments."
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [17:13:23] whoops [17:13:37] and of course you're absolutely right [17:15:09] :) [17:15:44] gehel: greg-g: Needs a deployment asap for a regression with rollback functionality - https://gerrit.wikimedia.org/r/#/c/291768/ [17:15:52] (03PS1) 10Faidon Liambotis: raid: brown-paper bag fix on check-raid.py [puppet] - 10https://gerrit.wikimedia.org/r/291770 [17:16:26] (03PS6) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [17:16:38] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: brown-paper bag fix on check-raid.py [puppet] - 10https://gerrit.wikimedia.org/r/291770 (owner: 10Faidon Liambotis) [17:17:00] * Krinkle guesses the US holiday means greg isn't here [17:17:09] good guess :) [17:17:18] Krinkle: how can I help? [17:17:26] gehel: Are you deploying anything from tin? [17:17:47] volans: are you reviewing the rest too? [17:17:54] Krinkle: not at the moment, nothing to deploy for WDQS today [17:17:59] paravoid: yes [17:17:59] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:18:03] gehel: okay, I'm taking the slot then :) [17:18:03] k, I'll wait [17:18:10] Krinkle, you know it's also a UK bank holiday [17:18:15] Krinkle: go ahead and good luck! [17:18:17] I don't work at a bank [17:18:26] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291012 (owner: 10Faidon Liambotis) [17:19:17] I'm slightly worried that jenkins consistently takes 18 minutes to check that change and then fails [17:20:20] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6207945 keys - replication_delay is 614 [17:22:28] (03PS3) 10Giuseppe Lavagetto: base::grub: fix the ioscheduler setting [puppet] - 10https://gerrit.wikimedia.org/r/291706 [17:22:30] (03PS3) 10Giuseppe Lavagetto: base::grub: actually use augeas on jessie [puppet] - 10https://gerrit.wikimedia.org/r/291707 [17:22:32] (03PS1) 10Giuseppe Lavagetto: base::grub: allow enabling the memory cgroup controller [puppet] - 10https://gerrit.wikimedia.org/r/291772 [17:23:28] (03CR) 10Faidon Liambotis: [C: 032] raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 (owner: 10Faidon Liambotis) [17:23:51] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2339183 (10jcrespo) a:05RobH>03elukey @elukey will have a detailed look at this this week. Please reassign it to m... [17:24:12] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2339186 (10jcrespo) 05stalled>03Open [17:24:44] (03CR) 10Giuseppe Lavagetto: base::grub: fix the ioscheduler setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [17:24:53] (03CR) 10Volans: "If I understand it correctly require_package() does an ensure => present while before we were doing an ensure => latest."
[puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [17:27:08] !log krinkle@tin Synchronized php-1.28.0-wmf.3/includes/api/ApiQueryRevisions.php: T136375 (duration: 00m 52s) [17:27:09] T136375: Rollback T88044 (broke rollback-related utilities) - https://phabricator.wikimedia.org/T136375 [17:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:16] (03CR) 10Krinkle: [C: 031] Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 (owner: 10Ori.livneh) [17:29:06] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2339200 (10jcrespo) 05stalled>03Open a:05RobH>03jcrespo This was approved today on the operations meeting, and I personally will be fo... [17:29:15] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2339220 (10jcrespo) a:05Ladsgroup>03jcrespo [17:34:04] (03PS2) 10Ori.livneh: Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 [17:34:26] (03CR) 10Ori.livneh: [C: 032] Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 (owner: 10Ori.livneh) [17:35:15] (03Merged) 10jenkins-bot: Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 (owner: 10Ori.livneh) [17:40:18] ori: Previously, the CDB from mediawiki-vendor was unexposed because config loads it first. [17:40:28] (03CR) 10Volans: [C: 04-1] "Leftover of the cleaning" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [17:40:45] Krinkle: right, it was a mess. [17:40:54] it's lazy loaded and since class_exists returns true it will never ask MediaWiki's autoloader [17:41:14] and since php's autoloader extension design is function-based (not registry based) it means it also doesn't conflict [17:41:15] !log Synced composer.{json,lock} and multiversion for I5ac86f190b [17:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:16] (03CR) 10Volans: raid: add monitoring for HP controllers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [17:59:23] I'd like to scap a bug fix for wikidata [17:59:39] ori, Krinkle: are you done?
[17:59:43] yes [17:59:53] Yes [18:01:40] (03PS6) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [18:01:42] (03PS6) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [18:01:44] (03PS6) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [18:03:37] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:04:09] (03PS7) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [18:04:14] last one hopefully! [18:04:17] sorry volans :) [18:04:17] paravoid: did you see my comment about require_package()? [18:04:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6195029 keys - replication_delay is 0 [18:04:24] no prob :) [18:04:52] (03CR) 10Faidon Liambotis: [C: 032] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:05:41] (03CR) 10Faidon Liambotis: [V: 032] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:07:21] I'm salt rm'ing /usr/local/bin/check-raid.py in the meantime [18:08:54] ok, see also my last question above [18:09:03] oh, yes [18:09:07] yes, that's intended [18:09:10] ensure => latest is evil [18:09:49] yeah, can do bad things [18:09:57] just wanted to check [18:10:11] nod [18:10:32] and about 291014 I think you can avoid some lines [18:10:48] but if you want to keep it the same as the Debian one, that's fine too [18:11:26] yeah, I'd like that [18:13:37] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [18:14:06] !log jzerebecki@tin Synchronized php-1.28.0-wmf.3/extensions/Wikidata/vendor/wikibase/javascript-api/src/getLocationAgnosticMwApi.js: Wikidata WikibaseJavaScriptApi: Fix getLocationAgnosticMwApi behavior in Internet Explorer b6ae82c71af3d9361cfb9e8d4e6e45bcd5ee9b26 1 of 2 T136543 (duration: 00m 26s) [18:14:07] T136543: [Bug] unable to edit in IE - https://phabricator.wikimedia.org/T136543 [18:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:47] !log jzerebecki@tin Synchronized php-1.28.0-wmf.3/extensions/Wikidata/vendor/wikibase/javascript-api/WikibaseJavaScriptApi.php: Wikidata WikibaseJavaScriptApi: Fix getLocationAgnosticMwApi behavior in Internet Explorer b6ae82c71af3d9361cfb9e8d4e6e45bcd5ee9b26 2 of 2 T136543 (duration: 00m 24s) [18:15:48] T136543: [Bug] unable to edit in IE - https://phabricator.wikimedia.org/T136543 [18:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6195383 keys - replication_delay is 633 [18:18:15] done [18:24:44] paravoid: sorry I have been quite
busy [18:24:49] going to fix pplint-HEAD [18:29:05] (03CR) 10Alexandros Kosiaris: "scap::target is being declared via service::uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [18:33:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6193466 keys - replication_delay is 0 [18:47:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6193776 keys - replication_delay is 623 [18:50:06] !log mwscript deleteEqualMessages.php --wiki nvwiki (T45917) [18:50:07] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [18:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:55] (03PS7) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [18:53:04] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:53:59] jynus: thanks :) [19:05:25] PROBLEM - MPT RAID on ms-fe1004 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:05:33] hmm? [19:05:36] PROBLEM - MD RAID on dbproxy1008 is CRITICAL: NRPE: Command check_raid_md not defined [19:05:56] PROBLEM - MPT RAID on dbproxy1008 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:06:26] PROBLEM - MD RAID on mw1260 is CRITICAL: NRPE: Command check_raid_md not defined [19:06:36] PROBLEM - MPT RAID on mw1260 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:07:17] PROBLEM - MD RAID on silver is CRITICAL: NRPE: Command check_raid_md not defined [19:07:26] PROBLEM - MD RAID on eventlog2001 is CRITICAL: NRPE: Command check_raid_md not defined [19:07:36] PROBLEM - MPT RAID on silver is CRITICAL: NRPE: Command check_raid_mpt not defined [19:07:46] PROBLEM - MPT RAID on eventlog2001 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:08:43] paravoid: :P [19:10:42] (03PS1) 10Faidon Liambotis: raid: fix circular dependency [puppet] - 10https://gerrit.wikimedia.org/r/291780 [19:11:04] hrm [19:11:24] that's quite odd [19:11:28] is it? [19:11:39] I think it's normal [19:11:54] require_packages creates an implicit package -> container class dependency [19:12:17] ah, right, and thus the before => Package['mpt-status'], creates a dependency [19:12:27] yeah [19:12:28] why does it need to be there before the package? [19:12:41] because the package sends an email upon installation otherwise [19:12:53] heh [19:12:59] annoying :) [19:14:05] ok, brb [19:14:07] dinner [19:14:09] bye
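The cycle just described is easy to reproduce standalone. require_package() (a helper in the WMF puppet tree) wraps the package in its own class and makes the calling class depend on it, so any resource inside the caller that declares before => Package[...] closes the loop. A sketch under those assumptions; require_package() itself is not available outside the repo, so a plain class require stands in for it:

```
# puppet apply aborts with "Found 1 dependency cycle" before changing anything:
puppet apply --noop -e '
  class packages { package { "mpt-status": ensure => present } }
  class raid {
    require packages                      # stand-in for require_package()
    file { "/etc/default/mpt-statusd":    # config wanted in place first,
      content => "RUN_DAEMON=no\n",       # so the postinst does not mail us
      before  => Package["mpt-status"],   # ...which completes the cycle
    }
  }
  include raid
'
```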
[19:22:22] (03CR) 10Ladsgroup: "Another thing: What about the worker nodes? the ores::worker seems to be not using the scap::target (since it doesn't use service::uwsgi) " [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:22:34] akosiaris: ^ [19:22:37] if you're around [19:29:20] (03PS1) 10Aklapper: Weekly Phabricator email: List archived projects with open tasks [puppet] - 10https://gerrit.wikimedia.org/r/291781 (https://phabricator.wikimedia.org/T133649) [19:30:09] (03PS1) 10Gergő Tisza: [HOLD] Enable AuthManager on beta wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291782 (https://phabricator.wikimedia.org/T135504) [19:31:21] (03CR) 10Aklapper: "Tested locally.
(I have no idea if this will be performant enough on the production instance.)" [puppet] - 10https://gerrit.wikimedia.org/r/291781 (https://phabricator.wikimedia.org/T133649) (owner: 10Aklapper) [19:34:05] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2339566 (10Ladsgroup) Thanks :) [19:39:31] (03PS1) 10Ppchelko: Change-Prop: White-list user-agent header in http filter [puppet] - 10https://gerrit.wikimedia.org/r/291784 [19:42:29] and back [19:42:31] both me and icinga-wm :) [19:43:23] (03PS7) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [19:44:13] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [19:44:35] (03CR) 10Alexandros Kosiaris: "they are gonna be on the same nodes for now so it shouldn't be a blocker for right now" [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:44:43] (03CR) 10Faidon Liambotis: [C: 031] "Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [19:44:49] (03CR) 10Faidon Liambotis: [C: 031] base::grub: actually use augeas on jessie [puppet] - 10https://gerrit.wikimedia.org/r/291707 (owner: 10Giuseppe Lavagetto) [19:45:47] (03CR) 10Faidon Liambotis: [C: 031] "Can we just call it "subnets"? :)" [puppet] - 10https://gerrit.wikimedia.org/r/291263 (owner: 10Alexandros Kosiaris) [19:47:04] (03PS12) 10Faidon Liambotis: network::constants: split off labs into its own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [19:47:34] (03CR) 10Faidon Liambotis: [C: 031] "Yes on the principle, modulo Filippo's concern." [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [19:48:33] (03CR) 10Faidon Liambotis: [C: 031] "I'd nitpick and say to call it "wmnet" (or something), but that could follow in a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/291234 (owner: 10Alexandros Kosiaris) [19:49:16] (03CR) 10Jforrester: "Put this in SWAT this afternoon?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [19:50:05] (03PS1) 10Alexandros Kosiaris: sca: remove cxserver-admin [puppet] - 10https://gerrit.wikimedia.org/r/291785 [19:53:21] (03PS53) 10Alexandros Kosiaris: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:54:30] (03CR) 10Faidon Liambotis: [C: 04-1] "Nice work :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:00:05] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T2000). Please do the needful. [20:02:20] yay [20:02:28] lots of HP warnings coming in [20:02:38] great [20:02:54] db1074 - WARNING: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, Controller, Battery/Capacitor - Not Configured: Cache [20:03:40] do we have a parser to read it?
:) [20:03:52] I think this means "no BBU configured" [20:04:49] Controller Status: OK [20:04:49] Cache Status: Not Configured [20:04:49] Battery/Capacitor Status: OK [20:04:53] that's db1074 [20:05:40] * volans looking [20:06:39] (03CR) 10Alexandros Kosiaris: "Regarding naming, I am open to anything. The function is indeed slicing arbitrary parts of network::constants hence the naming, hence the " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:06:40] Cache Status: Not Configured [20:06:40] Cache Ratio: 100% Read / 0% Write [20:06:50] yes, I think we are looking at the same commands [20:06:54] PROBLEM - HP RAID on db2034 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Cache, Battery/Capacitor [20:07:02] and a failed disk! [20:07:02] yay :) [20:09:25] Caching: Disabled on the logical drive too [20:09:40] (03PS1) 10Gehel: Keep osmosis osm_expire files for a month [puppet] - 10https://gerrit.wikimedia.org/r/291788 (https://phabricator.wikimedia.org/T136577) [20:11:01] paravoid: the pplint-HEAD taking ages to run is fixed / hacked :D https://phabricator.wikimedia.org/T133816 [20:11:03] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/291565 (owner: 10Ladsgroup) [20:11:23] (03CR) 10Gehel: [C: 032] Keep osmosis osm_expire files for a month [puppet] - 10https://gerrit.wikimedia.org/r/291788 (https://phabricator.wikimedia.org/T136577) (owner: 10Gehel) [20:11:26] hashar: <3 [20:11:57] paravoid: Tyler noticed that a few weeks ago but we had trouble understanding why it suddenly happened ... That will remain a mystery probably [20:16:49] (03PS1) 10Faidon Liambotis: raid: fix sudo rules for hpssacli (mostly for ms-be) [puppet] - 10https://gerrit.wikimedia.org/r/291791 [20:18:46] ok, re-ran puppet on neon, another batch of checks should soon appear [20:20:46] 121 HP checks in total [20:22:15] (03PS2) 10Faidon Liambotis: raid: fix sudo rules for hpssacli (mostly for ms-be) [puppet] - 10https://gerrit.wikimedia.org/r/291791 [20:22:21] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: fix sudo rules for hpssacli (mostly for ms-be) [puppet] - 10https://gerrit.wikimedia.org/r/291791 (owner: 10Faidon Liambotis) [20:23:08] PROBLEM - HP RAID on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:58] PROBLEM - HP RAID on lvs2006 is CRITICAL: CRITICAL: Slot 0: bad transfer speed: 1I:1:2(6.0Gbps) - OK: 1I:1:2, Controller, Cache, Battery/Capacitor - Failed: 1I:1:1 [20:24:08] PROBLEM - HP RAID on labvirt1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:11] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2339694 (10Volans) [20:24:28] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:37] ACKNOWLEDGEMENT - HP RAID on db2034 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Cache, Battery/Capacitor Volans https://phabricator.wikimedia.org/T136583 [20:24:38] PROBLEM - HP RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:47] PROBLEM - HP RAID on ms-be2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:57] hrm [20:25:17] PROBLEM - HP RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:25:25] do you need to re-run puppet on the hosts? [20:25:28] PROBLEM - HP RAID on ms-be2018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:29] I did [20:25:37] PROBLEM - HP RAID on ms-be2021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:37] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:37] PROBLEM - HP RAID on ms-be1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:48] PROBLEM - HP RAID on ms-be1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:57] PROBLEM - HP RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:04] it runs, it's just too many disks and it takes too long :( [20:26:09] real 0m10.699s [20:26:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6200053 keys - replication_delay is 711 [20:26:49] ok, then we can adjust the timeout for this check [20:27:42] yeah [20:27:47] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:28:02] (03CR) 10Alexandros Kosiaris: "addressed Filippo's concern in PS13" [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [20:28:22] any luck finding... if we have misconfigured BBUs all across the fleet? :) [20:28:26] (03PS5) 10Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - 10https://gerrit.wikimedia.org/r/291263 [20:28:28] (03PS27) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:28:30] (03PS13) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [20:28:32] (03PS1) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [20:28:51] db20xx seem to all be happy [20:28:58] (apart from 2034, obviously) [20:29:01] paravoid: I'm checking if by any chance the default for SSD disks is disabled [20:29:04] in the manual [20:29:08] PROBLEM - HP RAID on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:16] (03CR) 10Alexandros Kosiaris: "done. I 've renamed the hiera variable, the puppet variable needs some more refactoring, to be done in a later patch" [puppet] - 10https://gerrit.wikimedia.org/r/291263 (owner: 10Alexandros Kosiaris) [20:29:21] the others have spinning AFAIK [20:29:27] hmm [20:30:00] lvs2006 has a broken disk too, I'll open a task for it too [20:30:09] yeah, thanks! 
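The patch that follows is the timeout bump just discussed: check_nrpe gives up after 10 seconds by default, so a plugin that legitimately needs ~10.7s to poll all the disks on a box comes back as "Socket timeout" instead of a real status. A sketch of how to verify from the monitoring host; the hostname is only an example, paths are the Debian defaults, and -t must also stay under the server-side command timeout in nrpe.cfg:

```
# Default 10s client-side timeout; times out on the big swift backends:
time /usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet -c check_hpssacli

# With the doubled timeout, matching the change below:
/usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet -c check_hpssacli -t 20
```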
[20:31:04] (03PS1) 10Faidon Liambotis: raid: double NRPE timeout for check_hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/291820 [20:31:32] (03PS2) 10Faidon Liambotis: raid: double NRPE timeout for check_hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/291820 [20:32:24] http://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/Problem-with-configure-cache-Cache-Status-Not-Configured/td-p/5348173 [20:33:58] (03CR) 10Faidon Liambotis: [C: 032] raid: double NRPE timeout for check_hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/291820 (owner: 10Faidon Liambotis) [20:34:26] ACKNOWLEDGEMENT - HP RAID on lvs2006 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - bad transfer speed: 1I:1:2(6.0Gbps) - OK: 1I:1:2, Controller, Cache, Battery/Capacitor Volans https://phabricator.wikimedia.org/T136584 [20:34:28] 06Operations, 10ops-codfw: lvs2006 degraded RAID - https://phabricator.wikimedia.org/T136584#2339714 (10Volans) [20:37:01] paravoid: yes, I was looking at the same thing in the manual, although I haven't yet found the point where it says so explicitly [20:37:18] and we have LD Acceleration Method: HP SSD Smart Path on those, let me do a larger check with salt [20:44:36] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6185501 keys - replication_delay is 0 [20:48:08] paravoid: from what I've read so far SSD Smart Path should be better than traditional caching for SSDs in particular for reads. Of course only a benchmark with our specific workload could give us the final answer [20:48:25] looks like we have to patch the check to handle this case too [20:49:33] RECOVERY - HP RAID on labvirt1001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18, Controller, Cache, Battery/Capacitor [20:55:20] (03PS1) 10Gehel: Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) [20:55:22] PROBLEM - HP RAID on labvirt1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:45] (03CR) 10Halfak: [C: 04-1] "We need a good way to distinguish the ores-web (uwsgi) from ores-worker (celery)" [puppet] - 10https://gerrit.wikimedia.org/r/291751 (owner: 10Alexandros Kosiaris) [20:57:42] RECOVERY - HP RAID on labvirt1004 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18, Controller, Cache, Battery/Capacitor [21:03:47] (03PS2) 10Gehel: Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) [21:07:41] (03CR) 10Gehel: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/2993/" [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) (owner: 10Gehel) [21:08:12] (03PS3) 10Gehel: Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) [21:09:00] (03CR) 10Hashar: [C: 031] "Legacy / tech debt I guess. Today the security upgrades are managed by ops cluster wide, so they would notice and upgrade as needed."
[puppet] - 10https://gerrit.wikimedia.org/r/291762 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [21:09:55] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (10hashar) @Dzahn you might want to move the table to the task description so that anyone can amend it as needed :-) [21:10:09] (03CR) 10Gehel: [C: 032] Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) (owner: 10Gehel) [21:10:57] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2339786 (10hashar) [21:12:32] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (10hashar) [21:30:15] (03PS1) 10Faidon Liambotis: raid/hpssacli: don't barf on SATA + 6Gbps speed [puppet] - 10https://gerrit.wikimedia.org/r/291828 [21:30:17] (03PS1) 10Faidon Liambotis: raid/hpssacli: don't barf on HP SSD Smart Path configs [puppet] - 10https://gerrit.wikimedia.org/r/291829 [21:30:18] volans: if you're still here ^^^ [21:30:32] * volans looking [21:32:44] RECOVERY - HP RAID on labvirt1008 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18, Controller, Cache, Battery/Capacitor [21:33:48] paravoid: what I'm not sure about is if with this smart array it actually acts as a normal BBU for writes on RAID != 0 [21:34:39] because otherwise we need to be aware of it and decide at OS/application level different approaches (OS scheduler, application scheduler, etc...) that right now assume a BBU [21:34:50] what do you mean? [21:36:23] that if we have a DB with scheduler noop and mysql configured with IO_DIRECT and this smart thingy doesn't cache the writes in the BBU we are no longer protected if a crash happens [21:39:07] it's probably write-through in that case [21:39:10] that's the concept I think [21:39:24] in any case, that's going to be a configuration issue, not a health issue [21:39:32] and while we can alert on that too, that should probably be a separate thing [21:40:35] sure, makes sense [21:41:08] btw when did you change line 248 (old file) for the wrong speed, I don't see the diff in gerrit but it is updated :) [21:41:32] uh? [21:41:36] it's a separate diff [21:41:38] https://gerrit.wikimedia.org/r/#/c/291828/1/modules/raid/files/dsa-check-hpssacli [21:41:59] one depends on the other, I missed this one [21:42:00] thx [21:42:32] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291828 (owner: 10Faidon Liambotis) [21:42:45] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: don't barf on SATA + 6Gbps speed [puppet] - 10https://gerrit.wikimedia.org/r/291828 (owner: 10Faidon Liambotis) [21:43:16] ack to merge the other one too? [21:43:50] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291829 (owner: 10Faidon Liambotis) [21:44:07] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: don't barf on HP SSD Smart Path configs [puppet] - 10https://gerrit.wikimedia.org/r/291829 (owner: 10Faidon Liambotis) [21:44:53] FYI: http://h20195.www2.hp.com/v2/GetPDF.aspx/4AA4-8144ENW.pdf [21:47:57] looks like for writes on RAID!=0 it behaves like a normal controller...
but it's still not clear [21:50:03] RECOVERY - HP RAID on ms-be1021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:50:20] yay [21:50:23] RECOVERY - HP RAID on ms-be2018 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:50:24] RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:51:03] RECOVERY - HP RAID on ms-be2019 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:51:04] RECOVERY - HP RAID on ms-be1017 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [21:51:04] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [21:51:23] RECOVERY - HP RAID on ms-be1019 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:51:24] RECOVERY - HP RAID on ms-be1018 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [21:51:43] RECOVERY - HP RAID on ms-be2016 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:52:04] RECOVERY - HP RAID on ms-be2020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:52:13] RECOVERY - HP RAID on ms-be2017 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:52:50] clear [21:52:57] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2339917 (10Nuria) [21:53:00] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2339916 (10Nuria) 05Open>03Resolved [21:56:03] RECOVERY - HP RAID on ms-be1020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:59:52] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=hp+raid [22:00:04] all but the two you already ack'ed are OK :) [22:00:49] 06Operations, 10Monitoring, 13Patch-For-Review: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#2339938 (10faidon) [22:00:51] 06Operations, 10Monitoring, 13Patch-For-Review: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2339936 (10faidon) 05Open>03Resolved It took a while but this is finally done. We now have 123 RAID checks for HP systems. [22:01:03] yep!
all good [22:01:27] I've sent an email to jaime for the DB ones and the smart thing [22:01:38] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2339943 (10faidon) [22:01:40] 06Operations, 10DBA, 13Patch-For-Review: investigate RAID BBU auto-learn on db hosts - https://phabricator.wikimedia.org/T84178#2339944 (10faidon) [22:01:42] 06Operations, 10Monitoring, 13Patch-For-Review: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#922009 (10faidon) 05Open>03Resolved a:03faidon This is now all done :) [22:01:44] cool [22:01:56] I linked the two patches to DSA too, I'd like to see those merged upstream [22:02:13] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2111 MB (3% inode=96%) [22:03:47] paravoid: syslog filling up very quickly [22:04:27] 18G May 30 22:03 syslog, 21G the one from yesterday [22:04:29] (03CR) 10Yuvipanda: "^ was the reason I introduced this." [puppet] - 10https://gerrit.wikimedia.org/r/291751 (owner: 10Alexandros Kosiaris) [22:04:46] yeah [22:05:22] (03CR) 10Yuvipanda: "This will affect *all* uwsgi defined services, all of which will need a manual stopping-of-old-service and starting-of-new-service, along " [puppet] - 10https://gerrit.wikimedia.org/r/291751 (owner: 10Alexandros Kosiaris) [22:06:05] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [22:06:17] did you just delete the syslog.1? :D [22:06:33] yeah :) [22:06:36] whatever [22:06:38] it's just logs [22:06:51] access logs I mean [22:07:00] it should not log there, if there is any issue with those machines it's impossible to find it in syslog [22:07:31] I agree :) [22:08:14] looks like it started a few days ago [22:08:49] (03PS1) 10Jforrester: BetaFeatures: Bump dates, list departments, drop now-graduated Notifications one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291836 [22:09:00] syslog.5.gz 20MB, syslog.4.gz 240MB, syslog.3.gz 1.2GB, syslog.2.gz 2.2GB [22:09:04] (03CR) 10Yuvipanda: "This is awesome!"
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/291525 (owner: 10BryanDavis) [22:09:44] (03PS4) 10Yuvipanda: k8s: Ensure /etc/kubernetes is present wherever required [puppet] - 10https://gerrit.wikimedia.org/r/291617 [22:10:02] (03CR) 10Yuvipanda: [C: 032] k8s: Ensure /etc/kubernetes is present wherever required [puppet] - 10https://gerrit.wikimedia.org/r/291617 (owner: 10Yuvipanda) [22:10:19] (03CR) 10Yuvipanda: [V: 032] k8s: Ensure /etc/kubernetes is present wherever required [puppet] - 10https://gerrit.wikimedia.org/r/291617 (owner: 10Yuvipanda) [22:29:45] (03PS1) 10Yuvipanda: tools: Allow bastions to talk to flannel etcd [puppet] - 10https://gerrit.wikimedia.org/r/291841 (https://phabricator.wikimedia.org/T136413) [22:30:14] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Allow bastions to talk to flannel etcd [puppet] - 10https://gerrit.wikimedia.org/r/291841 (https://phabricator.wikimedia.org/T136413) (owner: 10Yuvipanda) [22:41:23] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 673 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6198818 keys - replication_delay is 673 [22:58:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6168860 keys - replication_delay is 0 [23:00:05] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T2300). Please do the needful. [23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:19] I'll do it [23:00:23] It's just one config patch [23:01:05] (03CR) 10Catrope: [C: 032] BetaFeatures: Bump dates, list departments, drop now-graduated Notifications one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291836 (owner: 10Jforrester) [23:01:48] (03Merged) 10jenkins-bot: BetaFeatures: Bump dates, list departments, drop now-graduated Notifications one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291836 (owner: 10Jforrester) [23:04:27] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Update BetaFeatures whitelist (duration: 00m 32s) [23:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
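Back to the ms-be2012 syslog flood from earlier in the evening: a quick way to confirm which tag is responsible, so the fix can land in that daemon's logging config rather than in /var/log. A sketch, assuming the stock "MMM dd HH:MM:SS host tag:" rsyslog line layout:

```
# Tally messages per syslog tag (field 5 in the default layout) to find
# the flooder:
awk '{print $5}' /var/log/syslog | sort | uniq -c | sort -rn | head

# And keep an eye on how fast the current file is growing:
ls -lh /var/log/syslog*
```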