[00:01:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[00:07:20] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:44:41] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
[01:04:49] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: puppet fail
[01:11:49] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[01:31:49] RECOVERY - puppet last run on maps-test2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:37:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 601 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6228166 keys - replication_delay is 601
[01:38:49] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:45:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6223765 keys - replication_delay is 0
[01:56:00] PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: puppet fail
[02:20:00] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 610 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6226079 keys - replication_delay is 610
[02:21:59] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6222531 keys - replication_delay is 0
[02:24:50] RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[02:24:54] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.3) (duration: 09m 04s)
[02:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:46] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 30 02:30:46 UTC 2016 (duration 5m 52s)
[02:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:21] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.196:9042 on restbase1008 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-06-01 02:30:53.
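A note on the Redis alert format above: "replication_delay is 601 600" appears to list the measured slave delay first and the critical threshold second (601 seconds against a 600-second limit), a reading consistent with every such alert in this log (610, 618, 626 and 682, each paired with 600). A minimal Ruby sketch of that comparison, inferred from the alert text alone rather than taken from the actual Icinga plugin:

```ruby
# Hypothetical reconstruction of the threshold logic behind the alerts above.
# Format inferred from the log: "replication_delay is <measured> <threshold>".
def replication_delay_status(measured, critical = 600)
  if measured > critical
    "CRITICAL: replication_delay is #{measured} #{critical}"
  else
    "OK: replication_delay is #{measured}"
  end
end

puts replication_delay_status(601)  # => CRITICAL: replication_delay is 601 600
puts replication_delay_status(0)    # => OK: replication_delay is 0
```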
[03:48:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 618 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6229343 keys - replication_delay is 618
[03:56:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6219247 keys - replication_delay is 0
[03:58:30] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[04:04:20] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[04:29:29] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:42:45] (PS1) Ori.livneh: Drop dependency on wikimedia/cdb [mediawiki-config] - https://gerrit.wikimedia.org/r/291681
[04:54:31] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[04:57:49] bd808 tried to undo the breakage by reinstalling with composer 1.0.x, but he did not revert erik's patch, so this left the busted ./composer/autoload_static.php in place
[04:58:15] i ran composer update with 1.1 and thought i'd be ok if i don't commit anything related to the composer update
[04:58:29] this added a line to autoload_static.php which caused it to be linted
[04:59:05] * bd808 feels a disturbance in the force
[04:59:30] do we have broken vendor for php 5.6+ again?
[04:59:48] not really, but things are a bit wonky
[05:00:10] when you reinstalled with 1.0.x, you did not remove autoload_static.php
[05:00:26] Really? that was wrong
[05:00:32] T135161 has the gory details
[05:00:32] 1.0.x does not generate a file by that name, so your reinstall left it as an orphan
[05:00:32] T135161: Composer v1.1.0 generated vendor dirs will fail lint by PHP <5.6 - https://phabricator.wikimedia.org/T135161
[05:00:47] ah crap. we need to kill it
[05:48:14] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:58:58] (CR) Mobrovac: "Oh, I see. Cool. But let's also remove tilerator/deploy from hieradata/common/role/deployment.yaml" [puppet] - https://gerrit.wikimedia.org/r/291268 (https://phabricator.wikimedia.org/T129146) (owner: Thcipriani)
[06:12:34] (CR) Mobrovac: Partially port RESTBaseUpdateJobs to change propagation. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[06:13:15] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:25:04] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:26:23] (PS1) Giuseppe Lavagetto: mediawiki::hhvm: debian jessie compatibility [puppet] - https://gerrit.wikimedia.org/r/291687 (https://phabricator.wikimedia.org/T131749)
[06:26:53] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.309 second response time
[06:29:24] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[06:30:44] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:23] (CR) Alexandros Kosiaris: "pcc at puppet-compiler.wmflabs.org/2979 is quite happy, making jenkins happy now" [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[06:35:24] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[06:35:30] (PS2) Giuseppe Lavagetto: mediawiki::hhvm: debian jessie compatibility [puppet] - https://gerrit.wikimedia.org/r/291687 (https://phabricator.wikimedia.org/T131749)
[06:38:41] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki::hhvm: debian jessie compatibility [puppet] - https://gerrit.wikimedia.org/r/291687 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[06:40:34] PROBLEM - Verify internal DNS from within Tools on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:40:38] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:40:43] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:41:54] hmm
[06:42:34] RECOVERY - Verify internal DNS from within Tools on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.032 second response time
[06:42:37] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.067 second response time
[06:42:37] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.271 second response time
[06:43:14] not sure how this got fixed
[06:43:39] <_joe_> not sure why we're getting flooded by these messages
[06:43:48] that too
[06:53:23] (PS3) Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - https://gerrit.wikimedia.org/r/291263
[06:53:25] (PS25) Alexandros Kosiaris: network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[06:53:27] (PS5) Alexandros Kosiaris: network: Move into module [puppet] - https://gerrit.wikimedia.org/r/291234
[06:53:29] (PS10) Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - https://gerrit.wikimedia.org/r/291219
[06:56:54] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[06:57:34] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:53] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[07:03:04] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[07:03:55] (PS1) Giuseppe Lavagetto: nutcracker: add an additional guard on the master version [puppet] - https://gerrit.wikimedia.org/r/291688 (https://phabricator.wikimedia.org/T131749)
[07:09:11] (CR) Giuseppe Lavagetto: [C: 2] nutcracker: add an additional guard on the master version [puppet] - https://gerrit.wikimedia.org/r/291688 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[07:09:21] (CR) Giuseppe Lavagetto: [V: 2] nutcracker: add an additional guard on the master version [puppet] - https://gerrit.wikimedia.org/r/291688 (https://phabricator.wikimedia.org/T131749) (owner: Giuseppe Lavagetto)
[07:09:37] <_joe_> 5 minutes and no jenkins-bot verified
[07:09:42] <_joe_> this is ridiculous
[07:12:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[07:26:43] Hello
[07:27:10] Alexz, what am I doing today?
[07:27:25] You
[07:27:47] _joe_: zuul is hung
[07:27:57] there is a job that has been running for 2 hrs
[07:27:59] https://integration.wikimedia.org/zuul/
[07:28:06] hashar: ^
[07:28:45] (CR) Alexandros Kosiaris: [C: 1] varnish: Fix PEP8 violations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291187 (owner: BryanDavis)
[07:29:01] (PS5) Alexandros Kosiaris: varnish: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291187 (owner: BryanDavis)
[07:29:08] (CR) Alexandros Kosiaris: [C: 2 V: 2] varnish: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291187 (owner: BryanDavis)
[07:30:29] (PS4) Alexandros Kosiaris: mailman: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291180 (owner: BryanDavis)
[07:30:51] (CR) Alexandros Kosiaris: [C: 2 V: 2] mailman: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291180 (owner: BryanDavis)
[07:34:39] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection timed out
[07:35:17] (CR) Alexandros Kosiaris: [C: 1] pybal: Fix PEP8 violations [puppet] - https://gerrit.wikimedia.org/r/291183 (owner: BryanDavis)
[07:35:19] PROBLEM - nutcracker port on mw1262 is CRITICAL: Timeout while attempting connection
[07:35:39] PROBLEM - nutcracker process on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:00] PROBLEM - puppet last run on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:19] PROBLEM - salt-minion processes on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:40] PROBLEM - Check size of conntrack table on mw1262 is CRITICAL: Timeout while attempting connection
[07:36:59] PROBLEM - DPKG on mw1262 is CRITICAL: Timeout while attempting connection
[07:37:18] PROBLEM - Disk space on mw1262 is CRITICAL: Timeout while attempting connection
[07:37:48] PROBLEM - RAID on mw1262 is CRITICAL: Timeout while attempting connection
[07:38:19] PROBLEM - configured eth on mw1262 is CRITICAL: Timeout while attempting connection
[07:38:38] PROBLEM - dhclient process on mw1262 is CRITICAL: Timeout while attempting connection
[07:38:39] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group
[07:39:50] anybody working on --^ ?
[07:42:01] ah probably one of the newer app severs
[07:42:05] *servers
[07:42:18] <_joe_> elukey: yes
[07:42:24] <_joe_> it's me :)
[07:42:39] <_joe_> not in lvs, not even in the scap sync file
[07:42:56] o/
[07:43:48] RECOVERY - salt-minion processes on mw1262 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:43:58] RECOVERY - configured eth on mw1262 is OK: OK - interfaces up
[07:43:59] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.009 second response time
[07:44:09] RECOVERY - Check size of conntrack table on mw1262 is OK: OK: nf_conntrack is 0 % full
[07:44:09] RECOVERY - dhclient process on mw1262 is OK: PROCS OK: 0 processes with command name dhclient
[07:44:39] RECOVERY - nutcracker port on mw1262 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[07:44:48] RECOVERY - Disk space on mw1262 is OK: DISK OK
[07:44:59] RECOVERY - nutcracker process on mw1262 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[07:45:18] RECOVERY - RAID on mw1262 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[07:46:28] RECOVERY - DPKG on mw1262 is OK: All packages OK
[07:47:34] (PS2) Volans: MariaDB: use 0/1 instead of off/on for read_only [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333)
[07:55:49] PROBLEM - zuul_gearman_service on gallium is CRITICAL: Connection refused
[07:56:30] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server
[07:58:00] <_joe_> is someone working on zuul?
[07:58:19] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[08:00:16] (CR) DCausse: [C: 1] Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: EBernhardson)
[08:01:42] Operations, ops-eqiad, Patch-For-Review: Rack and Set up new application servers mw1284-1306 - https://phabricator.wikimedia.org/T134309#2338122 (Joe) @Southparkfan apparently for some reason the same DNS record for mw1090 has been assigned to mw1305, which is still turned off for good for now. So w...
[08:02:17] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection refused
[08:02:47] <_joe_> I'll ack all alerts on mw1262
[08:04:26] RECOVERY - zuul_gearman_service on gallium is OK: TCP OK - 0.000 second response time on port 4730
[08:04:45] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[08:04:56] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 2 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-server
[08:06:54] _joe_: I was, I've been logging on #wikimedia-releng
[08:07:11] as the instructions on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues asked
[08:07:20] the alerts echo there, so i didn't notice them here
[08:07:30] <_joe_> ori: ok thanks for taking care of it :)
[08:07:45] it seems to be ok now
[08:08:31] I'm not sure why releng !logs on #wikimedia-releng, using a separate bot and a separate SAL
[08:09:37] if responsibilities were completely and hygienically separated, that'd be one thing, but this channel gets alerts for contint service failures
[08:10:16] IMO we're not so big and the main SAL is not so busy that a separate SAL is warranted
[08:10:20] we can just all log here
[08:10:23] <_joe_> ops: the kitchen sink where all tech debt gets turned on and off again
[08:10:43] <_joe_> (the inversion was intentional)
[08:10:54] (PS1) Gehel: Elasticsearch - configure bind_networks. [puppet] - https://gerrit.wikimedia.org/r/291689
[08:10:59] heh
[08:14:59] (PS2) Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201
[08:15:42] (CR) DCausse: [C: 1] Elasticsearch - configure bind_networks. [puppet] - https://gerrit.wikimedia.org/r/291689 (owner: Gehel)
[08:16:03] (CR) Volans: [C: 2] MariaDB: use 0/1 instead of off/on for read_only [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333) (owner: Volans)
[08:16:05] (CR) Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[08:17:03] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333) (owner: Volans)
[08:18:13] (CR) Gehel: [C: 2] Elasticsearch - configure bind_networks. [puppet] - https://gerrit.wikimedia.org/r/291689 (owner: Gehel)
[08:19:30] (PS3) Ppchelko: Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201
[08:27:41] !log starting elasticsearch upgrade on codfw (T133125)
[08:27:42] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[08:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:28:59] (PS3) Volans: MariaDB: use 0/1 instead of off/on for read_only [puppet] - https://gerrit.wikimedia.org/r/291299 (https://phabricator.wikimedia.org/T133333)
[08:33:37] (CR) DCausse: [C: 1] Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: DCausse)
[08:34:51] (CR) Gehel: [C: 2] Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: DCausse)
[08:36:32] (CR) Gehel: [V: 2] Upgrade cluster plugins for elasticsearch 2.3.3 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/290679 (https://phabricator.wikimedia.org/T134937) (owner: DCausse)
[08:44:34] !log Align thread_pool_max_threads to my.cnf value on 1 slave/shard in eqiad (db1065,db1076,db1078,db1040,db1026,db1061,db1039) T133333
[08:44:34] T133333: Audit MySQL configurations - https://phabricator.wikimedia.org/T133333
[08:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:53:49] Operations, Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2338184 (Aklapper) For the records, the following projects were changed from yellow tags to blue components lately: #Diamond, #Elasticsearch, #Icinga, #Shinken. (#Graphite, #LDAP, #P...
[08:55:21] Operations, Project-Admins, Traffic: Create #HTTP2 tag - https://phabricator.wikimedia.org/T134960#2338185 (Aklapper) Proposing to decline as per last two comments.
[08:56:01] Operations, MediaWiki-General-or-Unknown, HHVM, Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2338188 (Joe) What is left to do: [] Make mediawiki::cgroup work with systemd or change the way we manage cgroups there...
[08:58:02] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[09:01:03] hashar, do you have some time for T126699 ?
[09:01:04] T126699: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699
[09:01:45] I want to merge the puppet patch, but want you for CI testing
[09:04:12] PROBLEM - cassandra-c service on restbase1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[09:06:50] (PS1) Filippo Giunchedi: cassandra: add restbase100[89]-c to seeds [puppet] - https://gerrit.wikimedia.org/r/291692 (https://phabricator.wikimedia.org/T134016)
[09:09:02] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase100[89]-c to seeds [puppet] - https://gerrit.wikimedia.org/r/291692 (https://phabricator.wikimedia.org/T134016) (owner: Filippo Giunchedi)
[09:09:56] !log shutting down elasticsearch on codfw for upgrade (T133125)
[09:09:57] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[09:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:13:20] (CR) Mobrovac: [C: 1] Partially port RESTBaseUpdateJobs to change propagation. [puppet] - https://gerrit.wikimedia.org/r/291201 (owner: Ppchelko)
[09:15:58] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 341 bytes in 0.181 second response time
[09:16:23] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down!
[09:16:23] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down!
[09:16:26] <_joe_> uh?
[09:16:31] <_joe_> what the fuck is up?
[09:16:39] <_joe_> gehel: any idea?
[09:16:45] gehel is updating elastic in codfw
[09:16:46] codfw being restarted (upgrade)
[09:16:58] no user impact, I assume
[09:17:01] _joe_: damn, forgot the LVS check again
[09:17:01] no
[09:17:06] <_joe_> yeah, you might want to do it a bit slower maybe?
[09:17:08] ok, good to know
[09:17:18] <_joe_> I have no idea if that would help
[09:17:29] nope, we need to take the whole cluster down at once, 1.7 and 2.3 are not compatible
[09:17:41] yes full cluster restart :/
[09:17:58] which is not very HA-friendly :-)
[09:18:02] not at all :(
[09:18:56] as long as we have 2 separate clusters...
[09:20:48] ACKNOWLEDGEMENT - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down! Gehel shutting down elasticsearch on codfw for upgrade (T133125)
[09:20:54] ACKNOWLEDGEMENT - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2004.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down! Gehel shutting down elasticsearch on codfw for upgrade (T133125)
[09:21:47] godog, if you haven't seen it, there is a degraded array email for ms-be2012
[09:22:54] jynus: thanks, yeah I filed it as https://phabricator.wikimedia.org/T136395 but will reply to the email too
[09:23:06] oh, no need, sorry for pinging you
[09:24:47] (PS1) DCausse: Elastic: update mandatory plugins for codfw [puppet] - https://gerrit.wikimedia.org/r/291694
[09:24:48] no worries at all jynus, not sure if we can stop the emails once the array is degraded
[09:26:27] (CR) Gehel: [C: 2] Elastic: update mandatory plugins for codfw [puppet] - https://gerrit.wikimedia.org/r/291694 (owner: DCausse)
[09:35:46] Operations, DBA: dbtree shows 0 lag for db1047 - https://phabricator.wikimedia.org/T109401#2338289 (Volans) a: Volans
[09:36:31] Serious stuff...
[09:36:45] Serious stuff that this channel is not +t
[09:39:00] lots of Wikibase\Lib\Store\Sql\SqlEntityInfoBuilder::collectTermsForEntities hitting db1071
[09:40:45] (PS1) Volans: Exclude db1047 (multisource slave) from dbtree [software/dbtree] - https://gerrit.wikimedia.org/r/291696 (https://phabricator.wikimedia.org/T109401)
[09:43:52] (CR) Jcrespo: [C: 1] Exclude db1047 (multisource slave) from dbtree [software/dbtree] - https://gerrit.wikimedia.org/r/291696 (https://phabricator.wikimedia.org/T109401) (owner: Volans)
[09:49:50] (PS1) Elukey: Set Kafka default cleanup policy to 'delete' to avoid any compaction with 0.9 [puppet] - https://gerrit.wikimedia.org/r/291697
[09:50:19] Operations, media-storage, Tracking: refresh swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2338320 (fgiunchedi) 6x swift systems (all 3TB disks) have been ordered in T130713 and T136336, though we'll be keeping the old swift hw in place for the next 6/9 months a...
[09:50:51] Now I am the one that set the topic lol
[09:51:08] Operations, media-storage, Tracking: refresh swift hardware in codfw/eqiad (tracking) - https://phabricator.wikimedia.org/T130012#2338322 (fgiunchedi)
[09:51:39] (CR) Volans: [C: 2 V: 2] Exclude db1047 (multisource slave) from dbtree [software/dbtree] - https://gerrit.wikimedia.org/r/291696 (https://phabricator.wikimedia.org/T109401) (owner: Volans)
[09:52:17] jynus: do you know if additional steps are needed to deploy dbtree code? ^^^
[09:57:28] !log installing libidn security updates on jessie systems
[09:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:58:45] RECOVERY - cassandra-c service on restbase1008 is OK: OK - cassandra-c is active
[09:59:01] (PS1) DCausse: Elastic: add publish_host support [puppet] - https://gerrit.wikimedia.org/r/291698
[10:02:54] (PS2) Elukey: Set Kafka default cleanup policy to 'delete' to avoid any compaction with 0.9 [puppet] - https://gerrit.wikimedia.org/r/291697
[10:04:16] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2338334 (Ladsgroup) @RobH: Thanks for the response. What I need is access to these sudo actions: ``` 'ALL=(root) NOPASSWD: /usr...
[10:04:48] (CR) Elukey: "Puppet compiler looks good: https://puppet-compiler.wmflabs.org/2983" [puppet] - https://gerrit.wikimedia.org/r/291697 (owner: Elukey)
[10:07:04] (PS1) Jcrespo: Reduce db1071 load (regular connection exhaustion from jobs) [mediawiki-config] - https://gerrit.wikimedia.org/r/291703
[10:08:31] (CR) Gehel: [C: 2 V: 2] "Jenkins not reacting, change is trivial enough, so I'll v+2" [puppet] - https://gerrit.wikimedia.org/r/291698 (owner: DCausse)
[10:09:24] gehel: jenkins is not reacting because jenkins-bot was not subscribed to the change... something is wrong
[10:10:48] volans: I have to admit that I have no idea how this integration works...
[10:11:01] (CR) Jcrespo: [C: 2 V: 2] Reduce db1071 load (regular connection exhaustion from jobs) [mediawiki-config] - https://gerrit.wikimedia.org/r/291703 (owner: Jcrespo)
[10:12:34] yeah, hudson is down
[10:12:49] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reduce db1071 load (duration: 00m 48s)
[10:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:20:27] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:24:44] (PS1) Giuseppe Lavagetto: base::grub: fix the ioscheduler setting [puppet] - https://gerrit.wikimedia.org/r/291706
[10:24:46] (PS1) Giuseppe Lavagetto: base::grub: actually use augeas on jessie [puppet] - https://gerrit.wikimedia.org/r/291707
[10:24:52] <_joe_> paravoid: ^^
[10:25:26] * akosiaris_ at the hospital, won't be around for a bit
[10:25:41] :-(
[10:26:47] akosiaris: gah, take care
[10:30:03] going to fix up zuul
[10:30:09] !log Zuul deadlocked :(
[10:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:30:23] bah it died :(
[10:31:30] !log upgrading hhvm on mw1017 (also picking up updated versions of icu and lcms)
[10:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:32:29] _joe_, isn't that incorrect, too?
[10:32:51] <_joe_> jynus: I miss context
[10:32:56] <_joe_> what is incorrect?
[10:33:00] shouldn't we just do elevator=$ioscheduler where elevator=.*
[10:33:33] <_joe_> jynus: right
[10:33:42] <_joe_> jynus: although we don't want .*
[10:33:53] <_joe_> and I have no idea if regexes can be used in selectors
[10:34:07] yeah, the idea, I can help with the implementation
[10:35:36] !log Restarted Zuul.
[10:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:35:40] I am interested in this because I would like to try noop give the newest hardware
[10:36:03] *given
[10:36:21] (CR) Hashar: "check experimental" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:23] (CR) Paladox: "check experimental" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:25] (CR) Hashar: [C: 2] Make the builder script less simple [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:32] (CR) Hashar: Make the builder script less simple [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/291525 (owner: BryanDavis)
[10:37:37] Operations, DBA, Patch-For-Review: dbtree shows 0 lag for db1047 - https://phabricator.wikimedia.org/T109401#2338381 (Volans) Open>Resolved For multisource slaves the data in the tendril table `slave_status` is saved with the shard prefix (i.e. `s1.seconds_behind_master`) and is not found by...
[10:39:57] (CR) Ema: varnish: jemalloc tuning for frontend caches (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291592 (https://phabricator.wikimedia.org/T135384) (owner: BBlack)
[10:43:29] <_joe_> jynus: I'll run some tests
[10:46:18] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:51:08] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures
[10:52:57] (PS2) Ema: tlsproxy: trim indentation in localssl.erb [puppet] - https://gerrit.wikimedia.org/r/291253
[10:55:53] I was doing the same, it seems that augeas has some issues with the latest grubs
[10:57:05] I am going to ack elastic codfw errors, I cannot see a thing on icinga
[10:57:06] (CR) Ema: [C: 2 V: 2] tlsproxy: trim indentation in localssl.erb [puppet] - https://gerrit.wikimedia.org/r/291253 (owner: Ema)
[10:57:43] !log upgrading hhvm on remaining canaries (also picking up updated versions of icu and lcms)
[10:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:00:47] now we can see the important things, like etherpad
[11:05:38] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:07:35] (PS1) Filippo Giunchedi: prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710
[11:07:38] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[11:08:20] (CR) jenkins-bot: [V: -1] prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710 (owner: Filippo Giunchedi)
[11:11:10] (CR) Faidon Liambotis: [C: 1] "(wears brown paper bag)" [puppet] - https://gerrit.wikimedia.org/r/291707 (owner: Giuseppe Lavagetto)
[11:11:18] * godog shakes fist at jenkins
[11:11:20] (CR) Faidon Liambotis: [C: 1] base::grub: fix the ioscheduler setting [puppet] - https://gerrit.wikimedia.org/r/291706 (owner: Giuseppe Lavagetto)
[11:13:37] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 682 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6255588 keys - replication_delay is 682
[11:17:17] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:21:29] Operations, MediaWiki-Categories, HHVM: Broken sorting and multi-page categories for Cyrillic wikis - https://phabricator.wikimedia.org/T136281#2338413 (NickK) Thanks, I confirm that the problem is resolved.
[11:32:31] (PS9) Filippo Giunchedi: prometheus: add server support [puppet] - https://gerrit.wikimedia.org/r/280652 (https://phabricator.wikimedia.org/T126785)
[11:32:33] (PS3) Filippo Giunchedi: prometheus: add nginx reverse proxy [puppet] - https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785)
[11:32:35] (PS2) Filippo Giunchedi: prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710
[11:42:51] !log upgrading hhvm in codfw (also picking up updated versions of icu and lcms)
[11:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:55:33] PROBLEM - HHVM rendering on mw2149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:57:08] (CR) jenkins-bot: [V: -1] prometheus: add nginx reverse proxy [puppet] - https://gerrit.wikimedia.org/r/290479 (https://phabricator.wikimedia.org/T126785) (owner: Filippo Giunchedi)
[11:57:25] RECOVERY - HHVM rendering on mw2149 is OK: HTTP OK: HTTP/1.1 200 OK - 71761 bytes in 0.371 second response time
[11:58:09] Operations, ops-eqiad, Analytics-Kanban, DC-Ops: I/O issues for /dev/sdd on analytics1047.eqiad.wmnet - https://phabricator.wikimedia.org/T134056#2338489 (elukey) Open>Resolved
[11:59:10] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2338493 (Ladsgroup) Also what about adding to "deploy-service" group?
[12:01:24] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[12:02:20] gehel: ^
[12:02:41] paravoid: thanks, having a look right now
[12:03:26] alert is on eqiad, which has mostly no traffic at the moment, so 95th percentile is most probably not representative...
[12:03:50] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.174 second response time
[12:03:58] dcausse: ^ fyi
[12:03:58] (PS1) Ladsgroup: Add ores-admins group [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406)
[12:04:07] Operations, ops-esams, DC-Ops, netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#2338497 (faidon) Open>Resolved
[12:04:20] <_joe_> gehel: eqiad has no traffic?
[12:04:26] <_joe_> or codfw?
[12:04:32] codfw I suppose
[12:04:38] <_joe_> because codfw was down for most of the morning
[12:04:52] _joe_: my bad, codfw has no traffic and alert is for codfw
[12:05:44] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[12:06:12] (CR) Ladsgroup: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291565 (owner: Ladsgroup)
[12:06:13] Operations, ops-esams, DC-Ops: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#2338512 (faidon) PEM 2 is powered, but by the same PDU. PEM 3 is not powered and is also unplugged from the chassis, which downgrades the alarm from a Major (red) to a Minor (yellow). This will pro...
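gehel's point above, that a 95th percentile is "not representative" when a cluster serves almost no traffic, is easy to see numerically: with only a handful of samples, a single slow request becomes the p95 itself. A small illustrative Ruby sketch (nearest-rank percentile over made-up latencies; not the actual graphite check):

```ruby
# Nearest-rank percentile over request latencies in milliseconds (made up).
def percentile(values, pct)
  sorted = values.sort
  sorted[(pct / 100.0 * sorted.length).ceil - 1]
end

busy  = [120, 130, 140, 150, 160] * 20 + [1200]  # 101 samples, one slow outlier
quiet = [120, 130, 140, 150, 1200]               # 5 samples, same outlier

puts percentile(busy, 95)   # => 160  (the outlier is absorbed by volume)
puts percentile(quiet, 95)  # => 1200 (the outlier IS the p95; trips a [1000.0] threshold)
```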
[12:06:14] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy
[12:07:08] !log nginx restarted on elasticsearch codfw cluster (T133125)
[12:07:09] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[12:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:08:44] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail
[12:11:18] Operations, ops-esams, DC-Ops: Power cr2-esams PEM 2/PEM 3 - https://phabricator.wikimedia.org/T118166#2338519 (faidon) The Icinga [[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-esams&service=Juniper+alarms#comments | alert for the chassis alarm ]] has been acknowledged. T...
[12:11:50] job queue size is in an ascending pattern
[12:12:15] https://grafana-admin.wikimedia.org/dashboard/db/job-queue-health?from=1464523928505&to=1464610028505&var-jobType=all
[12:13:05] maybe the second derivative is descending, not sure yet
[12:13:15] (PS1) Gergő Tisza: Remove centralauth-autoaccount right [mediawiki-config] - https://gerrit.wikimedia.org/r/291720
[12:13:22] Operations, netops: turn-up/implement zayo wave (579171) for ulsfo-codfw - https://phabricator.wikimedia.org/T122885#2338531 (faidon)
[12:17:30] Operations, netops: codfw-eqiad Zayo link is down (cr2-codfw:xe-5/0/1) - https://phabricator.wikimedia.org/T134930#2338552 (faidon) This was down again for 48 hours with the same symptoms. I raised it again with Zayo, which got assigned the case TTN-0001073020. They dispatched a tech at both 2323 Bryan a...
[12:20:16] (PS4) Faidon Liambotis: Create raid module to hold RAID monitoring checks [puppet] - https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050)
[12:20:19] (PS9) Faidon Liambotis: raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050)
[12:20:20] (PS4) Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998)
[12:20:22] (PS4) Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - https://gerrit.wikimedia.org/r/291012
[12:20:25] (PS4) Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050)
[12:20:27] (PS4) Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - https://gerrit.wikimedia.org/r/291011
[12:20:29] (PS4) Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050)
[12:20:32] (just rebasing)
[12:23:58] (CR) Giuseppe Lavagetto: [C: -1] "this would work on the first run, then continue adding elevator=$ioscheduler on subsequent runs" [puppet] - https://gerrit.wikimedia.org/r/291706 (owner: Giuseppe Lavagetto)
[12:24:43] <_joe_> jynus: ^^ we can't apparently select based on regexes
[12:25:07] yes, I saw the issue
[12:25:13] same with the exec
[12:25:15] later
[12:25:44] unless I misread, if the config changes, it will add two values
[12:26:44] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6209474 keys - replication_delay is 0
[12:26:49] unless => "grep -q '^GRUB_CMDLINE_LINUX=.*elevator=${ioscheduler}' /etc/default/grub",
[12:28:20] (CR) Faidon Liambotis: [C: 2] Create raid module to hold RAID monitoring checks [puppet] - https://gerrit.wikimedia.org/r/290986 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis)
[12:28:28] (CR) Faidon Liambotis: [C: 2] raid: add a new "raid" fact [puppet] - https://gerrit.wikimedia.org/r/290988 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis)
[12:30:16] faidon, let me help testing that
[12:30:45] hmm, found a bug already
[12:30:46] interesting
[12:31:23] fucking puppet
[12:31:28] ?
[12:31:28] stringify facts stupidity
[12:34:05] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:13] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:13] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:34] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:34] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:44] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:54] PROBLEM - puppet last run on mc2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:34:56] transient, I think
[12:35:08] !log re-enabling puppet on elasticsearch codfw cluster (T133125)
[12:35:09] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[12:35:14] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:35:14] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: puppet fail
[12:35:23] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:35:24] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:35:34] faidon, puppet facts find db2018.codfw.wmnet --render-as yaml | grep raid -> raid: megaraid
[12:35:44] jynus: "facter --puppet"
[12:35:53] and I know, I'm looking at the all hosts view
[12:36:03] ok, ok
[12:36:05] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[12:36:26] ACKNOWLEDGEMENT - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] Gehel Upgrade in progress, low traffic, so 95th percentile not significant at the moment
[12:36:54] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:37:03] PROBLEM - puppet last run on mw2072 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:37:47] I see it now: raid: "[\x22hpsa\x22]"
[12:37:48] (PS1) Faidon Liambotis: raid: always stringify the raid fact [puppet] - https://gerrit.wikimedia.org/r/291726
[12:38:03] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures
[12:39:13] (CR) Faidon Liambotis: [C: 2] raid: always stringify the raid fact [puppet] - https://gerrit.wikimedia.org/r/291726 (owner: Faidon Liambotis)
[12:40:43] now need to wait for another half an hour :)
[12:40:47] (CR) jenkins-bot: [V: -1] raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - https://gerrit.wikimedia.org/r/291012 (owner: Faidon Liambotis)
[12:41:01] lol what the hell jenkins
[12:41:27] hashar: any idea why this change has been taking 18 minutes to be checked, consistently?
[12:41:57] now it says: "raid: hpsa"
[12:42:02] jynus: yeah
[12:42:23] the backstory is that facter 2.0.0 introduced structured facts, i.e. facts can return booleans, arrays, hashes etc.
[12:42:25] (CR) jenkins-bot: [V: -1] raid: setup multiple checks, one per each RAID found [puppet] - https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: Faidon Liambotis)
[12:42:37] so leave it there for half an hour, then do a sanity check?
[12:42:48] puppet 3.7 can use that, but only if you tweak a setting
[12:42:50] for some reason...
[12:42:58] that setting is on by default in 4.0
[12:43:15] so I tried to play it smart and have my fact work with returning an array
[12:43:18] and it blew up on my face
[12:43:19] anyway
[12:43:44] jynus: yeah, leave it for half an hour, then check https://servermon.wikimedia.org/query
[12:45:25] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:53:01] !log Upgrading Zuul 1cc37f7..66c8e52 T128569
[12:53:02] T128569: Zuul deadlocks if unknown repo has activity in Gerrit - https://phabricator.wikimedia.org/T128569
[12:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:54:49] so some nodes have the mpt kernel module- not 100% they should
[12:54:52] *sure
[12:59:25] !log disabling warmers elasticsearch codfw cluster (T133125)
[12:59:26] T133125: Upgrade codfw data centre to Elasticsearch 2.3 - https://phabricator.wikimedia.org/T133125
[12:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:59:34] RECOVERY - puppet last run on mc2013 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[13:01:03] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[13:01:14] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:23] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:33] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:34] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:01:45] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[13:01:53] RECOVERY - puppet last run on mw2072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:03] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:13] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:24] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:02:54] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[13:03:05] Operations, Analytics-Kanban, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2338675 (elukey) Created a grafana dashboard from Varnishkafka metrics: https://grafana.wikimedia.org/dashboard/db/varnishkafka
[13:20:37] http://p.defau.lt/?fKGznI_VPRMXNcM1AUon3A
[13:20:39] raid stats
[13:20:42] pretty impressive
[13:22:38] rdb1005/1006 have no RAID configured
[13:22:49] they have a /dev/sdb, which is not formatted at all
[13:22:49] /dev/sdb1 2048 976771071 976769024 465.8G 7 HPFS/NTFS/exFAT
[13:23:32] (CR) Alexandros Kosiaris: [C: 2] wikilabels: make file settings recursive [puppet] - https://gerrit.wikimedia.org/r/291572 (owner: Ladsgroup)
[13:23:37] _joe_: ^^
[13:23:55] (CR) Alexandros Kosiaris: "what other methods of deployment ?" [puppet] - https://gerrit.wikimedia.org/r/291527 (owner: Ladsgroup)
[13:24:27] (PS2) Alexandros Kosiaris: wikilabels: make file settings recursive [puppet] - https://gerrit.wikimedia.org/r/291572 (owner: Ladsgroup)
[13:24:31] back btw
[13:25:17] (CR) Alexandros Kosiaris: [V: 2] wikilabels: make file settings recursive [puppet] - https://gerrit.wikimedia.org/r/291572 (owner: Ladsgroup)
[13:25:46] ytterbium and antimony as well..
[13:26:17] and all the snapshot hosts
[13:26:46] and a few others
[13:26:57] jynus: I see a few databases/dbproxies on that list too
[13:27:02] (the "no RAID" list)
[13:27:08] which one paravoid ?
[13:27:19] I think these are old
[13:27:21] the dbs
[13:27:32] db1001, db1043, db1048, dbproxy1001, dbproxy1002
[13:29:53] paravoid: db1043 looks to have an hardware raid10
[13:30:09] oh interesting
[13:30:17] same for db1048
[13:30:30] thanks, I'll check those
[13:30:58] I'm manually triaging the list to see where my fact has missed stuff
[13:31:01] same for db1001
[13:31:12] the dbproxy I'm not familiar, let me take a quick look
[13:31:53] Warning: Could not load fact file /var/lib/puppet/lib/facter/raid.rb: ./raid.rb:37: undefined (?...) sequence: /^\s*\d+\s+(?\w+)/
[13:31:58] uh?
[13:32:02] oh god
[13:32:07] those 3 hosts have facter 1.7.5 on precise
[13:32:08] broken on precise's ruby
[13:32:09] yeah
[13:34:06] (CR) Alexandros Kosiaris: "recheck" [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[13:34:14] hashar: ping?
[13:34:17] for checking purposes after the fix dbproxy1001/2 have md, they are precise too
[13:34:24] thanks :)
[13:37:04] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[13:39:04] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[13:39:35] (CR) Alexandros Kosiaris: [C: -1] "Minor issue, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406) (owner: Ladsgroup)
[13:39:56] oh ffs
[13:40:03] FileTest.exist? is also broken on ruby 1.7
[13:40:12] 1.7 ?
[13:40:18] I assume typo, 1.8
[13:40:19] precise,
[13:40:22] ok
[13:40:23] nope!
[13:40:24] paravoid: looks like in ruby 1.8 you have to check the MatchData object
[13:40:41] volans: yeah, that part I fixed already..
[13:40:42] 1.7? the precise I'm looking at have 1.8.7
[13:41:17] er, right
[13:42:22] hrm, ok, that works
[13:42:29] and it should have the FileTest.exist?(filename)
[13:42:32] root@db1043:~# ruby raid.rb
[13:42:32] megaraid
[13:42:34] ok
[13:42:34] (PS2) Muehlenhoff: Enable firejail for image scaling [mediawiki-config] - https://gerrit.wikimedia.org/r/291202 (https://phabricator.wikimedia.org/T135111)
[13:42:35] yes
[13:42:40] great!
[13:44:58] (PS1) Faidon Liambotis: raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740
[13:45:01] I was going to send a proposed fix for mpt
[13:45:05] volans: want to review?
[13:45:07] jynus: what about it?
[13:45:18] sure
[13:45:51] there are hosts that have the mpt kernel module loaded (and so, some "files" are created), but no real "raid" device
[13:46:03] interesting!
[13:46:05] (CR) jenkins-bot: [V: -1] raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740 (owner: Faidon Liambotis)
[13:46:07] do you have an example?
[13:46:14] we should check /proc/scsi/mptsas/0 /proc/mpt/ioc0
[13:46:33] but not, e.g. mptctl or summary
[13:46:37] jenkins being super broken again?
[13:46:39] that are created by the module
[13:46:43] on load
[13:46:56] 2 examples
[13:47:05] db1019 and db1009
[13:47:13] they have a working megacli
[13:47:26] jenkins: Could not resolve host: gerrit.wikimedia.org
[13:47:37] but mpt-status -p fails with ioctl: No such device
[13:47:53] and Gem::RemoteFetcher::UnknownHostError: no such name (https://rubygems.org/gems/hiera-1.3.4.gem)
[13:48:05] so looks like DNS or network issues
[13:48:16] the precise hosts will disappear eventually
[13:48:35] (CR) Volans: "recheck" [puppet] - https://gerrit.wikimedia.org/r/291740 (owner: Faidon Liambotis)
[13:48:37] but remember I have to do several failovers first
[13:48:48] some of which are blocked
[13:48:54] PROBLEM - RAID on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:04] PROBLEM - configured eth on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:04] ^mmmm
[13:49:12] crashed again?
[13:49:14] (PS2) Ladsgroup: Add ores-admins group [puppet] - https://gerrit.wikimedia.org/r/291716 (https://phabricator.wikimedia.org/T136406)
[13:49:20] PROBLEM - MariaDB disk space on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:20] PROBLEM - dhclient process on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:23] PROBLEM - Check size of conntrack table on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:34] no, not again
[13:49:40] PROBLEM - mysqld processes on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:49:46] is that a wish or a statement?
[13:49:55] PROBLEM - DPKG on es2017 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[13:50:04] PROBLEM - MariaDB Slave SQL: es3 on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:04] PROBLEM - puppet last run on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:04] PROBLEM - Disk space on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:09] so far a wish
[13:50:25] PROBLEM - salt-minion processes on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:25] PROBLEM - MariaDB Slave IO: es3 on es2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
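The "stringify facts stupidity" exchange at 12:31-12:43 boils down to this: facter 2.0 introduced structured facts, but puppet 3.x stringifies fact values by default, so a fact returning a Ruby array shows up mangled, the raid: "[\x22hpsa\x22]" value seen at 12:37. A minimal Ruby sketch of the "always stringify" approach behind change 291726; the detection probes here are illustrative assumptions, not the merged code:

```ruby
require 'facter'

Facter.add(:raid) do
  setcode do
    adapters = []
    # Illustrative probes only; the real fact's heuristics are not in this log.
    adapters << 'megaraid' if File.exist?('/dev/megaraid_sas_ioctl_node')
    adapters << 'md'       if File.exist?('/proc/mdstat')
    # The fix: join to a plain string. Returning the array itself is what the
    # stringified-facts default mangled into "[\x22hpsa\x22]".
    adapters.sort.join(',')
  end
end

# Standalone check, mirroring the "ruby raid.rb" run on db1043 above:
puts Facter.value(:raid) if __FILE__ == $0
```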
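On the Ruby 1.8 breakage: named capture groups like (?<name>...) are Ruby 1.9+ syntax, so on the precise hosts' Ruby 1.8.7 the regexp literal fails to parse and the whole fact file refuses to load, which is exactly the "undefined (?...) sequence" warning quoted at 13:31:53. The portable form uses positional groups and checks the returned MatchData, as noted at 13:40:24. A small sketch; the input line is made up:

```ruby
# Ruby 1.9+ named groups, a parse error on Ruby 1.8.7:
#   /^\s*\d+\s+(?<state>\w+)/
# 1.8-compatible version: positional capture, then check the MatchData object.
line = '  0 OPTIMAL'  # hypothetical controller-status line
if (md = /^\s*(\d+)\s+(\w+)/.match(line))
  puts "unit #{md[1]}: #{md[2]}"  # => unit 0: OPTIMAL
end
```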
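And a sketch of the mpt heuristic jynus outlines above: merely loading the mpt modules creates generic /proc entries (mptctl, summary) even on hosts with no controller, which is why mpt-status -p fails with "ioctl: No such device" on db1019/db1009. Detection should therefore key on a per-controller entry such as /proc/mpt/ioc0 or /proc/scsi/mptsas/0. Illustrative only, under those assumptions, and not the merged change 291743:

```ruby
# True only when a controller's IOC entry exists, not merely when the
# mpt modules are loaded (module load alone creates mptctl/summary).
def mpt_controller_present?
  File.exist?('/proc/mpt/ioc0') ||
    !Dir.glob('/proc/scsi/mptsas/[0-9]*').empty?
end

puts(mpt_controller_present? ? 'mpt' : 'no controller (module may still be loaded)')
```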
[13:50:40] still pings but no ssh (so far), checking console
[13:50:45] (CR) Muehlenhoff: [C: 2 V: 2] Enable firejail for image scaling [mediawiki-config] - https://gerrit.wikimedia.org/r/291202 (https://phabricator.wikimedia.org/T135111) (owner: Muehlenhoff)
[13:51:09] I can log in to mysql
[13:51:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 4 failures
[13:51:41] lag is growing
[13:51:52] but otherwise the host is functional
[13:52:03] (PS1) Faidon Liambotis: raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743
[13:52:12] jynus: ^
[13:52:17] at console I got the login, entered root and waiting for prompt of password...
[13:52:20] !log enable firejail on image scalers
[13:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:52:34] sorry, didn't realize the outage
[13:52:36] nevermind me
[13:52:46] [246461.498936] megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0.
[13:52:51] [246477.398591] megaraid_sas 0000:03:00.0: Init cmd success
[13:52:57] ?
[13:53:00] from console...
[13:53:04] RECOVERY - RAID on es2017 is OK: OK: optimal, 1 logical, 12 physical
[13:53:06] is that on es2017?
[13:53:12] yes on mgmt
[13:53:14] !log jmm@tin Synchronized wmf-config/CommonSettings.php: firejail security hardening for image scalers (duration: 00m 38s)
[13:53:14] RECOVERY - configured eth on es2017 is OK: OK - interfaces up
[13:53:17] (CR) jenkins-bot: [V: -1] raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743 (owner: Faidon Liambotis)
[13:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:53:23] now I can ssh
[13:53:30] RECOVERY - MariaDB disk space on es2017 is OK: DISK OK
[13:53:31] RECOVERY - dhclient process on es2017 is OK: PROCS OK: 0 processes with command name dhclient
[13:53:44] RECOVERY - Check size of conntrack table on es2017 is OK: OK: nf_conntrack is 0 % full
[13:53:51] RECOVERY - mysqld processes on es2017 is OK: PROCS OK: 1 process with command name mysqld
[13:53:58] [246235.851795] INFO: task jbd2/sda1-8:924 blocked for more than 120 seconds.
[13:54:05] RECOVERY - DPKG on es2017 is OK: All packages OK
[13:54:15] RECOVERY - MariaDB Slave SQL: es3 on es2017 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[13:54:15] RECOVERY - Disk space on es2017 is OK: DISK OK
[13:54:15] RECOVERY - puppet last run on es2017 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[13:54:17] there are a bunch of call traces
[13:54:33] RECOVERY - salt-minion processes on es2017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:54:34] RECOVERY - MariaDB Slave IO: es3 on es2017 is OK: OK slave_io_state Slave_IO_Running: Yes
[13:54:34] let's get the 1) RAID log 2) ipmi
[13:55:04] looks like the controller so far
[13:55:04] [246477.456982] megaraid_sas 0000:03:00.0: 2270 (2s/0x0020/CRIT) - Controller encountered a fatal error and was reset
[13:55:10] wow
[13:55:14] (PS2) Faidon Liambotis: raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740
[13:55:16] (PS2) Faidon Liambotis: raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743
[13:55:29] gotta appreciate the irony of a RAID controller failing when we're chatting about RAID controllers
[13:55:44] paravoid, do not discard a direct causality
[13:55:46] talking about the devil? :D
[13:55:59] it doesn't point to it at all, but still
[13:56:00] (CR) jenkins-bot: [V: -1] network: add $production_networks [puppet] - https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: Faidon Liambotis)
[13:58:07] (PS3) Filippo Giunchedi: prometheus: add tools role [puppet] - https://gerrit.wikimedia.org/r/291710
[13:58:32] this batch of new servers has had more issues than all of the other servers together
[13:59:46] "Correctable memory error rate exceeded for DIMM_A2." after replacing the memory
[14:00:44] "Disk 0 in Backplane 1 of Integrated RAID Controller 1 is inserted." at 2016-05-30T13:52:21-0500
[14:06:43] (CR) Faidon Liambotis: [C: 2] raid: make the raid fact Ruby 1.8-compatible [puppet] - https://gerrit.wikimedia.org/r/291740 (owner: Faidon Liambotis)
[14:06:47] (CR) Faidon Liambotis: [C: 2] raid: adjust the mpt heuristic detection [puppet] - https://gerrit.wikimedia.org/r/291743 (owner: Faidon Liambotis)
[14:07:13] ok, let's wait another 30mins now :)
[14:07:31] !log rolling reboot of mc2* to Linux 4.4
[14:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:43] sorry I let you down instead of sending you that patch
[14:07:50] didn't let me down at all
[14:08:00] good catch
[14:08:32] I kept the all hosts facts output on a text file
[14:08:38] so I'll diff after these changes are in effect
[14:11:42] Operations, ops-codfw, DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2338829 (jcrespo) Resolved>Open es2017: `Correctable memory error rate exceeded for DIMM_A2.` just after booting for the first time after replacing the memory `Disk 0 in Backplane...
[14:11:51] ^I've reopened this [14:12:44] the job queue seems to be going back to normal now [14:13:53] jynus: thx I was kinda doing the same [14:15:04] (03PS1) 10Alexandros Kosiaris: fix a couple of puppetmaster failing tests [puppet] - 10https://gerrit.wikimedia.org/r/291747 [14:15:04] if you can paste more info about the RAID log or status there, you are welcome [14:15:52] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:19:31] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: Puppet has 1 failures [14:20:51] PROBLEM - IPsec on mc1001 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2001_v4 [14:21:13] PROBLEM - IPsec on mc1017 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2001_v4 [14:24:33] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:26:48] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add tools role [puppet] - 10https://gerrit.wikimedia.org/r/291710 (owner: 10Filippo Giunchedi) [14:27:23] (03CR) 10Alexandros Kosiaris: [C: 032] fix a couple of puppetmaster failing tests [puppet] - 10https://gerrit.wikimedia.org/r/291747 (owner: 10Alexandros Kosiaris) [14:30:09] (03PS4) 10Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - 10https://gerrit.wikimedia.org/r/291263 [14:30:11] (03PS26) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:30:13] (03PS6) 10Alexandros Kosiaris: network: Move into module [puppet] - 10https://gerrit.wikimedia.org/r/291234 [14:30:15] (03PS11) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [14:35:29] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [14:35:54] 06Operations, 10ops-codfw: Fauly RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2338860 (10MoritzMuehlenhoff) [14:36:08] 06Operations, 10ops-codfw: Faulty RAM on mc2001 - https://phabricator.wikimedia.org/T136558#2338876 (10MoritzMuehlenhoff) [14:38:33] YuviPanda, Error: Could not retrieve catalog from remote server: Error 400 on SERVER: pick_initscript(): Wrong number of arguments given (6 for 5) at /etc/puppet/modules/base/manifests/service_unit.pp:82 on node deployment-changeprop.deployment-prep.eqiad.wmflabs [14:40:24] (03PS1) 10Alexandros Kosiaris: uwsgi: Remove uwsgi from service name [puppet] - 10https://gerrit.wikimedia.org/r/291751 [14:42:18] (03PS1) 10Ema: update-ocsp-all: write output to logfile [puppet] - 10https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835) [14:44:19] (03CR) 10Alexandros Kosiaris: "Finally pcc is happy, jenkins is happy, I am happy with this change. reviews anyone ?" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [14:46:40] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:16] (03CR) 10Mholloway: "Just added @Hashar since I'm not sure he ever saw this...
:)" [puppet] - 10https://gerrit.wikimedia.org/r/264303 (owner: 10Niedzielski) [14:51:39] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2338880 (10Volans) As a confirmation that I/O was stuck, rom dmesg after a bunch of call traces we got: ``` [246461.498936] megaraid_sas 0000:03:00.0: pending commands remain after waiting, wi... [14:52:14] jouncebot, next [14:52:14] In 0 hour(s) and 7 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T1500) [14:56:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6204417 keys - replication_delay is 626 [14:57:49] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:58:33] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2338894 (10fgiunchedi) I've uploaded python-statsd and pexif to jessie-backports, they should appear in the next few days [15:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T1500). [15:00:04] Urbanecm dcausse: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:39] o/ [15:00:55] I'm around. [15:02:52] 06Operations, 10DNS, 10Traffic, 10Wikimedia-Language-setup: Create moon.wikimedia.org and redirect it to https://meta.wikimedia.org/wiki/Wikipedia_to_the_Moon - https://phabricator.wikimedia.org/T136557#2338897 (10Danny_B) [15:04:41] (03CR) 10EBernhardson: [C: 032] Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: 10EBernhardson) [15:04:55] (03PS3) 10EBernhardson: Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) [15:05:01] (03CR) 10EBernhardson: [C: 032] Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: 10EBernhardson) [15:05:03] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2267027 (10fgiunchedi) yanc uploaded too, when that is approved we can also go ahead with preggy -> pyvows -> derpconf [15:05:59] (03Merged) 10jenkins-bot: Send wmf.4 search and ttmserver traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291257 (https://phabricator.wikimedia.org/T133124) (owner: 10EBernhardson) [15:06:42] dcausse: merged [15:07:05] ebernhardson: no particular order for this one? 
[15:07:30] dcausse: doesn't matter i think [15:08:02] you could probably `scap sync-dir wmf-config ...` and would be fine [15:08:23] ok [15:10:19] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [15:10:35] !log dcausse@tin Synchronized wmf-config: Send wmf.4 search and ttmserver traffic to codfw (duration: 00m 33s) [15:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:51] Urbanecm: around? [15:12:56] yep [15:13:12] I can swat for you [15:13:23] I just need ebernhardson to +2 your patch :) [15:14:55] sure [15:15:04] (03PS3) 10EBernhardson: Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:15:10] (03CR) 10EBernhardson: [C: 032] Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:15:52] (03Merged) 10jenkins-bot: Changetags should be granted only to sysops and bots in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290680 (https://phabricator.wikimedia.org/T136187) (owner: 10Urbanecm) [15:17:17] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: Changetags should be granted only to sysops and bots in ruwiki (duration: 00m 26s) [15:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:31] Urbanecm: please check if you can ^ [15:18:41] It seems that it's ok. Thanks. [15:18:49] Urbanecm: thanks! [15:22:16] Krenair: I see that you added a patch, can I help? [15:23:18] hey [15:23:19] yes [15:24:18] Krenair: I can deploy but unfortunately I can't +2 on wmf-config ... mind +2ing ?
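For reference, the `scap sync-dir wmf-config ...` suggested earlier in this SWAT window expands to a single command on the deployment host; the message argument is what ends up in the "Synchronized wmf-config: ..." SAL entries above. A sketch with an illustrative message:

```
# Run from tin: syncs the whole wmf-config directory to the cluster and
# records the (illustrative) message in the Server Admin Log:
scap sync-dir wmf-config 'Send wmf.4 search and ttmserver traffic to codfw'
```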
[15:24:56] (03PS52) 10Alexandros Kosiaris: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [15:26:07] oh, need to make a quick change [15:26:26] sure [15:26:30] (03PS2) 10Alex Monk: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) [15:26:58] (03CR) 10Alexandros Kosiaris: [C: 032] dynamicproxy: make invisible-unicorn.py python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/291562 (owner: 10Ladsgroup) [15:27:03] (03PS2) 10Alexandros Kosiaris: dynamicproxy: make invisible-unicorn.py python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/291562 (owner: 10Ladsgroup) [15:27:10] (03CR) 10Alexandros Kosiaris: [V: 032] dynamicproxy: make invisible-unicorn.py python3 compatible [puppet] - 10https://gerrit.wikimedia.org/r/291562 (owner: 10Ladsgroup) [15:28:37] (03CR) 10Alex Monk: [C: 032] Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [15:33:22] Krenair: you need to rebase I think [15:34:37] (03PS3) 10Alex Monk: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) [15:34:55] (03CR) 10Alex Monk: [C: 032] Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [15:35:40] (03Merged) 10jenkins-bot: Make VE RB URLs domain-relative [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291349 (https://phabricator.wikimedia.org/T135171) (owner: 10Alex Monk) [15:36:39] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (10faidon) [15:36:59] !log dcausse@tin Synchronized wmf-config/CommonSettings.php: Make VE RB URLs domain-relative (duration: 00m 26s) [15:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:13] Krenair: check please ^ [15:38:07] hm, it doesn't seem to have had any effect [15:38:20] hmm... let me check with eval [15:39:03] it's fine with eval [15:39:05] maybe RL caching [15:39:19] should I do something? 
[15:40:32] confirmed it's RL caching [15:40:45] ok, thanks for checking [15:40:49] https://en.wikipedia.org/w/load.php?debug=false&lang=en-gb&modules=startup&only=scripts&skin=vector - old version, then you set debug=true and you get the new one [15:42:19] now it works [15:42:31] (03CR) 10jenkins-bot: [V: 04-1] ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [15:42:39] ok [15:43:04] (03PS2) 10Alexandros Kosiaris: Introduce ores.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/277725 (https://phabricator.wikimedia.org/T124202) [15:49:46] (03CR) 10Filippo Giunchedi: "two cents" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291752 (https://phabricator.wikimedia.org/T132835) (owner: 10Ema) [15:53:03] 06Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338977 (10faidon) [16:00:20] ACKNOWLEDGEMENT - Host mc2001 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T136558 [16:13:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6201184 keys - replication_delay is 0 [16:16:28] (03PS2) 10Giuseppe Lavagetto: base::grub: fix the ioscheduler setting [puppet] - 10https://gerrit.wikimedia.org/r/291706 [16:16:30] (03PS2) 10Giuseppe Lavagetto: base::grub: actually use augeas on jessie [puppet] - 10https://gerrit.wikimedia.org/r/291707 [16:16:45] <_joe_> jynus, paravoid I finally did it I think [16:17:28] <_joe_> I found where the true augeas docs are located :P [16:17:51] <_joe_> specifically https://github.com/hercules-team/augeas/wiki/Path-expressions [16:19:05] let me add that to wikitech [16:20:30] (03CR) 10Giuseppe Lavagetto: [C: 031] "this new version actually does the right thing with augeas, but not with grep/sed (which are untouched)." [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [16:21:29] PROBLEM - DPKG on mw1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:21:43] (03PS1) 10Muehlenhoff: Stop using package->latest in gerrit module [puppet] - 10https://gerrit.wikimedia.org/r/291762 (https://phabricator.wikimedia.org/T115348) [16:23:29] RECOVERY - DPKG on mw1020 is OK: All packages OK [16:29:04] 06Operations, 06Analytics-Kanban: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2339050 (10mforns) [16:29:40] (03PS1) 10Muehlenhoff: Stop using package->latest in ganglia monitor [puppet] - 10https://gerrit.wikimedia.org/r/291764 (https://phabricator.wikimedia.org/T115384) [16:29:55] 06Operations, 06Analytics-Kanban: Jmxtrans failures on Kafka hosts caused metric holes in grafana - https://phabricator.wikimedia.org/T136405#2333507 (10mforns) @elukey Can you clarify what is the action to do in this task? Thanks! [16:30:58] moritzm: if you're removing all the ensure latests, feel free to skip the RAID one, cf. Ia16b7ad8ad281640fe18fe77cb781d2480af54dc [16:31:25] aka https://gerrit.wikimedia.org/r/#/c/290999/ [16:32:30] ok! 
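To make the package->latest cleanup just discussed concrete: `present` installs a package once and then leaves it alone, while `latest` silently upgrades on any later agent run where the archive offers a newer candidate, which is exactly the unreviewed-change risk the audit in T115348 is weeding out. A minimal sketch, runnable with a local puppet apply; the package name is only an example:

```
# Install if missing, then never touch it again:
puppet apply --noop -e 'package { "mpt-status": ensure => present }'

# Install *and* upgrade whenever the archive moves; a new version rolls
# out fleet-wide on the next agent run without any review:
puppet apply --noop -e 'package { "mpt-status": ensure => latest }'
```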
[16:45:19] (03PS1) 10Mobrovac: Math: Enable MathML everywhere but private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291766 (https://phabricator.wikimedia.org/T131177) [16:49:28] 06Operations, 10Datasets-General-or-Unknown: investigate rsync between dcs with encryption - https://phabricator.wikimedia.org/T123560#2339089 (10ArielGlenn) [16:49:30] 06Operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#2339088 (10ArielGlenn) [16:49:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T1700). Please do the needful. [17:01:09] SMalyshev: as far as I know, you should not be here today. So no deployment of WDQS. If there is anything to push, let me know and we'll find the time [17:02:06] (03CR) 10Faidon Liambotis: [C: 04-1] base::grub: fix the ioscheduler setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [17:04:28] <_joe_> heh I forgot to push the correction to the comment :P [17:04:50] <_joe_> y'all talking in my ears got me distracted [17:06:08] (03CR) 10Filippo Giunchedi: [C: 04-1] "looks like "labs-instances" hashes are used in url_downloader too, it'll need to be renamed to use labs realm" [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [17:07:20] (03PS5) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [17:07:22] (03PS5) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [17:07:24] (03PS5) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [17:07:26] (03PS5) 10Faidon Liambotis: raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 [17:07:28] (03PS5) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [17:07:56] paravoid: just rebase? [17:08:06] rebase & reorder [17:08:13] i'll merge the check-raid changes first [17:11:19] ok I'll take a look now [17:11:52] if jenkins wasn't completely bonkers these days [17:13:08] (03CR) 10Faidon Liambotis: [C: 032] raid: slightly change check-raid's "utility" names [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [17:13:14] (03CR) 10Volans: "See inline comments."
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/291011 (owner: 10Faidon Liambotis) [17:13:23] whoops [17:13:37] and of course you're absolutely right [17:15:09] :) [17:15:44] gehel: greg-g: Needs a deployment asap for a regression with rollback functionality - https://gerrit.wikimedia.org/r/#/c/291768/ [17:15:52] (03PS1) 10Faidon Liambotis: raid: brown-paper bag fix on check-raid.py [puppet] - 10https://gerrit.wikimedia.org/r/291770 [17:16:26] (03PS6) 10Faidon Liambotis: raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 [17:16:38] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: brown-paper bag fix on check-raid.py [puppet] - 10https://gerrit.wikimedia.org/r/291770 (owner: 10Faidon Liambotis) [17:17:00] * Krinkle guesses the US holiday means greg isn't here [17:17:09] good guess :) [17:17:18] Krinkle: how can I help? [17:17:26] gehel: Are you deploying anything from tin? [17:17:47] volans: are you reviewing the rest too? [17:17:54] Krinkle: not at the moment, nothing to deploy for WDQS today [17:17:59] paravoid: yes [17:17:59] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:18:03] gehel: okay, I'm taking the slot then :) [17:18:03] k, I'll wait [17:18:10] Krinkle, you know it's also a UK bank holiday [17:18:15] Krinkle: go ahead and good luck! [17:18:17] I don't work at a bank [17:18:26] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291012 (owner: 10Faidon Liambotis) [17:19:17] I'm slightly worried that jenkins consistently takes 18 minutes to check that change and then fails [17:20:20] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 614 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6207945 keys - replication_delay is 614 [17:22:28] (03PS3) 10Giuseppe Lavagetto: base::grub: fix the ioscheduler setting [puppet] - 10https://gerrit.wikimedia.org/r/291706 [17:22:30] (03PS3) 10Giuseppe Lavagetto: base::grub: actually use augeas on jessie [puppet] - 10https://gerrit.wikimedia.org/r/291707 [17:22:32] (03PS1) 10Giuseppe Lavagetto: base::grub: allow enabling the memory cgroup controller [puppet] - 10https://gerrit.wikimedia.org/r/291772 [17:23:28] (03CR) 10Faidon Liambotis: [C: 032] raid: move check-raid.py into /usr/local/lib/nagios/plugins [puppet] - 10https://gerrit.wikimedia.org/r/291012 (owner: 10Faidon Liambotis) [17:23:51] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2339183 (10jcrespo) a:05RobH>03elukey @elukey will have a detailed look at this this week. Please reassign it to m... [17:24:12] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2339186 (10jcrespo) 05stalled>03Open [17:24:44] (03CR) 10Giuseppe Lavagetto: base::grub: fix the ioscheduler setting (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [17:24:53] (03CR) 10Volans: "If I understand it correctly require_package() does an ensure => present while before we were doing an ensure => latest."
[puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [17:27:08] !log krinkle@tin Synchronized php-1.28.0-wmf.3/includes/api/ApiQueryRevisions.php: T136375 (duration: 00m 52s) [17:27:09] T136375: Rollback T88044 (broke rollback-related utilities) - https://phabricator.wikimedia.org/T136375 [17:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:16] (03CR) 10Krinkle: [C: 031] Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 (owner: 10Ori.livneh) [17:29:06] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2339200 (10jcrespo) 05stalled>03Open a:05RobH>03jcrespo This was approved today on the operations meeting, and I personally will be fo... [17:29:15] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stats for Ladsgroup - https://phabricator.wikimedia.org/T136417#2339220 (10jcrespo) a:05Ladsgroup>03jcrespo [17:34:04] (03PS2) 10Ori.livneh: Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 [17:34:26] (03CR) 10Ori.livneh: [C: 032] Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 (owner: 10Ori.livneh) [17:35:15] (03Merged) 10jenkins-bot: Drop dependency on wikimedia/cdb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291681 (owner: 10Ori.livneh) [17:40:18] ori: Previously, the CDB from mediawiki-vendor was unexposed because config loads it first. [17:40:28] (03CR) 10Volans: [C: 04-1] "Leftover of the cleaning" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [17:40:45] Krinkle: right, it was a mess. [17:40:54] it's lazy loaded and since class_exists returns true it will never ask MediaWiki's autoloader [17:41:14] and since php's autoloader extension design is function-based (not registry based) it means it also doesn't conflict [17:41:15] !log Synced composer.{json,lock} and multiversion for I5ac86f190b [17:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:16] (03CR) 10Volans: raid: add monitoring for HP controllers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [17:59:23] I'd like to scap a bug fix for wikidata [17:59:39] ori, Krinkle: are you done?
[17:59:43] yes [17:59:53] Yes [18:01:40] (03PS6) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [18:01:42] (03PS6) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [18:01:44] (03PS6) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [18:03:37] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:04:09] (03PS7) 10Faidon Liambotis: raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) [18:04:14] last one hopefully! [18:04:17] sorry volans :) [18:04:17] paravoid: did you see my comment about require_package()? [18:04:19] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6195029 keys - replication_delay is 0 [18:04:24] no prob :) [18:04:52] (03CR) 10Faidon Liambotis: [C: 032] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:05:41] (03CR) 10Faidon Liambotis: [V: 032] raid: vary package installation on the RAID installed [puppet] - 10https://gerrit.wikimedia.org/r/290999 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:07:21] I'm salt rm'ing /usr/local/bin/check-raid.py in the meantime [18:08:54] ok, see also my last question above [18:09:03] oh, yes [18:09:07] yes, that's intended [18:09:10] ensure => latest is evil [18:09:49] yeah, can do bad things [18:09:57] just wanted to check [18:10:11] nod [18:10:32] and about 291014 I think you can avoid some lines [18:10:48] but if you want to keep it the same as the Debian one, that's fine too [18:11:26] yeah, I'd like that [18:13:37] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [18:14:06] !log jzerebecki@tin Synchronized php-1.28.0-wmf.3/extensions/Wikidata/vendor/wikibase/javascript-api/src/getLocationAgnosticMwApi.js: Wikidata WikibaseJavaScriptApi: Fix getLocationAgnosticMwApi behavior in Internet Explorer b6ae82c71af3d9361cfb9e8d4e6e45bcd5ee9b26 1 of 2 T136543 (duration: 00m 26s) [18:14:07] T136543: [Bug] unable to edit in IE - https://phabricator.wikimedia.org/T136543 [18:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:47] !log jzerebecki@tin Synchronized php-1.28.0-wmf.3/extensions/Wikidata/vendor/wikibase/javascript-api/WikibaseJavaScriptApi.php: Wikidata WikibaseJavaScriptApi: Fix getLocationAgnosticMwApi behavior in Internet Explorer b6ae82c71af3d9361cfb9e8d4e6e45bcd5ee9b26 2 of 2 T136543 (duration: 00m 24s) [18:15:48] T136543: [Bug] unable to edit in IE - https://phabricator.wikimedia.org/T136543 [18:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:17:59] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 633 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6195383 keys - replication_delay is 633 [18:18:15] done [18:24:44] paravoid: sorry I have been quite
busy [18:24:49] going to fix pplint-HEAD [18:29:05] (03CR) 10Alexandros Kosiaris: "scap::target is being declared via service::uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [18:33:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6193466 keys - replication_delay is 0 [18:47:28] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 623 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6193776 keys - replication_delay is 623 [18:50:06] !log mwscript deleteEqualMessages.php --wiki nvwiki (T45917) [18:50:07] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [18:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:55] (03PS7) 10Faidon Liambotis: raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) [18:53:04] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: setup multiple checks, one per each RAID found [puppet] - 10https://gerrit.wikimedia.org/r/291013 (https://phabricator.wikimedia.org/T84050) (owner: 10Faidon Liambotis) [18:53:59] jynus: thanks :) [19:05:25] PROBLEM - MPT RAID on ms-fe1004 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:05:33] hmm? [19:05:36] PROBLEM - MD RAID on dbproxy1008 is CRITICAL: NRPE: Command check_raid_md not defined [19:05:56] PROBLEM - MPT RAID on dbproxy1008 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:06:26] PROBLEM - MD RAID on mw1260 is CRITICAL: NRPE: Command check_raid_md not defined [19:06:36] PROBLEM - MPT RAID on mw1260 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:07:17] PROBLEM - MD RAID on silver is CRITICAL: NRPE: Command check_raid_md not defined [19:07:26] PROBLEM - MD RAID on eventlog2001 is CRITICAL: NRPE: Command check_raid_md not defined [19:07:36] PROBLEM - MPT RAID on silver is CRITICAL: NRPE: Command check_raid_mpt not defined [19:07:46] PROBLEM - MPT RAID on eventlog2001 is CRITICAL: NRPE: Command check_raid_mpt not defined [19:08:43] paravoid: :P [19:10:42] (03PS1) 10Faidon Liambotis: raid: fix circular dependency [puppet] - 10https://gerrit.wikimedia.org/r/291780 [19:11:04] hrm [19:11:24] that's quite odd [19:11:28] is it? [19:11:39] I think it's normal [19:11:54] require_packages creates an implicit package -> container class dependency [19:12:17] ah, right, and thus the before => Package['mpt-status'], creates a dependency [19:12:27] yeah [19:12:28] why does it need to be there before the package? [19:12:41] because the package sends an email upon installation otherwise [19:12:53] heh [19:12:59] annoying :) [19:14:05] ok, brb [19:14:07] dinner [19:14:09] bye
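The cycle just described is easy to reproduce standalone. require_package() (a helper in the WMF puppet tree) wraps the package in its own class and makes the calling class depend on it, so any resource inside the caller that declares before => Package[...] closes the loop. A sketch under those assumptions; require_package() itself is not available outside the repo, so a plain class require stands in for it:

```
# puppet apply aborts with "Found 1 dependency cycle" before changing anything:
puppet apply --noop -e '
  class packages { package { "mpt-status": ensure => present } }
  class raid {
    require packages                      # stand-in for require_package()
    file { "/etc/default/mpt-statusd":    # config wanted in place first,
      content => "RUN_DAEMON=no\n",       # so the postinst does not mail us
      before  => Package["mpt-status"],   # ...which completes the cycle
    }
  }
  include raid
'
```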
[19:22:22] (03CR) 10Ladsgroup: "Another thing: What about the worker nodes? the ores::worker seems to be not using the scap::target (since it doesn't use service::uwsgi) " [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:22:34] akosiaris: ^ [19:22:37] if you're around [19:29:20] (03PS1) 10Aklapper: Weekly Phabricator email: List archived projects with open tasks [puppet] - 10https://gerrit.wikimedia.org/r/291781 (https://phabricator.wikimedia.org/T133649) [19:30:09] (03PS1) 10Gergő Tisza: [HOLD] Enable AuthManager on beta wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291782 (https://phabricator.wikimedia.org/T135504) [19:31:21] (03CR) 10Aklapper: "Tested locally.
(I have no idea if this will be performant enough on the production instance.)" [puppet] - 10https://gerrit.wikimedia.org/r/291781 (https://phabricator.wikimedia.org/T133649) (owner: 10Aklapper) [19:34:05] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting deployment access (for deploying to scb) for Ladsgroup - https://phabricator.wikimedia.org/T136406#2339566 (10Ladsgroup) Thanks :) [19:39:31] (03PS1) 10Ppchelko: Change-Prop: White-list user-agent header in http filter [puppet] - 10https://gerrit.wikimedia.org/r/291784 [19:42:29] and back [19:42:31] both me and icinga-wm :) [19:43:23] (03PS7) 10Faidon Liambotis: raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) [19:44:13] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: add monitoring for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/291014 (https://phabricator.wikimedia.org/T97998) (owner: 10Faidon Liambotis) [19:44:35] (03CR) 10Alexandros Kosiaris: "they are gonna be on the same nodes for now so it shouldn't be a blocker for right now" [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:44:43] (03CR) 10Faidon Liambotis: [C: 031] "Nice work!" [puppet] - 10https://gerrit.wikimedia.org/r/291706 (owner: 10Giuseppe Lavagetto) [19:44:49] (03CR) 10Faidon Liambotis: [C: 031] base::grub: actually use augeas on jessie [puppet] - 10https://gerrit.wikimedia.org/r/291707 (owner: 10Giuseppe Lavagetto) [19:45:47] (03CR) 10Faidon Liambotis: [C: 031] "Can we just call it "subnets"? :)" [puppet] - 10https://gerrit.wikimedia.org/r/291263 (owner: 10Alexandros Kosiaris) [19:47:04] (03PS12) 10Faidon Liambotis: network::constants: split off labs into its own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [19:47:34] (03CR) 10Faidon Liambotis: [C: 031] "Yes on the principle, modulo Filippo's concern." [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [19:48:33] (03CR) 10Faidon Liambotis: [C: 031] "I'd nitpick and say to call it "wmnet" (or something), but that could follow in a separate patch." [puppet] - 10https://gerrit.wikimedia.org/r/291234 (owner: 10Alexandros Kosiaris) [19:49:16] (03CR) 10Jforrester: "Put this in SWAT this afternoon?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288702 (https://phabricator.wikimedia.org/T135266) (owner: 10Catrope) [19:50:05] (03PS1) 10Alexandros Kosiaris: sca: remove cxserver-admin [puppet] - 10https://gerrit.wikimedia.org/r/291785 [19:53:21] (03PS53) 10Alexandros Kosiaris: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [19:54:30] (03CR) 10Faidon Liambotis: [C: 04-1] "Nice work :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:00:05] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T2000). Please do the needful. [20:02:20] yay [20:02:28] lots of HP warnings coming in [20:02:38] great [20:02:54] db1074 - WARNING: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, Controller, Battery/Capacitor - Not Configured: Cache [20:03:40] do we have a parser to read it?
:) [20:03:52] I think this means "no BBU configured" [20:04:49] Controller Status: OK [20:04:49] Cache Status: Not Configured [20:04:49] Battery/Capacitor Status: OK [20:04:53] that's db1074 [20:05:40] * volans looking [20:06:39] (03CR) 10Alexandros Kosiaris: "Regarding naming, I am open to anything. The function is indeed slicing arbitrary parts of network::constants hence the naming, hence the " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:06:40] Cache Status: Not Configured [20:06:40] Cache Ratio: 100% Read / 0% Write [20:06:50] yes, I think we are looking at the same commands [20:06:54] PROBLEM - HP RAID on db2034 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Cache, Battery/Capacitor [20:07:02] and a failed disk! [20:07:02] yay :) [20:09:25] Caching: Disabled on the logical drive too [20:09:40] (03PS1) 10Gehel: Keep osmosis osm_expire files for a month [puppet] - 10https://gerrit.wikimedia.org/r/291788 (https://phabricator.wikimedia.org/T136577) [20:11:01] paravoid: the pplint-HEAD taking ages to run is fixed / hacked :D https://phabricator.wikimedia.org/T133816 [20:11:03] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/291565 (owner: 10Ladsgroup) [20:11:23] (03CR) 10Gehel: [C: 032] Keep osmosis osm_expire files for a month [puppet] - 10https://gerrit.wikimedia.org/r/291788 (https://phabricator.wikimedia.org/T136577) (owner: 10Gehel) [20:11:26] hashar: <3 [20:11:57] paravoid: Tyler noticed that a few weeks ago but we had trouble understanding why it suddenly happened ... That will remain a mystery probably [20:16:49] (03PS1) 10Faidon Liambotis: raid: fix sudo rules for hpssacli (mostly for ms-be) [puppet] - 10https://gerrit.wikimedia.org/r/291791 [20:18:46] ok, re-ran puppet on neon, another batch of checks should soon appear [20:20:46] 121 HP checks in total [20:22:15] (03PS2) 10Faidon Liambotis: raid: fix sudo rules for hpssacli (mostly for ms-be) [puppet] - 10https://gerrit.wikimedia.org/r/291791 [20:22:21] (03CR) 10Faidon Liambotis: [C: 032 V: 032] raid: fix sudo rules for hpssacli (mostly for ms-be) [puppet] - 10https://gerrit.wikimedia.org/r/291791 (owner: 10Faidon Liambotis) [20:23:08] PROBLEM - HP RAID on labvirt1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:23:58] PROBLEM - HP RAID on lvs2006 is CRITICAL: CRITICAL: Slot 0: bad transfer speed: 1I:1:2(6.0Gbps) - OK: 1I:1:2, Controller, Cache, Battery/Capacitor - Failed: 1I:1:1 [20:24:08] PROBLEM - HP RAID on labvirt1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:11] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2339694 (10Volans) [20:24:28] PROBLEM - HP RAID on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:37] ACKNOWLEDGEMENT - HP RAID on db2034 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Cache, Battery/Capacitor Volans https://phabricator.wikimedia.org/T136583 [20:24:38] PROBLEM - HP RAID on ms-be2016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:47] PROBLEM - HP RAID on ms-be2017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:24:57] hrm [20:25:17] PROBLEM - HP RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:25:25] do you need to re-run puppet on the hosts? [20:25:28] PROBLEM - HP RAID on ms-be2018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:29] I did [20:25:37] PROBLEM - HP RAID on ms-be2021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:37] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:37] PROBLEM - HP RAID on ms-be1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:48] PROBLEM - HP RAID on ms-be1018 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:25:57] PROBLEM - HP RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:04] it runs, it's just too many disks and it takes too long :( [20:26:09] real 0m10.699s [20:26:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 711 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6200053 keys - replication_delay is 711 [20:26:49] ok, then we can adjust the timeout for this check [20:27:42] yeah [20:27:47] PROBLEM - HP RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:28:02] (03CR) 10Alexandros Kosiaris: "addressed Filippo's concern in PS13" [puppet] - 10https://gerrit.wikimedia.org/r/291219 (owner: 10Alexandros Kosiaris) [20:28:22] any luck finding... if we have misconfigured BBUs all across the fleet? :) [20:28:26] (03PS5) 10Alexandros Kosiaris: networks::constants: Hieraize all_network_subnets [puppet] - 10https://gerrit.wikimedia.org/r/291263 [20:28:28] (03PS27) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [20:28:30] (03PS13) 10Alexandros Kosiaris: network::constants Split off labs into it's own realm [puppet] - 10https://gerrit.wikimedia.org/r/291219 [20:28:32] (03PS1) 10Alexandros Kosiaris: networks::constants: use slice_network_constants [puppet] - 10https://gerrit.wikimedia.org/r/291819 [20:28:51] db20xx seem to all be happy [20:28:58] (apart from 2034, obviously) [20:29:01] paravoid: I'm checking if by any chance the default for SSD disks is disabled [20:29:04] in the manual [20:29:08] PROBLEM - HP RAID on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:16] (03CR) 10Alexandros Kosiaris: "done. I 've renamed the hiera variable, the puppet variable needs some more refactoring, to be done in a later patch" [puppet] - 10https://gerrit.wikimedia.org/r/291263 (owner: 10Alexandros Kosiaris) [20:29:21] the others have spinning AFAIK [20:29:27] hmm [20:30:00] lvs2006 has a broken disk too, I'll open a task for it too [20:30:09] yeah, thanks! 
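The patch that follows is the timeout bump just discussed: check_nrpe gives up after 10 seconds by default, so a plugin that legitimately needs ~10.7s to poll all the disks on a box comes back as "Socket timeout" instead of a real status. A sketch of how to verify from the monitoring host; the hostname is only an example, paths are the Debian defaults, and -t must also stay under the server-side command timeout in nrpe.cfg:

```
# Default 10s client-side timeout; times out on the big swift backends:
time /usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet -c check_hpssacli

# With the doubled timeout, matching the change below:
/usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet -c check_hpssacli -t 20
```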
[20:31:04] (03PS1) 10Faidon Liambotis: raid: double NRPE timeout for check_hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/291820 [20:31:32] (03PS2) 10Faidon Liambotis: raid: double NRPE timeout for check_hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/291820 [20:32:24] http://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/Problem-with-configure-cache-Cache-Status-Not-Configured/td-p/5348173 [20:33:58] (03CR) 10Faidon Liambotis: [C: 032] raid: double NRPE timeout for check_hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/291820 (owner: 10Faidon Liambotis) [20:34:26] ACKNOWLEDGEMENT - HP RAID on lvs2006 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - bad transfer speed: 1I:1:2(6.0Gbps) - OK: 1I:1:2, Controller, Cache, Battery/Capacitor Volans https://phabricator.wikimedia.org/T136584 [20:34:28] 06Operations, 10ops-codfw: lvs2006 degraded RAID - https://phabricator.wikimedia.org/T136584#2339714 (10Volans) [20:37:01] paravoid: yes, I was looking at the same thing in the manual, although I haven't yet found the point where it says so explicitly [20:37:18] and we have LD Acceleration Method: HP SSD Smart Path on those, let me do a larger check with salt [20:44:36] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6185501 keys - replication_delay is 0 [20:48:08] paravoid: from what I've read so far SSD Smart Path should be better than traditional caching for SSDs in particular for reads. Of course only a benchmark with our specific workload could give us the final answer [20:48:25] looks like we have to patch the check to handle this case too [20:49:33] RECOVERY - HP RAID on labvirt1001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18, Controller, Cache, Battery/Capacitor [20:55:20] (03PS1) 10Gehel: Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) [20:55:22] PROBLEM - HP RAID on labvirt1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:55:45] (03CR) 10Halfak: [C: 04-1] "We need a good way to distinguish the ores-web (uwsgi) from ores-worker (celery)" [puppet] - 10https://gerrit.wikimedia.org/r/291751 (owner: 10Alexandros Kosiaris) [20:57:42] RECOVERY - HP RAID on labvirt1004 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18, Controller, Cache, Battery/Capacitor [21:03:47] (03PS2) 10Gehel: Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) [21:07:41] (03CR) 10Gehel: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/2993/" [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) (owner: 10Gehel) [21:08:12] (03PS3) 10Gehel: Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) [21:09:00] (03CR) 10Hashar: [C: 031] "Legacy / tech debt I guess. Today the security upgrades are managed by ops cluster wide, so they would notice and upgrade as needed."
[puppet] - 10https://gerrit.wikimedia.org/r/291762 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [21:09:55] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (10hashar) @Dzahn you might want to move the table to the task description so that anyone can amend it as needed :-) [21:10:09] (03CR) 10Gehel: [C: 032] Increase the number of workers for osm2pgsql. [puppet] - 10https://gerrit.wikimedia.org/r/291825 (https://phabricator.wikimedia.org/T136578) (owner: 10Gehel) [21:10:57] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#2339786 (10hashar) [21:12:32] 06Operations, 13Patch-For-Review: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1722037 (10hashar) [21:30:15] (03PS1) 10Faidon Liambotis: raid/hpssacli: don't barf on SATA + 6Gbps speed [puppet] - 10https://gerrit.wikimedia.org/r/291828 [21:30:17] (03PS1) 10Faidon Liambotis: raid/hpssacli: don't barf on HP SSD Smart Path configs [puppet] - 10https://gerrit.wikimedia.org/r/291829 [21:30:18] volans: if you're still here ^^^ [21:30:32] * volans looking [21:32:44] RECOVERY - HP RAID on labvirt1008 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 2I:1:14, 2I:1:15, 2I:1:16, 2I:1:17, 2I:1:18, Controller, Cache, Battery/Capacitor [21:33:48] paravoid: what I'm not sure about is if with this smart array it actually acts as a normal BBU for writes on RAID != 0 [21:34:39] because otherwise we need to be aware of it and decide at OS/application level different approaches (OS scheduler, application scheduler, etc...) that right now assume a BBU [21:34:50] what do you mean? [21:36:23] that if we have a DB with scheduler noop and mysql configured with IO_DIRECT and this smart thingy doesn't cache the writes in the BBU we are no longer protected if a crash happens [21:39:07] it's probably write-through in that case [21:39:10] that's the concept I think [21:39:24] in any case, that's going to be a configuration issue, not a health issue [21:39:32] and while we can alert on that too, that should probably be a separate thing [21:40:35] sure, makes sense [21:41:08] btw when did you change line 248 (old file) for the wrong speed, I don't see the diff in gerrit but it is updated :) [21:41:32] uh? [21:41:36] it's a separate diff [21:41:38] https://gerrit.wikimedia.org/r/#/c/291828/1/modules/raid/files/dsa-check-hpssacli [21:41:59] one depends on the other, I missed this one [21:42:00] thx [21:42:32] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291828 (owner: 10Faidon Liambotis) [21:42:45] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: don't barf on SATA + 6Gbps speed [puppet] - 10https://gerrit.wikimedia.org/r/291828 (owner: 10Faidon Liambotis) [21:43:16] ack to merge the other one too? [21:43:50] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/291829 (owner: 10Faidon Liambotis) [21:44:07] (03CR) 10Faidon Liambotis: [C: 032] raid/hpssacli: don't barf on HP SSD Smart Path configs [puppet] - 10https://gerrit.wikimedia.org/r/291829 (owner: 10Faidon Liambotis) [21:44:53] FYI: http://h20195.www2.hp.com/v2/GetPDF.aspx/4AA4-8144ENW.pdf [21:47:57] looks like for writes on RAID!=0 it behaves like a normal controller...
but it's still not clear [21:50:03] RECOVERY - HP RAID on ms-be1021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:50:20] yay [21:50:23] RECOVERY - HP RAID on ms-be2018 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:50:24] RECOVERY - HP RAID on ms-be2021 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:51:03] RECOVERY - HP RAID on ms-be2019 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:51:04] RECOVERY - HP RAID on ms-be1017 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [21:51:04] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [21:51:23] RECOVERY - HP RAID on ms-be1019 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:51:24] RECOVERY - HP RAID on ms-be1018 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor [21:51:43] RECOVERY - HP RAID on ms-be2016 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:52:04] RECOVERY - HP RAID on ms-be2020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:52:13] RECOVERY - HP RAID on ms-be2017 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:52:50] clear [21:52:57] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2339917 (10Nuria) [21:53:00] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10Traffic, 13Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2339916 (10Nuria) 05Open>03Resolved [21:56:03] RECOVERY - HP RAID on ms-be1020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [21:59:52] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=hp+raid [22:00:04] all but the two you already ack'ed are OK :) [22:00:49] 06Operations, 10Monitoring, 13Patch-For-Review: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#2339938 (10faidon) [22:00:51] 06Operations, 10Monitoring, 13Patch-For-Review: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#2339936 (10faidon) 05Open>03Resolved It took a while but this is finally done. We now have 123 RAID checks for HP systems. [22:01:03] yep!
all good [22:01:27] I've sent an email to jaime for the DB ones and the smart thing [22:01:38] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2339943 (10faidon) [22:01:40] 06Operations, 10DBA, 13Patch-For-Review: investigate RAID BBU auto-learn on db hosts - https://phabricator.wikimedia.org/T84178#2339944 (10faidon) [22:01:42] 06Operations, 10Monitoring, 13Patch-For-Review: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#922009 (10faidon) 05Open>03Resolved a:03faidon This is now all done :) [22:01:44] cool [22:01:56] I linked the two patches to DSA too, I'd like to see those merged upstream [22:02:13] PROBLEM - Disk space on ms-be2012 is CRITICAL: DISK CRITICAL - free space: / 2111 MB (3% inode=96%) [22:03:47] paravoid: syslog filling up very quickly [22:04:27] 18G May 30 22:03 syslog, 21G the one from yesterday [22:04:29] (03CR) 10Yuvipanda: "^ was the reason I introduced this." [puppet] - 10https://gerrit.wikimedia.org/r/291751 (owner: 10Alexandros Kosiaris) [22:04:46] yeah [22:05:22] (03CR) 10Yuvipanda: "This will affect *all* uwsgi defined services, all of which will need a manual stopping-of-old-service and starting-of-new-service, along " [puppet] - 10https://gerrit.wikimedia.org/r/291751 (owner: 10Alexandros Kosiaris) [22:06:05] RECOVERY - Disk space on ms-be2012 is OK: DISK OK [22:06:17] did you just delete the syslog.1? :D [22:06:33] yeah :) [22:06:36] whatever [22:06:38] it's just logs [22:06:51] access logs I mean [22:07:00] it should not log there, if there is any issue with those machines it's impossible to find it in syslog [22:07:31] I agree :) [22:08:14] looks like it started a few days ago [22:08:49] (03PS1) 10Jforrester: BetaFeatures: Bump dates, list departments, drop now-graduated Notifications one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291836 [22:09:00] syslog.5.gz 20MB, syslog.4.gz 240MB, syslog.3.gz 1.2GB, syslog.2.gz 2.2GB [22:09:04] (03CR) 10Yuvipanda: "This is awesome!"
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/291525 (owner: 10BryanDavis) [22:09:44] (03PS4) 10Yuvipanda: k8s: Ensure /etc/kubernetes is present wherever required [puppet] - 10https://gerrit.wikimedia.org/r/291617 [22:10:02] (03CR) 10Yuvipanda: [C: 032] k8s: Ensure /etc/kubernetes is present wherever required [puppet] - 10https://gerrit.wikimedia.org/r/291617 (owner: 10Yuvipanda) [22:10:19] (03CR) 10Yuvipanda: [V: 032] k8s: Ensure /etc/kubernetes is present wherever required [puppet] - 10https://gerrit.wikimedia.org/r/291617 (owner: 10Yuvipanda) [22:29:45] (03PS1) 10Yuvipanda: tools: Allow bastions to talk to flannel etcd [puppet] - 10https://gerrit.wikimedia.org/r/291841 (https://phabricator.wikimedia.org/T136413) [22:30:14] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Allow bastions to talk to flannel etcd [puppet] - 10https://gerrit.wikimedia.org/r/291841 (https://phabricator.wikimedia.org/T136413) (owner: 10Yuvipanda) [22:41:23] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 673 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6198818 keys - replication_delay is 673 [22:58:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6168860 keys - replication_delay is 0 [23:00:05] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160530T2300). Please do the needful. [23:00:05] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:19] I'll do it [23:00:23] It's just one config patch [23:01:05] (03CR) 10Catrope: [C: 032] BetaFeatures: Bump dates, list departments, drop now-graduated Notifications one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291836 (owner: 10Jforrester) [23:01:48] (03Merged) 10jenkins-bot: BetaFeatures: Bump dates, list departments, drop now-graduated Notifications one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/291836 (owner: 10Jforrester) [23:04:27] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Update BetaFeatures whitelist (duration: 00m 32s) [23:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
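Back to the ms-be2012 syslog flood from earlier in the evening: a quick way to confirm which tag is responsible, so the fix can land in that daemon's logging config rather than in /var/log. A sketch, assuming the stock "MMM dd HH:MM:SS host tag:" rsyslog line layout:

```
# Tally messages per syslog tag (field 5 in the default layout) to find
# the flooder:
awk '{print $5}' /var/log/syslog | sort | uniq -c | sort -rn | head

# And keep an eye on how fast the current file is growing:
ls -lh /var/log/syslog*
```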