[00:51:40] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 332 seconds
[00:51:42] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 332 seconds
[00:52:49] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:52:52] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[01:01:20] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:04:59] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:09:50] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:12:19] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:21:10] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:19] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:43:30] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:09] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[03:33:09] PROBLEM - puppet last run on titanium is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:39] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:49] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:50] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:00] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:09] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:10] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:30] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:43:09] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail
[03:51:40] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[03:51:49] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[03:51:50] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[03:52:00] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:52:00] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:52:09] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:52:10] RECOVERY - puppet last run on titanium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[03:52:29] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:02:10] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[05:22:38] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.014 second response time
[05:23:47] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time
[06:28:47] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:57] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:58] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:08] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:08] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:48] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:49] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:57] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:45:29] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:48] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:57] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:47:08] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:48:07] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:05:17] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:18:18] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:30:49] ops-core: Graph data missing for "MediaWiki: Total Backend Latency" - https://phabricator.wikimedia.org/T85316#953702 (jeremyb-phone)
[07:33:58] ops-core: Graph data missing for "MediaWiki: Total Backend Latency" - https://phabricator.wikimedia.org/T85316#953705 (jeremyb-phone) related to {T85641} ?
[07:36:17] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:18:36] (CR) GOIII: [C: 1] Add en.wikisource to global abuse filters [mediawiki-config] - https://gerrit.wikimedia.org/r/179864 (owner: TTO)
[12:35:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[12:49:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[13:28:48] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[13:36:38] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 0 down 0
[13:40:08] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[13:40:43] hmm
[13:46:28] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:49:26] !log dbproxy1002 failed m2-master traffic over to m2-slave. services up. investigating cause
[13:49:32] Logged the message, Master
[14:05:39] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:29:26] !log xtrabackup clone db1020 to db2011
[14:29:31] Logged the message, Master
[14:38:47] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Puppet has 1 failures
[14:56:57] RECOVERY - puppet last run on radon is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[15:11:58] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 0 down 0
[15:16:48] PROBLEM - haproxy alive on dbproxy1002 is CRITICAL: CRITICAL check_alive invalid response
[15:17:57] RECOVERY - haproxy alive on dbproxy1002 is OK: OK check_alive uptime 3696189s
[15:23:27] !log limiting exim/otrs concurrent connections on m2-master to 250
[15:23:35] Logged the message, Master
[15:42:05] ops-network: Dear network@rt.wikimedia.org, No Publication Fee for AASCIT Members - https://phabricator.wikimedia.org/T85765#953945 (emailbot)
[15:45:18] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[15:50:28] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[15:53:30] (PS1) Springle: reassign db1057 to s1 [puppet] - https://gerrit.wikimedia.org/r/182692
[15:59:41] !log upgrade db1057 trusty
[15:59:45] Logged the message, Master
[16:05:37] PROBLEM - Varnishkafka Delivery Errors on cp3017 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 291.049988
[16:08:38] RECOVERY - Varnishkafka Delivery Errors on cp3017 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[16:19:30] (CR) Springle: [C: 2] reassign db1057 to s1 [puppet] - https://gerrit.wikimedia.org/r/182692 (owner: Springle)
[16:47:40] (CR) Springle: [C: 2] reassign db1057 to s1 [puppet] - https://gerrit.wikimedia.org/r/182692 (owner: Springle)
[16:47:47] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:49:48] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57864 bytes in 0.265 second response time
[16:49:55] !log restarted zuul
[16:49:59] Logged the message, Master
[16:53:48] ops-network: Dear network@rt.wikimedia.org, No Publication Fee for AASCIT Members - https://phabricator.wikimedia.org/T85765#953958 (MZMcBride) Open>Invalid a:MZMcBride This appears to be spam.
[17:05:47] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:07:58] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57869 bytes in 5.187 second response time
[17:10:59] (PS2) Springle: reassign db1057 to s1 [puppet] - https://gerrit.wikimedia.org/r/182692
[17:12:03] (CR) Springle: [C: 2] reassign db1057 to s1 [puppet] - https://gerrit.wikimedia.org/r/182692 (owner: Springle)
[17:35:18] !log xtrabackup clone db1061 to db1057
[17:35:24] Logged the message, Master
[17:43:03] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail
[17:47:22] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:02:13] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[18:04:22] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[18:25:32] ops-core, ops-requests, Project-Creators, operations: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#953980 (Nemo_bis) https://phabricator.wikimedia.org/tag/mail/ is hell confusing, please delete or clarify within few days.
[18:26:07] ops-core, ops-requests, Project-Creators, operations: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#953982 (Nemo_bis) In general, it seems most such tags should be renamed and prefixed with Wikimedia-, given they're terribly generic.
[18:59:06] ^demon|away: how would flattening projects solve #email-as-ops-label vs #email-as-mediawiki-extension?
[19:22:39] valhallasw`cloud: AIUI #email would be a generic tag, so if it was ops related, it would be #operations #email, or for MW it would be #mediawiki #email and you would search for both projects together
[19:23:15] So how would the #email-as-mediawiki-extension maintainer subscribe to those bugs?
[19:23:28] no idea!
[19:30:13] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[19:30:21] how would you even search for them, with phab's crippled search?
[19:31:01] say, you're a developers and you want to see bugs about email handling that you can solve, so the ones in MW core and in the extension, but not the ones in ops. good luck with that
[19:40:32] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[19:49:33] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 610.299988
[20:02:03] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[20:28:04] (PS1) Yuvipanda: beta: Rename shinken config file to make more sense [puppet] - https://gerrit.wikimedia.org/r/182699
[20:29:03] (CR) Yuvipanda: [C: 2] beta: Rename shinken config file to make more sense [puppet] - https://gerrit.wikimedia.org/r/182699 (owner: Yuvipanda)
[20:47:37] _joe_: ping
[21:39:01] <_joe_> hoo: uh, need me for something urgent?
[21:39:40] Not really urgent... but I'm (again) having error not appear in the apache logs... and other errors in the logs I can make no sense of
[21:39:45] * errors
[21:40:37] <_joe_> hoo: so it can wait tomorrow I guess :) I'll be fully back tomorrow
[21:40:43] Oh, sure :)
[21:56:33] (PS1) QChris: Make Gerrit only comment for published drafts that add new task references [puppet] - https://gerrit.wikimedia.org/r/182751
[23:43:47] ops-core, ops-requests, Project-Creators, operations: Project Proposal: Label style projects for common operations tools - https://phabricator.wikimedia.org/T1147#954244 (faidon) "Mail is gone" but with no comment here. What replaced it? Isn't Wikimedia implied? This is the Wikimedia Phabricator, after all....