[00:04:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [00:08:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [01:17:14] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1547 bytes in 0.154 second response time [02:31:19] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 12m 01s) [02:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:28] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Feb 14 02:40:27 UTC 2016 (duration 9m 8s) [02:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:49:14] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1532 bytes in 0.164 second response time [03:36:04] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [04:02:04] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [04:39:35] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: puppet fail [05:07:16] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:30:36] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:24] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: puppet fail [06:32:34] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:16] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:34] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:56:36] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:57:05] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:57:36] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:36] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:54] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:58:05] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:36] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:56] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:10:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 655 [10:20:15] RECOVERY - check_mysql on db1008 is 
OK: Uptime: 2227315 Threads: 3 Questions: 15128579 Slow queries: 14938 Opens: 5046 Flush tables: 2 Open tables: 401 Queries per second avg: 6.792 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:06:57] springle: ping [13:08:35] (03PS1) 10Hoo man: Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270583 [13:09:12] -!- idle : 31 days 10 hours 8 mins 4 secs [signon: Thu Jan 14 04:00:49 2016] [13:09:23] Whatever you need him for, probably email is better. :) [13:09:27] well, then [13:11:06] (03CR) 10Hoo man: [C: 032] Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270583 (owner: 10Hoo man) [13:11:41] (03Merged) 10jenkins-bot: Depool db1021 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270583 (owner: 10Hoo man) [13:12:26] Nemo_bis: i just responded to your bug. ensuring that OOUI widgets are no worse than native ones is really the most important goal of the library to me. [13:12:37] so please tell me how to reproduce that. :) [13:12:52] (https://phabricator.wikimedia.org/T126905) [13:14:18] !log hoo@mira Synchronized wmf-config/db-eqiad.php: Depool db1021 (duration: 01m 26s) [13:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:45] Can someone page Jaime maybe? [13:17:57] the problem should be fixed, but still that's super awry [13:31:23] (03PS1) 10Hoo man: Add --crit-stopped to check_mariadb.pl [puppet] - 10https://gerrit.wikimedia.org/r/270584 [13:36:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [5000000.0] [13:36:43] called him, he's looking [13:38:37] hey jynus [13:39:26] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=db1021 [13:39:33] https://gerrit.wikimedia.org/r/270583 [13:39:51] when it is in warning and not critical it means someone stopped it [13:39:51] exception log on fluorine was "exploding" [13:40:41] jynus: Nothing in SAL [13:42:06] yeah, it crashed [13:42:49] InnoDB: Error: Unable to read tablespace 3349 page no 62720 into the buffer pool after 100 attempts [13:42:49] InnoDB: The most probable cause of this error may be that the table has been corrupted. [13:43:43] there is not much to do, really, aside from reimage [13:44:22] I would put another host as vslow, dump, however [13:45:27] did it affect users? 
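
The depool of db1021 above (Gerrit 270583, synced out as wmf-config/db-eqiad.php) boils down to removing or commenting out the crashed replica's entry in its section's load array so MediaWiki stops routing read traffic to it. A minimal sketch of that shape follows; the host names, weights and surrounding key layout are illustrative assumptions, not the real s2 configuration.

```php
<?php
// Minimal sketch of a depool edit in wmf-config/db-eqiad.php.
// Host names, weights and the exact key layout are assumptions for
// illustration, not the real s2 section.
$sectionLoads = [
    's2' => [
        'db1018' => 0,       // master: weight 0, no general read traffic
        // 'db1021' => 100,  // depooled: crashed with InnoDB corruption
        'db1024' => 100,
        'db1036' => 100,
    ],
];
```

The edit only takes effect once it is deployed, which is why the `!log hoo@mira Synchronized wmf-config/db-eqiad.php` entry above marks the moment the depool actually lands.
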
[13:45:31] the others seemed even more unhappy [13:45:39] Bots and jobs, I guess [13:45:57] but probably not users (or only on a very few) [13:46:11] that load balancing is wrong [13:46:21] at many levels [13:46:27] db1063 is also having troubles keeping up [13:46:43] Thus exception log is occasionally still getting flooded because of s2 [13:46:49] yes, when one fails, it tries on the others [13:46:56] we need to isolate those on only one host [13:49:42] your config change was unfortunate [13:49:59] whe never mix rcs and dumps/terbium [13:50:03] please call me first [13:50:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:50:32] the problem become bigger like that [13:51:30] if you need to shoot the dumps jobs after pooling a dumps-specific db, jynus, go ahead [13:51:34] they'll just rerun later [13:51:44] wait until my commit [13:52:01] (03PS1) 10Jcrespo: vslow, dump to db1054; do not send api to db1054; 1036 only rc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270585 [13:52:45] that is not right [13:54:09] (03PS2) 10Jcrespo: vslow, dump to db1054; do not send api to db1054; 1036 only rc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270585 [13:54:35] (03CR) 10Jcrespo: [C: 032] vslow, dump to db1054; do not send api to db1054; 1036 only rc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270585 (owner: 10Jcrespo) [13:55:03] (03Merged) 10jenkins-bot: vslow, dump to db1054; do not send api to db1054; 1036 only rc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270585 (owner: 10Jcrespo) [13:56:52] apergos, we have enough redundancy to survive this, the problem is the current automatic load balancing logic is not smart enough to do it on its own [13:57:20] !log jynus@mira Synchronized wmf-config/db-eqiad.php: vslow, dump to db1054; do not send api to db1054; 1036 only rc (duration: 01m 17s) [13:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:26] gotcha [13:58:00] I need to kill some residual processes [13:58:20] figured you would [14:00:17] things should be good now [14:00:24] thank you for fixing [14:00:34] what was the thing that alerted you, to check? [14:00:38] replication lag? [14:00:41] hoo? ^^ [14:00:44] or something else [14:00:50] to check it is gone [14:01:12] jynus: Exception logs on fluorine [14:01:29] we got lucky then [14:01:34] most of those are normal [14:01:40] still occasionally happening, but no longer as bad as it used to be [14:01:46] not normal [14:01:46] because who randomly looks at exception logs on fluorine on a sunday afternoon [14:01:50] but "known" [14:02:00] I noticed because the file got so large that I couldn't really open it in vim anymore [14:02:27] apergos: I did because of a user report of s5 potential problems [14:02:27] this is a known issue: "Database is read-only: The database has been automatically locked while the slave database servers catch up to the master." 
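
jynus's follow-up change above ("vslow, dump to db1054; do not send api to db1054; 1036 only rc", Gerrit 270585) is a reshuffle of the per-section query groups: long-running, lag-tolerant traffic gets a dedicated host, and the recentchanges host is left alone. A sketch of what such a group assignment looks like, assuming the groupLoadsBySection convention of wmf-config/db-eqiad.php; the exact keys, group names and weights in the real file may differ.

```php
<?php
// Illustrative sketch only: group-to-host assignment in the style of
// groupLoadsBySection in wmf-config/db-eqiad.php. Exact keys, group names
// and weights here are assumptions.
$groupLoadsBySection = [
    's2' => [
        // Lag-tolerant, long-running traffic gets a dedicated host ...
        'vslow' => [ 'db1054' => 1 ],
        'dump'  => [ 'db1054' => 1 ],
        // ... while recentchanges stays on its own host, because
        // recentchanges queries must not be served from a lagged replica.
        'recentchanges' => [ 'db1036' => 1 ],
    ],
];
```

Keeping dumps and recentchanges on separate hosts is exactly the "we never mix rcs and dumps" rule stated above.
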
[14:02:36] ok, that is different [14:03:00] so, the problem was real (db1021 crashed) [14:03:02] hm I admit I always tail | more [14:03:05] so I would never notice that [14:03:29] but that is only since 11:00 [14:03:44] that led load balancing to send backups to other hosts [14:03:51] which can lead to replication lag [14:03:57] right [14:04:25] then I think you changed the dumps to db1036, and unfortunately, that is the only host that I would not recommend touching [14:04:39] because it hosts the recentchanges [14:05:03] and now I moved that again to db1054 [14:05:17] dumps can lag and we do not care, as they are long-running processes [14:05:25] recentchanges cannot lag [14:05:33] makes sense? [14:06:46] there seems to be corruption on enwiktionary.templatelinks and that self-crashes to secure that data [14:06:55] (on db1021) [14:07:31] ugh [14:07:53] what I do not understand is why it restarted, if preciselly that is there to avoid that [14:08:16] did it recover or it's still seeing corruption issues? [14:08:42] I do not care, once it crashes, the policy is always reimage [14:09:04] we cannot risk data corruption on production [14:09:34] so we want it not to restart regardless [14:09:39] if we're going to reimage after [14:09:43] yes [14:09:51] db1021 created issues in the past [14:10:20] possible memory issue or something else? [14:10:50] I will check the disks before reimage and putting it in production first [14:11:29] no, all points to corruption: [14:11:36] InnoDB: Error: Unable to read tablespace 3349 page no 62720 into the buffer pool after 100 attempts [14:11:36] InnoDB: The most probable cause of this error may be that the table has been corrupted. [14:12:31] which usually is good, InnoDB kills itself when data corruption is there [14:12:35] to pretect the data [14:12:59] but in this case, as we do not have "high availability" for the dumps [14:13:17] those jumped to other hosts, probably causing issues with lag [14:13:47] the solution is configure a secondary backup host that is not a recentchanges host [14:14:28] sadly, these kind of things are difficult to realize, because it seldom happens [14:15:01] another option is to let crons and dumps fail until human intervention [14:15:09] which would be safer [14:18:50] well that would halt all runs for unknown period of time [14:19:10] and the trned has been to making them more automate and more able to recover from failure [14:19:49] the secondary host shouldn't do dumps unless the first one fail though [14:19:56] *trend [14:20:03] well, the problems is that the current dump system takes a lot to reload its configuration [14:20:21] a more inteligent load balancing system would be able to handle that [14:20:54] automatically, but that is not yet done [14:21:22] there is horrible things on mediawiki for testing replica health [14:22:34] so, how can we confirm no longer more issues? 
I see no 500x requests failed, and no mediawiki errors [14:25:28] let me truncate he exception log, too [14:26:45] PROBLEM - Disk space on labvirt1008 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 91483 MB (3% inode=99%) [14:27:04] fi the current dump system tried to reload its config after a disconnection event [14:27:12] (we could conceivable do this) [14:27:23] then that would mandate having the 'secondary' in place [14:27:41] problem is not disconnect, but config change [14:28:08] I depool the server, want it to apply it in < 1 minute [14:28:19] so let's say we can do that [14:28:19] this problem happens for the job queue, too [14:28:31] then the dumps hunt around for another server to run on [14:28:37] if that 'secondary dumps' server isn't there [14:28:41] they'll pick random one [14:28:48] maybe the wrong one... [14:29:48] that depends on the config, ideally we would have something like servera => 1, serverb => 0 [14:30:12] meaning in theory, use servera, if it is unavailable, use serverb [14:30:34] or, you know, stop using the mediawiki load balancer and start using a real proxy [14:30:46] :-D [14:30:50] now there's an idea [14:31:43] I want to have lunch, but only if we have confirmation that there is no more issues [14:31:58] ^hoo ? [14:35:21] (03CR) 10Jcrespo: [C: 04-1] "Ok with the patch, not with the idea behind it (applying it to production)." [puppet] - 10https://gerrit.wikimedia.org/r/270584 (owner: 10Hoo man) [14:36:35] (03CR) 10Jcrespo: "Also, there are reasons to have a server stopped in production (lagged slaves)." [puppet] - 10https://gerrit.wikimedia.org/r/270584 (owner: 10Hoo man) [14:37:39] (03CR) 10Hoo man: "Sure, but shouldn't the alerts be ACKed then rather than making them only warnings in the first place?" [puppet] - 10https://gerrit.wikimedia.org/r/270584 (owner: 10Hoo man) [14:38:02] (03CR) 10Jcrespo: "MySQL should not have started in the first place, that is the right fix." [puppet] - 10https://gerrit.wikimedia.org/r/270584 (owner: 10Hoo man) [14:41:06] db1021 reconstructed its RAID recently- that is too much coincidence [14:44:01] hoo, there are many things missing in our monitoring: Uptime and SHOW PROCESSLIST- I do not think that is the right approach [14:44:44] see db1052 for a better example [14:45:03] the exceptions you saw were unrelated to the issues [14:46:27] in any case, I have a better patch already that would make the same effect [14:46:46] https://gerrit.wikimedia.org/r/#/c/253665/ [14:47:10] if the slave is stopped, the lag will increase (unlike now) [14:47:43] is there a ticket where all this is being discussed? [14:47:50] yes [14:47:52] because if there is I need to get on it [14:47:56] if I'm not [14:47:57] several [14:48:09] sorry, apergos [14:48:15] you mean about load balancing? [14:48:17] for what? [14:48:29] or mysql monitoring? [14:48:29] about monitoring lag [14:48:33] yeah monitoring [14:48:56] I should be interested in load balancing too but in practice my brain is full [14:48:57] https://phabricator.wikimedia.org/T114752 [14:49:09] https://phabricator.wikimedia.org/T111266 [14:49:31] subscribed [14:49:32] thanks [14:49:43] https://phabricator.wikimedia.org/T112473 [14:49:44] 7Puppet, 10Beta-Cluster-Infrastructure, 5Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2026667 (10Danny_B) NOTE: This [[ https://www.mediawiki.org/wiki/Phabricator/Project_management/Tracking_tasks | tracking task ]] should be converted to [[ http... 
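
The "servera => 1, serverb => 0" idea jynus sketches above (use servera; fall back to serverb only if servera is unavailable) can be read as a two-pass pick over the weight map. The sketch below is not MediaWiki's actual load balancer, just the stated intent expressed as a small function, with a placeholder availability check.

```php
<?php
// Sketch of the "servera => 1, serverb => 0" fallback idea described above:
// prefer positively weighted hosts, and use zero-weight hosts only when no
// preferred host is available. Illustrative only.
function pickHost( array $weights, callable $isAvailable ) {
    foreach ( $weights as $host => $weight ) {
        if ( $weight > 0 && $isAvailable( $host ) ) {
            return $host;      // normal case: the designated host
        }
    }
    foreach ( $weights as $host => $weight ) {
        if ( $weight === 0 && $isAvailable( $host ) ) {
            return $host;      // emergency fallback only
        }
    }
    return null;               // nothing usable: let the job fail and retry later
}

// Example (host names illustrative): dumps prefer db1054, fall back to db1045.
$dumpHost = pickHost(
    [ 'db1054' => 1, 'db1045' => 0 ],
    function ( $host ) { return true; }  // placeholder availability check
);
```
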
[14:50:01] awesome [14:50:05] jynus: HM [14:51:14] hoo, I do not care too much, but remember you were the first to complain about "too many unnecessary mysql alerts" [14:51:28] we need a single replication alert per host, and make it good [14:52:31] new hardware is arriving for s2 and s3 soon, so maybe that will make some of these discussion obsolete [14:52:48] more capacity, less lag [14:55:05] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: puppet fail [15:17:56] 6operations, 5Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2026727 (10Danny_B) NOTE: This [[ https://www.mediawiki.org/wiki/Phabricator/Project_management/Tracking_tasks | tracking task ]] should be converted to [[ https://... [15:19:37] 6operations, 10Traffic, 6Zero, 3Mobile-Content-Service, and 2 others: Send X-Carrier + X-Carrier-Meta headers on all responses - https://phabricator.wikimedia.org/T126053#2026732 (10Danny_B) [15:20:27] 6operations: decom erbium/gadolinium (was: Reinstall erbium with jessie) - https://phabricator.wikimedia.org/T123722#2026733 (10Danny_B) [15:20:44] 6operations: decom protactinium (was: Reinstall protactinium with jessie) - https://phabricator.wikimedia.org/T123720#2026734 (10Danny_B) [15:20:56] 6operations: Reinstall nitrogen with jessie - https://phabricator.wikimedia.org/T123715#2026735 (10Danny_B) [15:23:16] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:30:08] (03CR) 10MarcoAurelio: "Scheduled for SWAT ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269970 (https://phabricator.wikimedia.org/T126604) (owner: 10MarcoAurelio) [15:36:02] any admin at wikitechwiki here? [15:36:09] there's a vandal over there [15:36:11] need block [15:39:01] ostriches: ^ [15:40:35] Nemo_bis is taking care, thx. [15:49:16] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 53 failures [15:57:10] 7Puppet, 10Beta-Cluster-Infrastructure, 5Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2026956 (10Aklapper) >>! In T86644#2026667, @Danny_B wrote: > NOTE: This [[ https://www.mediawiki.org/wiki/Phabricator/Project_management/Tracking_tasks | track... [16:02:32] 6operations, 5Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2026972 (10Aklapper) @Danny_B: I'm surprised about the "should" in that sentence - it's news to me. If anyone wants a goal project and would like to have a workboa... [16:15:21] 7Puppet, 10Beta-Cluster-Infrastructure, 5Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#2026991 (10Danny_B) >>! In T86644#2026956, @Aklapper wrote: >>>! In T86644#2026667, @Danny_B wrote: >> NOTE: This [[ https://www.mediawiki.org/wiki/Phabricator/... 
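
Returning to the monitoring thread above (the --crit-stopped review and "we need a single replication alert per host"): the idea behind Gerrit 253665, as jynus describes it, is that a stopped slave should show ever-growing lag rather than dropping out of the lag metric. A sketch of that, using SHOW SLAVE STATUS field names; the surrounding check logic is an assumption, not the actual check_mariadb code.

```php
<?php
// Sketch of the single-lag-figure idea discussed above: if replication is
// stopped, report lag that keeps growing with wall-clock time instead of
// NULL, so one threshold covers both "slow" and "stopped". Field names
// mirror SHOW SLAVE STATUS; everything else is an illustrative assumption.
function effectiveLag( array $slaveStatus, $lastCaughtUpTimestamp, $now ) {
    $running = $slaveStatus['Slave_IO_Running'] === 'Yes'
        && $slaveStatus['Slave_SQL_Running'] === 'Yes';
    if ( $running && $slaveStatus['Seconds_Behind_Master'] !== null ) {
        return (int)$slaveStatus['Seconds_Behind_Master'];
    }
    // Stopped or broken replication: measure from the last time the slave
    // was known to be caught up, so the figure keeps climbing.
    return $now - $lastCaughtUpTimestamp;
}
```
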
[16:15:35] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:20:25] 6operations, 10Incident-20150205-SiteOutage, 10MediaWiki-Debug-Logger, 6Reading-Infrastructure-Team, and 2 others: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#2027013 (10Danny_B) [16:22:52] (03PS1) 10Andrew Bogott: Use keystone v3 api for horizon [puppet] - 10https://gerrit.wikimedia.org/r/270593 [16:48:15] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:26] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:34] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:04] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:14] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:50:36] PROBLEM - puppet last run on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:54:06] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:25] <_joe_> !log powercycling mw1140, stuck in OOM, no ssh no console login [16:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:55] PROBLEM - nutcracker process on mw1140 is CRITICAL: Timeout while attempting connection [16:56:25] PROBLEM - HHVM processes on mw1140 is CRITICAL: Timeout while attempting connection [16:57:34] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [16:57:35] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:58:04] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [16:58:35] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full [16:58:44] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 500 bytes in 4.582 second response time [16:58:46] RECOVERY - DPKG on mw1140 is OK: All packages OK [16:58:46] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 66085 bytes in 3.095 second response time [16:58:55] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed [16:59:06] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 44 minutes ago with 0 failures [17:19:00] (03PS1) 10Andrew Bogott: keystone policy changes, in progress [puppet] - 10https://gerrit.wikimedia.org/r/270597 [17:36:55] PROBLEM - restbase endpoints health on aqs1001 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [17:38:44] RECOVERY - restbase endpoints health on aqs1001 is OK: All endpoints are healthy [17:57:35] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 1 failures [18:02:44] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 15 failures [18:23:44] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [18:28:45] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:58:15] PROBLEM - HHVM rendering on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:58:36] 
PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:59:45] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:45] PROBLEM - configured eth on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:45] PROBLEM - HHVM processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:45] PROBLEM - dhclient process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:16] PROBLEM - puppet last run on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:26] PROBLEM - SSH on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:00:34] PROBLEM - salt-minion processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:55] PROBLEM - DPKG on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:56] PROBLEM - Check size of conntrack table on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:25] PROBLEM - nutcracker process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:35] RECOVERY - HHVM processes on mw1119 is OK: PROCS OK: 6 processes with command name hhvm [19:01:36] RECOVERY - dhclient process on mw1119 is OK: PROCS OK: 0 processes with command name dhclient [19:05:36] PROBLEM - nutcracker port on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:36] PROBLEM - Disk space on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:55] PROBLEM - HHVM processes on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:55] PROBLEM - dhclient process on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:14] RECOVERY - Check size of conntrack table on mw1119 is OK: OK: nf_conntrack is 0 % full [19:20:25] RECOVERY - Disk space on mw1119 is OK: DISK OK [19:20:25] RECOVERY - nutcracker process on mw1119 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:20:44] RECOVERY - dhclient process on mw1119 is OK: PROCS OK: 0 processes with command name dhclient [19:20:45] RECOVERY - HHVM processes on mw1119 is OK: PROCS OK: 6 processes with command name hhvm [19:20:45] RECOVERY - configured eth on mw1119 is OK: OK - interfaces up [19:20:45] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed [19:21:06] RECOVERY - nutcracker port on mw1119 is OK: TCP OK - 0.000 second response time on port 11212 [19:21:14] RECOVERY - SSH on mw1119 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [19:21:16] RECOVERY - salt-minion processes on mw1119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:21:36] RECOVERY - DPKG on mw1119 is OK: All packages OK [20:18:32] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2027410 (10mobrovac) 5Open>3stalled I'm setting this task to stalled, as this is only relevant for edge cases where we execute `npm install` directly on the hosts (some testing hosts and CI). I don't think there is much va... [20:21:26] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1545 bytes in 0.139 second response time [20:24:04] 6operations, 10MediaWiki-Authentication-and-authorization, 5MW-1.27-release-notes, 5Patch-For-Review: ~3000% increase in session redis memory usage, causing evictions and session loss - https://phabricator.wikimedia.org/T125267#2027416 (10Tgr) >>! 
In T125267#1983529, @Anomie wrote: > We note that the basel... [21:26:36] (03PS1) 10Merlijn van Deen: toollabs: install inkscape on exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/270638 (https://phabricator.wikimedia.org/T126933) [21:30:24] PROBLEM - Disk space on cp3040 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=87%) [22:20:35] (03PS1) 10MarcoAurelio: New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) [22:28:01] PROBLEM - mysqld processes on labsdb1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:29:21] PROBLEM - MariaDB disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [22:29:22] PROBLEM - Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error [22:30:45] <_joe_> uhm this doesn't look good [22:30:54] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [22:31:34] <_joe_> jynus: broken disk i'd say [22:31:39] [21555223.116628] XFS (dm-1): xfs_log_force: error 5 returned. [22:31:46] or xfs bug [22:31:48] I'me checking [22:32:01] <_joe_> [21554975.965623] scsi 7:0:1:0: [sdc] Unhandled error code [22:32:03] it is mounted [22:32:05] <_joe_> disk error [22:32:16] <_joe_> it's ro [22:32:19] <_joe_> because errors [22:32:21] ah [22:32:40] <_joe_> and you cannot read it [22:32:48] <_joe_> so, yes, disk busted [22:33:02] let's redirect queries to the other servers [22:33:22] <_joe_> how is that done (for future reference)? [22:33:23] and set stronger limitations to queries [22:33:29] (03CR) 10Luke081515: [C: 031] New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [22:33:50] some strange sc [22:33:55] ript on labs [22:34:12] <_joe_> ook :P [22:34:41] <_joe_> i'm here if you need a pair of hands [22:35:01] me too [22:35:07] can you double check the disk meanwhile (if there is something to do there) [22:35:14] RAID, etc. [22:35:49] <_joe_> no raid AFAICS [22:36:28] <_joe_> and yeah, the disk is on lvm and dead [22:36:34] <_joe_> but just the partition [22:37:11] it doesn't matter, no partition, no service [22:37:56] <_joe_> yeah so, one of the disks in the pv is gone [22:38:06] ouch [22:38:09] <_joe_> "unkown device" in pvdisplay [22:38:21] <_joe_> so we had an "informal raid0" there [22:38:32] (03PS1) 10MarcoAurelio: Termporary lift of IP cap for an Edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270646 (https://phabricator.wikimedia.org/T126939) [22:38:50] please log that so I can later ask for the proper replacements, etc. 
[22:39:00] (I will lose it if not) [22:40:36] PROBLEM - puppet last run on labsdb1002 is CRITICAL: CRITICAL: Puppet has 4 failures [22:40:41] <_joe_> !log labsdb lvm partition /srv unavailable because /dev/sdc is apparently broken [22:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:40:53] that's all [22:41:14] I am still looking for the file, I do not manage it, so I do not have it handy [22:43:09] <_joe_> page andrewbogott in case [22:43:17] <_joe_> he should have been paged btw [22:43:18] I’m here [22:43:21] <_joe_> hey [22:43:25] were were in -labs talking about the same things [22:43:26] perfect timing [22:43:29] ah [22:43:43] apparently I don’t get pages until 5-7 hours after they happen, though, so generally calling me is a good idea [22:43:44] <_joe_> chasemp: bad choice given everyone in ops gets paged [22:43:50] I do not know where on labs you manage the hosts [22:44:15] <_joe_> so, what was on /srv and what on /srvuserdata? [22:44:38] /srv all database content [22:44:42] jynus: I don’t immediately know, can you tell me more about what I’m looking for? [22:44:43] <_joe_> shit [22:44:50] which dbs are on 1002? [22:44:51] <_joe_> ok so I can try to reduce the pv [22:45:17] <_joe_> but well, chasemp what's your take? [22:46:00] <_joe_> we could try to unount the FS and remount it [22:47:27] <_joe_> if that doesn't work (I expect not) we can try to a) check if the disk is indeed broken as it appears; if yes b) pvremove it and pray xfs is forgiving (it's just 1/10th of the PV anyways) if not c) try to rebuild the metadata [22:48:15] _joe_: unmount/remount sounds good to me… agreed that it’s likely to fail [22:48:17] andrewbogott, all of them, they just redirect to one host or another based on a list [22:48:19] but won’t make things worse [22:48:28] (03PS1) 10MarcoAurelio: Cleanup: removing expired event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270648 [22:49:15] first time I've logged in to this server :) seems like we can't count on the disk so work around it or count the box out for now [22:49:38] <_joe_> andrewbogott: just killed a bach of yours [22:49:50] <_joe_> chasemp: yeah count it out [22:49:52] _joe_: no worries [22:49:56] <_joe_> a bash, ofc [22:50:17] I assume db1001/1003 won't handle the load by themselves? [22:50:25] <_joe_> I'm trying to unmount the disk [22:50:31] RECOVERY - MariaDB disk space on labsdb1002 is OK: DISK OK [22:50:33] RECOVERY - Disk space on labsdb1002 is OK: DISK OK [22:50:33] yes they will [22:50:38] (03PS2) 10MarcoAurelio: Cleanup: removing expired event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270648 [22:50:43] good to know [22:50:44] <_joe_> that's stupid ofc [22:50:48] <_joe_> the recovery [22:50:54] check-raid.py reports as ok so that's fun [22:50:55] no worse than the 3 previous [22:51:47] (03PS1) 10Andrew Bogott: Remove labsdb1002 from the mysql server list for toollabs. [puppet] - 10https://gerrit.wikimedia.org/r/270649 [22:52:03] jynus: I /think/ that ^ is what you were looking for [22:52:09] it’s sure in a strange place though [22:52:12] no it is not [22:52:19] <_joe_> chasemp: check-raid on a server that has... no raid [22:52:26] for userdata it will [22:52:29] what about the rest? [22:52:35] it is a list of wikis that create the hosts files [22:53:22] bah, ok [22:53:38] (03Abandoned) 10Andrew Bogott: Remove labsdb1002 from the mysql server list for toollabs. 
[puppet] - 10https://gerrit.wikimedia.org/r/270649 (owner: 10Andrew Bogott) [22:53:42] modules/mysql_wmf/templates/skrillex.yaml.erb ? (I"m just grepping [22:53:43] ) [22:53:48] <_joe_> the umount is stuck ofc [22:53:50] so maybe this is not puppetized. [22:54:01] modules/role/manifests/labs/dnsrecursor.pp [22:54:01] [22:54:07] jynus has the right one [22:54:10] ah ha [22:54:13] that is the one, found it through old commits [22:54:25] I do not handle that, coren did [22:54:27] # There are three replica servers (c1, c2, c3). The mapping of [22:54:27] # "shards" (s1, etc.) and databases (enwiki, etc.) to these is [22:54:27] # arbitrary and can be adjusted to depool a server or redistribute [22:54:27] # load. [22:55:33] I would say s2 to 1001 [22:55:42] s4 and 5 to 1003 [22:55:50] modules/role/templates/labs/dns/db_aliases.erb could be edited to move all c2 dbs to the 1001 IP [22:56:22] 1 and 3 seems to have headroom for it [22:56:30] Database on labs is offline Phab:T126942 [22:56:30] T126942: labs -Database is offline - https://phabricator.wikimedia.org/T126942 [22:56:31] but yeah, maybe best split them up [22:56:35] <_joe_> unmounting and remounting don't work [22:57:18] let me write the patch [22:57:24] ok [22:57:24] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: labs -Database is offline - https://phabricator.wikimedia.org/T126942#2027582 (10Krenair) known, being worked on. one of the DB server's drives has failed [22:57:52] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: labs -Database is offline - https://phabricator.wikimedia.org/T126942#2027585 (10ArielGlenn) labsdb1002 has a broken disk which resulted in the partition with the db being off line. Folks will be redirecting queries to other servers shortly. [22:59:31] (03PS1) 10Jcrespo: Depool labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/270650 (https://phabricator.wikimedia.org/T126942) [22:59:32] I should never complain about mediawiki db depooling again [23:00:09] check that, I do not give it too much thought [23:00:28] (03PS2) 10Rush: Depool labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/270650 (https://phabricator.wikimedia.org/T126942) (owner: 10Jcrespo) [23:01:19] (03CR) 10Rush: [C: 031] "I don't have a more nuanced thought but we can shuffle further if needed" [puppet] - 10https://gerrit.wikimedia.org/r/270650 (https://phabricator.wikimedia.org/T126942) (owner: 10Jcrespo) [23:01:38] (03CR) 10Andrew Bogott: [C: 032] Depool labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/270650 (https://phabricator.wikimedia.org/T126942) (owner: 10Jcrespo) [23:01:52] looks sane [23:02:21] heavy with wikidata over there but it's got to go somewhere [23:02:22] s1 is usually very busy, that is the easiest sanest rebalancing I can think of [23:02:27] yep [23:02:32] gotcah [23:02:41] applying on the dns host... [23:02:45] apply that whatever that goes [23:03:04] (which is labservices1001) [23:03:12] c2 will not work, that is intended c2 means == labsdb1002 [23:03:31] what does the 'c' actually mean? [23:03:41] s2 and commonswiki will work [23:04:06] I do not know, but c is for phyisical hosts, for those that have local writes [23:04:20] krenair@tools-bastion-01:~$ host s2.labsdb [23:04:21] s2.labsdb has address 10.64.37.4 [23:04:21] krenair@tools-bastion-01:~$ host s2.labsdb [23:04:21] s2.labsdb has address 10.64.4.11 [23:04:22] it should not be used for access replication [23:04:24] clone? 
all I could guess [23:05:11] of course locally written databases will be lost, but that is also intended, and highly advised- only scratch data there, no backups are taken [23:05:25] hm, replag tool is still unhappy [23:05:28] if you are hosting production data there, it is your fault [23:05:37] andrewbogott: you want to !log that change? [23:06:17] !log moved tools dbs off of labsdb1002 via https://gerrit.wikimedia.org/r/#/c/270650/ [23:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:23] thx [23:07:22] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: labs -Database is offline - https://phabricator.wikimedia.org/T126942#2027606 (10Krenair) [23:07:37] I restarted the replag tool and /now/ it is happy [23:07:43] works for me http://quarry.wmflabs.org/query/7383 [23:07:45] awesome [23:08:16] Suppose that means that I should send a 'restart your tool if you are in this list’ email? [23:08:18] where is that tool and how is it restarted, for reference? [23:08:23] oh I see [23:08:25] all tools [23:08:26] yeah [23:08:45] only reload your connections, really [23:08:57] but for most of those tools I bet that's a restart [23:09:23] jynus, I don't think quarry tests this [23:09:44] Krenair, what do you mean? [23:10:15] I've queried a previously- labsdb1002 hosted database (commons and wikidata) [23:10:26] Well, from quarry-main-01.quarry.eqiad.wmflabs:/srv/quarry/quarry/config.yaml [23:10:33] REPLICA_HOST: 'enwiki.labsdb' [23:10:49] i.e. labsdb1001 always [23:10:50] ah [23:10:55] so hardcoded wrongly [23:10:57] mmmeeehhh [23:10:59] well [23:11:19] send a ticket to yubi, or yourself :-) [23:11:24] *yuvi [23:11:33] I'm working on something related [23:11:33] well [23:11:36] I worked on something related [23:11:41] can someone else check? [23:11:41] and am waiting for Yuvi to approve it [23:12:43] basically quarry just goes to one DB host and runs the user's query there, the user has to explicitly change to another DB as part of their query [23:12:57] and sanitarium is broken too, to add some fun [23:13:26] there is no DB input field for it to choose the right DB beforehand [23:13:40] and you'd still want to allow joining across DBs I guess [23:13:41] w/o quarry test are we satisfied the replicas as moved off for tools? [23:14:58] once https://gerrit.wikimedia.org/r/#/c/266925/ goes in we may be able to improve things in this aspect jynus [23:15:30] andrewbogott: are you able to restart https://tools.wmflabs.org/templatecount/ that one? [23:15:34] 6operations, 10ops-eqiad, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2027631 (10chasemp) 3NEW a:3Cmjohnson [23:15:36] that's the one reported in the ticket [23:15:54] apergos: one moment... [23:16:30] apergos: ok, done, I think [23:16:34] testing [23:16:39] yeah, seems better [23:16:48] error gone [23:16:48] jynus, this probably contributes to labsdb1001 always being busy [23:16:54] andrewbogott: seems good [23:16:58] got a number of results, and that's de which was moved [23:17:00] awesome [23:17:40] Krenair, I do not think quarry is a real problem in terms of load [23:17:50] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: labs -Database is offline - https://phabricator.wikimedia.org/T126942#2027645 (10chasemp) 5Open>3Resolved a:3chasemp this should be worked around now with https://gerrit.wikimedia.org/r/#/c/270650/ replag and templatecount have... [23:18:10] these aliases are for 1h I think? 
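
The labs-replica depool above works differently from the production one earlier in the day: instead of a MediaWiki load array, tools connect to shard aliases (s1.labsdb, s2.labsdb, ...) that are plain DNS names managed in modules/role/templates/labs/dns/db_aliases.erb, and Gerrit 270650 repoints the aliases that lived on labsdb1002 (c2). The real change is an ERB template, not PHP; purely as an illustration of the resulting mapping, and assuming the merged patch followed jynus's "s2 to 1001, s4 and 5 to 1003" split:

```php
<?php
// Illustration only: the real mapping is a puppet ERB template feeding the
// labs DNS recursor, not PHP. Which shards lived on labsdb1002 and where
// each one ended up is taken from the discussion above and may not match
// the merged patch exactly.
$shardAliases = [
    's2.labsdb' => 'labsdb1001',   // was labsdb1002 (c2) before the depool
    's4.labsdb' => 'labsdb1003',   // was labsdb1002 (c2) before the depool
    's5.labsdb' => 'labsdb1003',   // was labsdb1002 (c2) before the depool
];
```

Because these aliases are DNS records with a TTL (the "1h" question above), long-lived clients also have to re-resolve and reconnect before they reach the new hosts, which is why tools holding open connections needed a restart.
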
[23:18:13] !log repaired table and restarted replication on sanitarium (s3) [23:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:22] so now it's getting the other tools restarted/reloaded [23:19:00] Here’s my email draft for labs-announce: https://dpaste.de/GrnQ [23:19:16] jynus, mind reading and confirming that it’s accurate? [23:19:21] (or anyone, really) [23:19:55] omg, worst possible syntax highlighting on dpaste, sorry [23:19:55] seems good, you did say load moved to other web servers which is confusing tho [23:19:56] to different web servers, andrewbogott? [23:20:07] to different db servers [23:20:07] oops [23:20:19] s/web servers/servers/ [23:20:22] ah you were there first Krenair [23:20:27] :) [23:20:52] yep looks fine to me after that [23:21:00] lgtm [23:21:10] I would like to be more specific in the ‘might want to consider…' bit [23:21:18] but I guess it’s not really possible to be specific [23:21:30] +1 [23:21:58] they'll get it from the context [23:22:00] it's fine [23:22:02] ok, sent [23:22:32] andewbogott giftbot/weblinksuche has not restartet: https://tools.wmflabs.org/giftbot/weblinksuche.fcgi?target=https%3A%2F%2Fweb.archive.org%2Fweb%2F20%25&namespace=0 [23:23:13] boshomi: restarting giftbot... [23:23:15] better? [23:24:04] better :-) thanks [23:24:28] I can write the outage report unless someone else is excited about doing that. [23:24:57] "thanks for volunteering"? :-D [23:25:06] it’ll be short [23:25:40] andrewbogott: https://phabricator.wikimedia.org/T126946 fyi [23:36:55] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1538 bytes in 0.173 second response time