[00:05:42] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: puppet fail [00:06:31] (03PS1) 10Dzahn: create sslcert::letsencrypt::simple, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) [00:08:07] (03PS2) 10Dzahn: create sslcert::letsencrypt::simple, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) [00:09:24] (03CR) 10jenkins-bot: [V: 04-1] create sslcert::letsencrypt::simple, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [00:09:42] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211618 (10Dzahn) >Package/backport acme-tiny and install via puppet (it's so tiny we could even just throw the file in puppet for a first draft). ye... [00:14:50] (03PS3) 10Dzahn: create sslcert::letsencrypt::simple, install acme-tiny [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) [00:16:34] YuviPanda: we already check SSH connectivity with check_ssh in base.. you can just use that [00:17:11] check_command = 'check_ssh' i mean [00:31:59] (03PS1) 10Dzahn: add role for hosts with LE certs, add on carbon [puppet] - 10https://gerrit.wikimedia.org/r/283763 (https://phabricator.wikimedia.org/T132812) [00:32:16] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:02:15] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 75.88 ms [01:59:04] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused [02:05:32] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2211772 (10mmodell) @jcrespo: Phabricator database clustering support is now documented: https://secure.phabricator.com/book/phabricator/article/cluster_data... [02:14:07] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-04-18 02:13:31. [02:16:27] !log Bootstraping restbase1009-a.eqiad.wmnet : T95253 [02:16:29] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [02:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:22] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 09m 03s) [02:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:31:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Apr 16 02:31:33 UTC 2016 (duration 9m 11s) [02:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:51:58] 06Operations, 10OTRS, 07Upstream: Investigate OTRS 5.0.6 memory leak - https://phabricator.wikimedia.org/T126448#2211787 (10Peachey88) >>! In T126448#2015605, @akosiaris wrote: > Filed in http://bugs.otrs.org/show_bug.cgi?id=11864 Has this been looked at since then?, Could this be contributing to {T132822}? 
[05:10:47] (03CR) 10Legoktm: [C: 04-1] "Please include the licensing information for acme-tiny" [puppet] - 10https://gerrit.wikimedia.org/r/283761 (https://phabricator.wikimedia.org/T132812) (owner: 10Dzahn) [05:43:06] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.38 seconds [05:44:57] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.31 seconds [05:45:57] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 342.08 seconds [05:46:28] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 358.75 seconds [05:50:47] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.92 seconds [05:53:46] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 13.73 seconds [05:54:17] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [05:54:37] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.25 seconds [05:54:38] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.34 seconds [06:04:38] ^^^ looking [06:08:26] PROBLEM - MariaDB Slave Lag: s3 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.98 seconds [06:22:07] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [06:22:50] that's strange... all the others recovered by themselves, dbstore2002 was continue to increase the lag [06:23:47] !log Stopped and restarted replica on dbstore2002 for s3 to "unstuck" the replica [06:23:47] RECOVERY - MariaDB Slave Lag: s3 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 16.89 seconds [06:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:08] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:06] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:06] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:26] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:37] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 3 failures [06:38:23] s3 lag seems recovered, to be investigated more later, from tendril graph my guess is that there was a single update on many rows, that from db2018 below to it's child got expanded due to row based replication [06:44:40] !log restart fermium to apply extra vcpus assignment [06:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:44:47] (03CR) 10Elukey: Override kafkatee's default logrotate/rsyslog configuration. 
(033 comments) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [06:48:11] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:50:28] 06Operations, 10Traffic: cronspam from cpXXXX hosts due to /usr/local/sbin/update-ocsp-all - https://phabricator.wikimedia.org/T132835#2211794 (10elukey) [06:50:40] 06Operations, 10OTRS, 07Upstream: Investigate OTRS 5.0.6 memory leak - https://phabricator.wikimedia.org/T126448#2211807 (10akosiaris) @Peachey88, yes it has been quite effectively mitigated in apache (MaxConnectionsPerChild 2000). OTRS memory usage is not averaging at 3.5-3.9GB. But it is not solved nonet... [06:52:34] 06Operations, 10Traffic: cronspam from cpXXXX hosts due to update-ocsp-all and zero_fetch - https://phabricator.wikimedia.org/T132835#2211808 (10elukey) [06:55:38] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211811 (10akosiaris) Oh it is OTRS. There are multiple apache processes consuming close to 100% of CPU time. stracing returns nothing as well as ltrace. This is starting to l... [06:56:00] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:56:30] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:42] RECOVERY - puppet last run on sca1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:56:42] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:57:11] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:41] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:00] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:01:46] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211812 (10akosiaris) I 've killed multiple ones to restore service performance. I 've left a couple running to investigate more. [07:10:02] (03PS4) 10Elukey: Override kafkatee's default logrotate/rsyslog configuration. [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) [08:11:16] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.hitcounter doesnt exist on query. Default database: information_schema. Query: DELETE FROM bawiktionary.hitcounter [08:47:53] PROBLEM - MySQL Slave Running on db1038 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table kshwiki._counters doesnt exist on query. 
Default dat [08:53:07] * volans looking ^^^ [08:56:08] hi jynus, sorry to bother, so for dbstore1002 is trying to execute a delete on bawiktionary.hitcounter but that schema is not there at all [08:56:30] we do not care about that [08:56:43] someone wrote to db2018 [08:56:57] for db1038 yes looks like [08:57:24] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211838 (10akosiaris) Documenting what I 've seen up to now. Choosing 20928 process `strace -fp 20928` returns nothing `ltrace -fp 20928` returns nothing This is starting... [08:58:28] db2018 is in read only mode, so someone with root wrote to db2018 [08:58:35] the _counters table is engine memory [08:59:47] Query: 'DELETE FROM `kshwiki`.`_counters`' [09:00:15] 15 minutes ago [09:00:48] yes, whithout where [09:01:00] 160416 8:42:26 UTC [09:01:12] could it be an event, maybe with no disabled binlog? [09:03:15] show events\G in kshwiki is empty [09:07:35] it could be an event on the intermediate master, which one is it? [09:08:07] db1038 -> db1075 -> db2018 -> db1038 [09:09:04] yes, counters is in db1075 [09:09:49] but also in db1044 [09:09:54] another slave of s3 [09:10:06] I can check all of them [09:11:11] so there must be some kind of mechanism that writes to the slaves [09:11:18] but does not disable the binlog [09:12:06] quite rare or it should have been surfaced before [09:12:25] since you set the circular replica 2~3 weeks ago [09:13:11] no, it happened before when I setup the intermediate slaves [09:13:36] all of codfw got broken due to the events on the ops database [09:13:58] I later disabled all those events' binlog (I hate events) [09:14:42] it has to be a trigger or an event, otherwise it would not have been copied to the new servers [09:17:02] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211383 (10pajz) Would it make sense to try to trace this down to the affected ticket? I just googled the text bit ("Saudi Arabia which faces inevitable military, politica") a... [09:17:43] I cannot find any [09:18:02] memory table and replication are tricky... https://dev.mysql.com/doc/refman/5.6/en/replication-features-memory.html (5.7 looks the same) [09:18:07] the 4 events on the slaves have the binlog disabled [09:18:38] could it be related to statement/mixed and row based replication with the intermediate slave? [09:18:47] mmm [09:19:11] did you touch db1075 recently, like restart it? [09:19:22] yesterday, db2018 too.. for the TLS [09:19:33] buy yesterday, not 15 minutes ago [09:19:39] surely not :) [09:19:59] and according to this when I restarted it it should have sent the same delete [09:20:06] that yesterday didn't failed [09:20:17] ah, "the first time that the master uses a given MEMORY table after startup" [09:20:34] so it is not on startup, but after use [09:21:14] true [09:21:37] so here is the thing, I have not idea what touches this [09:21:44] but why the table is not on the master db1038? 
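The MEMORY-table behaviour linked above is the crux: the server logs an implicit DELETE for a MEMORY table the first time that table is touched after a restart, so every restart of an intermediate master can push such events down the replication chain. A quick way to audit which tables can do this is to list every MEMORY-engine table on the hosts in the chain. A rough sketch, not an existing WMF script — the host list and credential path are placeholders:

```python
#!/usr/bin/env python
# Sketch: list every MEMORY-engine table on each host in the s3 chain, since
# any of them can emit an implicit "DELETE FROM ..." binlog event the first
# time it is touched after a server restart.
import pymysql

HOSTS = ['db1038.eqiad.wmnet', 'db1075.eqiad.wmnet', 'db2018.codfw.wmnet']  # example s3 chain

QUERY = """
    SELECT table_schema, table_name
      FROM information_schema.tables
     WHERE engine = 'MEMORY'
     ORDER BY table_schema, table_name
"""


def memory_tables(host):
    # Credentials are assumed to come from a client config file on the host
    # running the audit.
    conn = pymysql.connect(host=host, read_default_file='/root/.my.cnf')
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY)
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == '__main__':
    for host in HOSTS:
        for schema, table in memory_tables(host):
            print('%s\t%s.%s' % (host, schema, table))
```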
[09:22:02] but if all slaves are in read only mode, it must be us [09:22:07] maybe tendril [09:22:21] maybe cron [09:22:21] is an engine memory, doesn't looks problematic to me to add it, so I don't get why only on the master is missing [09:22:28] I'll check the other shards [09:22:31] no, we must understand [09:22:46] it is probably an ops process [09:22:52] and server dependent [09:22:59] want to check if on the other masters is there or not [09:23:08] and we must force binlog disabling [09:23:37] good news is that we have performance schema on db1075 [09:24:59] ok great, dbstore1002 can wait? it's lag will be larger, if it's only a skip I'll skip, but there too I don't know why is missing a DB that is in the replicate_do filter [09:28:29] no, it cannot wait [09:29:08] but the first issue was more important [09:29:17] agree [09:29:18] I wouls skip that event [09:29:46] ok I skip it now [09:30:33] I am going to import sys locally to db1075 to understand where that came from (who uses that table) [09:30:43] ok [09:31:22] !log importing sys schema to db1075 [09:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:46] !log Skipped on dbstore1002 query from replica: 'DELETE FROM `bawiktionary`.`hitcounter`' [09:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:32:42] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:32:57] so far so good, no other errors [09:33:25] oh, I meant the master error, not that one [09:34:57] ops... I though you were replying to me, but what else we could have done there? the whole schema bawiktionary is not there [09:35:10] is it a memory table, too? [09:35:17] because then it could be related [09:35:31] it failed again in another hitcounter table [09:35:33] checking [09:35:37] probably related then [09:36:18] yes memory [09:36:36] the hitcounter table, as far as I know is not used [09:37:00] it's strange dbstore1002 is breaking only on DB it doesn't have... [09:37:09] could help us to find from where they are coming? [09:37:19] chwikimedia.hitcounter this time [09:37:26] it is the memory tables, that run delete on usage [09:37:53] ok, but why didn't happened with all the other shards? [09:38:08] becase we have restarted db1075 [09:38:16] I restarted intermediated slave and codfw master for s1, s3-s7 [09:38:17] s3 [09:38:32] during the week for the TLS stuff [09:38:36] are those tables on the other shards? [09:38:41] are those tables used? [09:38:52] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table chwikimedia.hitcounter doesnt exist on query. Default database: information_schema. 
Query: DELETE FROM chwikimedia.hitcounter [09:38:53] because in theory, none of these tables should be in use [09:39:00] ok [09:39:08] it could be a configuration problem [09:39:17] checking if they are there to start with [09:39:18] or a configuration exclusive to s3 [09:39:59] the only different thing I've done for s3, operation wise, is that we changed the designated masters [09:40:46] do not discard a configuration change + restart [09:42:09] just wanted to put all the info on the table [09:42:24] or some extension wanting to read from that table, even if data is not there [09:44:07] s4.commonswiki: _counters and hitcounter exists both in the master and intermediate master on eqiad, empty [09:44:49] so here is the thing, if it exists on the master, it must exist on the slave [09:44:57] not necessarily the other way round [09:45:10] and on db1038 is not there [09:45:40] I can see it there: my -h db1038 -e "SHOW CREATE TABLE bawiktionary.hitcounter" [09:45:57] when? [09:46:03] now [09:46:24] maybe was me if few minutes ago [09:46:29] which server? [09:46:51] if it should not exist on one slave, it should have a replication filter [09:47:02] ACKNOWLEDGEMENT - MySQL Slave Running on db1038 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Table kshwiki._counters doesnt exist on query. Default dat Volans Working on it, see SAL [09:47:11] that's me ^^^ [09:50:17] so looking at the relay log, the master is receiving deletes from all hitcounters [09:50:47] at this point kinda expected [09:51:14] so, given that it is not in use, I would create it empty [09:51:30] because it should only send the deletes [09:51:41] so from what I'm reading around, MEMORY tables don't play nice with replication so it's better to not replicate them at all, but given that we have it on other masters I agree let's create it [09:51:45] for all the wikis? [09:52:09] yeah, but I prefer not to create filters, that equaly do not work well with replication [09:52:22] we do not use the table, so we do not care about the contents [09:52:26] only about replication [09:52:33] yep, s1 too has those tables in the master db1052 [09:52:41] so, there it is [09:52:59] we will receive deletes every time that some server in the replication chain restarts [09:53:04] but that is ok [09:53:31] I'll continue to check the other shards to be sure we don't hit this problem, although I guess they are ok given that I restarted them earlier [09:53:33] technically the application is not replicating that table, only the servers restart are [09:54:32] and we can open a task to audit them and drop them if really not used [09:56:02] ACKNOWLEDGEMENT - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table chwikimedia.hitcounter doesnt exist on query. Default database: information_schema. 
Query: DELETE FROM chwikimedia.hitcounter Volans Working on it, see SAL [09:57:12] actually, we should open a more general one containing https://phabricator.wikimedia.org/T132416 and more [09:57:13] s2 has only hitcounter, but the codfw master too, both are missing _counters [09:57:33] at least on some random wiki I'm checking [09:57:39] I am not sure if counters is needed by mediawiki [09:58:16] ideally we should sync with mediawiki's HEAD [09:58:53] https://github.com/wikimedia/mediawiki/search?utf8=%E2%9C%93&q=_counters is empty [09:59:22] hitcounter exists: https://github.com/wikimedia/mediawiki/search?utf8=%E2%9C%93&q=hitcounter [09:59:50] but I see a drop [10:00:02] yeah, _counters sounds like an ops-only table [10:00:03] in 1.25 and 1.26 [10:00:09] ah! [10:00:20] for hitcounters [10:00:23] not _counters [10:00:37] that is not there at all, looks like an ops stuff given also the underscore [10:00:38] look at the latest version of tables.sql on mediawiki [10:00:58] I named mine __wmf_checksums, that seems more clear :-) [10:01:21] I am preparing the creation anyway [10:01:47] is not there: https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables.sql [10:01:56] nor _counters nor hitcounters [10:01:57] but whoever had the idea of creating it as memory... [10:02:45] in the old tables.sql was ENGINE=HEA [10:02:50] ENGINE=HEAP MAX_ROWS=25000; [10:03:06] it is the same [10:03:13] HEAP==MEMORY [10:03:29] yeah I know :) just copy and paste [10:04:35] in 1.13 was different: CREATE TABLE /*$wgDBprefix*/hitcounter ( hc_id INTEGER ); [10:04:43] using the default engine apparently [10:05:14] so I am ready to create it, but if it doesn't exist on codfw masters but exists on its slaves, I could break replication [10:06:03] alternatively, I could create it on the master only, and risk replication issues on eqiad [10:06:04] CREATE TABLE IF NOT EXISTS? [10:06:11] row based replication [10:06:27] a no, it doesn't affect DMLs [10:06:37] so it works (I already had the not exists) [10:06:50] I guess codfw slaves have it otherwise replica should have break there too [10:07:26] ok, so ready to apply it to the master, with binlog on [10:07:32] on s3 [10:07:37] ok, log and go [10:08:39] !log recreating hitcounter table on s3 to solve replication issues [10:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:08:50] check replication integrity while it happens [10:09:20] ok, sanitarium too has broken replica like dbstore1002 [10:09:26] dbstore1001 will break tomorrow [10:09:31] :-) [10:09:48] we can fix it going back in time [10:10:00] - if not exists should work well [10:10:48] problem there is that the schema doesn not exists... [10:11:03] from tendril all good so far [10:11:14] the schema??? [10:11:19] ah [10:11:27] that should be due to filtering [10:11:42] bawiktionary.hitcounter, bawiktionary is not there (both sanitarium and db1002) [10:11:59] mmm [10:12:07] does it exist in other slaves? [10:12:32] db1035 no [10:12:39] (random slave of s3) [10:12:45] checking on the masters [10:12:48] so it does not exist for us [10:14:36] mmmh db1038, db1075, db2018 yes though... :( [10:15:25] db1044 too.. WAT?!?! [10:15:37] so it exists on the masters, but not on the others? 
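The recreation script being prepared above boils down to an idempotent CREATE TABLE IF NOT EXISTS run for every s3 wiki, with the binlog left on so the statement replicates to every slave. A hedged sketch of what such a script could look like — not the script that was actually run; the column definition is the one from the old MediaWiki tables.sql quoted above (ENGINE=HEAP being a synonym for MEMORY), the dblist path is illustrative, and _counters would need its real definition looked up separately:

```python
#!/usr/bin/env python
# Sketch only: recreate `hitcounter` on the s3 master for every wiki in the
# shard, using the definition from the old MediaWiki tables.sql. IF NOT EXISTS
# keeps it idempotent; the binlog stays ON so it reaches every slave.
import pymysql

S3_MASTER = 'db1038.eqiad.wmnet'   # example; the current s3 master
DBLIST = 's3.dblist'               # wiki list as kept in mediawiki-config (illustrative path)

CREATE = ('CREATE TABLE IF NOT EXISTS `%s`.`hitcounter` ('
          ' hc_id INT UNSIGNED NOT NULL'
          ') ENGINE=MEMORY MAX_ROWS=25000')


def main():
    with open(DBLIST) as f:
        wikis = [line.strip() for line in f if line.strip()]
    conn = pymysql.connect(host=S3_MASTER, read_default_file='/root/.my.cnf')
    try:
        with conn.cursor() as cur:
            for wiki in wikis:
                # Schema names come from the trusted dblist, so plain
                # interpolation is acceptable in this context.
                cur.execute(CREATE % wiki)
    finally:
        conn.close()


if __name__ == '__main__':
    main()
```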
[10:15:55] I mean sanitarium it would be normal, due to special filtering that has to be enabled [10:16:02] sorry my bad [10:16:04] I'm blind [10:16:09] it's there too for db1035 [10:16:27] so far it's on all, just dbstore1002 it's missing it [10:17:01] so it exists everywhere but db1069:3212 and dbstores? [10:17:12] 3313 I meant [10:17:29] so far yes, doing with salt on all to be sure [10:18:33] strange, some filter must be wrong [10:18:38] * volans note to self: create salt groups for each shard [10:18:38] ah, I know [10:18:49] maybe that wiki is deleted [10:19:03] but it only fails on memory+restart [10:19:36] deleted from dbstore but not from production? like logical delete? [10:19:51] deleted from "wmf" [10:19:57] mediawiki [10:20:07] replication filter on dbstore1002 would include it: %wik% [10:20:19] yeah, as in "archived" [10:20:28] ok make sense now [10:20:33] https://phabricator.wikimedia.org/T30384 [10:20:53] there seems to be an issue there [10:21:02] we have to confirm its state [10:21:23] the truth is that it doesnt exist on these two hosts [10:21:29] so skip [10:21:39] and then ask about its real state [10:21:51] and either remove it completely [10:21:58] or imported to both places [10:22:14] ok, so lets create 2 tickets [10:22:31] subtask of replication services [10:22:43] that's for sure, but how many wll be in that state? dbstore1002 is broken with chwikimedia.hitcounter now [10:22:54] that looks removed too [10:23:54] well, more to the list to check [10:24:18] https://phabricator.wikimedia.org/T82828#905989 [10:24:23] if noone bothered to clean up thing properly, it is out job [10:24:23] potential list [10:24:54] ok, so I'll take care of dbstore1002, skipping each one that breaks and is not there and making a list of them [10:24:57] ok? [10:25:07] ok [10:25:30] my script finished [10:25:44] and created the table on the master [10:25:58] great [10:26:51] I will look if I have to do the same thing for _counters [10:27:19] ok [10:28:12] I can do sanitarium too [10:28:28] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [10:29:43] were it a lot, can you get the list of all the dbs? [10:30:56] I'm keeping a list of all the ones that I skip [10:31:00] great [10:31:23] so far only dbstore1002, it's breaking every 30 secs [10:33:05] we can put it on idempotent for a while [10:34:06] !log recreating _counters table on s3 to solve replication issues [10:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:35:10] it is funny how people deleted schemas from a couple of servers but not from anothers [10:35:33] yeah [10:36:24] they seems to go alphabetically [10:36:25] but hey, lets recreate the wiki with the same name, what could possibly go wrong? [10:36:26] the errors [10:36:35] rotfl [10:37:05] maybe I am creating those by executing the script? [10:37:20] no I'm 8000 seconds delayed [10:37:41] is it all wikis or just a few? [10:37:48] maybe you're re-creating them, more fun for later [10:37:53] just strange name ones [10:37:57] not all [10:38:11] like flaggedrevs_labswikimedia.hitcounter [10:38:35] well, in theory, I only recreate s3, which means not the ones deleted [10:38:40] from the s3 list [10:39:06] from where you pick the list? 
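The skip-and-record routine being used on dbstore1002 above can be automated in a few lines. A minimal sketch, written for a plain single-source replica — dbstore1002 is a MariaDB multi-source slave, so in practice the connection-named variants (e.g. STOP SLAVE 's3' ...) and the per-connection skip counter would be needed; host name and error handling are simplified:

```python
#!/usr/bin/env python
# Sketch: while the SQL thread is stopped on error 1146 ("table doesn't
# exist", as in the alerts above), record the offending table, skip one
# event, and restart the SQL thread. Runs until interrupted.
import re
import time

import pymysql

HOST = 'dbstore1002.eqiad.wmnet'   # example target
skipped = []

conn = pymysql.connect(host=HOST, read_default_file='/root/.my.cnf',
                       cursorclass=pymysql.cursors.DictCursor)
cur = conn.cursor()
try:
    while True:
        cur.execute('SHOW SLAVE STATUS')
        status = cur.fetchone()
        if status is None or status['Slave_SQL_Running'] == 'Yes':
            time.sleep(10)   # healthy for now; keep polling
            continue
        if status['Last_SQL_Errno'] != 1146:
            break            # anything unexpected: stop and look at it by hand
        match = re.search(r"Table '?([\w.]+)'? doesn'?t exist",
                          status['Last_SQL_Error'])
        skipped.append(match.group(1) if match else status['Last_SQL_Error'])
        cur.execute('STOP SLAVE SQL_THREAD')
        cur.execute('SET GLOBAL sql_slave_skip_counter = 1')
        cur.execute('START SLAVE SQL_THREAD')
finally:
    print('skipped events for: %s' % ', '.join(skipped))
    conn.close()
```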
[10:39:07] or if they are on both, existing tables should not complain [10:39:15] mediawiki-config [10:39:18] the repo [10:39:24] ah ok then, not from show databases [10:39:38] that is supposed to be the authorative source [10:39:58] good to know [10:40:18] I'll start skipping on sanitarium too in the meanwhile [10:40:21] "all slaves should contain the same information" too [10:41:45] so far 11 on dbstore1002 and for now is not breaking [10:42:30] what I do not understand is the logic in deleting only from those 2 [10:42:58] space issues? but from the name I guess they are small [10:43:17] maybe those were re-created afterwards from the official list? [10:43:34] maybe those were moved and a drop was done [10:43:55] but they already existed on another shards [10:44:04] so only multi-shards were affected? [10:44:25] could be [10:44:43] in any case, let's finish the immediate actionables [10:44:52] seems like a busy saturday heh [10:45:03] our databases are crap [10:45:06] morning bblack :) [10:45:15] and not from the infrastructure point of view [10:45:24] nothing user-facing though [10:45:24] but guess who has to fix them? [10:45:36] infrastructure, of course [10:46:06] not user facing this time [10:46:13] my brief scan of the channel says: some software wrote to readonly DBs and broke replication? [10:46:31] nope, old tought [10:46:39] more subtle than that [10:46:57] a deleted table is not actually deleted [10:47:18] heh [10:47:27] and some deleted? schemas are not deleted (or yes, depending on the server) [10:47:51] so we have some fun seeing on which servers is and which isn't [10:47:53] due to storage engine diffs? [10:48:04] and the use of Engine=MEMORY tables that doesn't play nice with replication [10:48:21] yes, but that is a side effect [10:48:34] yes, but if those tables were used also a bug [10:48:34] if the table didn't exist anywhere, this wouldn't happen [10:49:04] ok, the script finished [10:49:41] restarted s3-master slave [10:49:45] ^sounds weird [10:49:48] RECOVERY - MySQL Slave Running on db1038 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error [10:50:27] the fact that we are using circulare replication could be seen as a cause- [10:50:48] but that only exposes poorly maintaned data, is not really the cause [10:51:10] instead of playing the roulette with 2 servers, we do it with 3 [10:51:39] good news is that this should only cause cross-datacenter issues, and we do not use both datacenters at the same time (yet) [10:52:02] ok, the master seems good now [10:52:08] jynus: when MySQL docs says "the first time that the master uses a given MEMORY table after startup", do you think a SELECT COUNT(*) will trigger it? 
[10:52:38] a pure mysql_upgrade would trigger it, I do not know [10:52:59] I would like to be sure all those events were triggered and there are no more hidden that will trigger in the future at random time [10:53:11] they should affect only dbstores and sanitarium though [10:53:11] it's easier than that [10:53:23] we should drop the table completelly [10:53:32] +1000 :) [10:53:48] but I am not 100% sure it is not used [10:53:57] both of them [10:54:13] of course we need to check, from the code hitcounters seems dropped [10:54:19] _counters we need to find who created it [10:54:29] searching https://phabricator.wikimedia.org/T54921 [10:55:05] the question is, it is ok to leave unused tables [10:55:13] if they exist everywhere [10:55:30] not a half measure [10:55:31] yep, all or nothing [10:56:01] and we should force developers that create new tables to write a drop script on uninstall [10:56:29] break count steady at 11, sanitarium had exactly the same ones [10:56:47] can you create a ticket and paste that list? [10:57:13] I wonder if they will fail again due to my script [10:57:17] sure, I'll wait that the lag go back to zero to see if it breaks again [10:57:18] PROBLEM - HHVM rendering on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:57:44] also I'm wondering if it could happen that they will be triggered again at random time, for dbstore and sanitarium [10:57:57] why did it got triggered this morning? [10:57:58] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:58:01] please create that ticket "missing/dropped databases?", I will create the hitcounter _counter one [10:58:32] it could be a cron trying to update/read those [10:58:48] I setup sys on db1075, but did not follow up [10:59:00] PROBLEM - dhclient process on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:08] PROBLEM - nutcracker port on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:19] PROBLEM - DPKG on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:21] jynus: hitcounters was ripped out of core a year or two ago iirc, but was disabled on the cluster for years beyond that [10:59:28] PROBLEM - puppet last run on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:29] PROBLEM - HHVM processes on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:30] PROBLEM - nutcracker process on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:39] PROBLEM - salt-minion processes on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:49] PROBLEM - Check size of conntrack table on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:59:50] "disabled on the cluster" it is dubious [10:59:56] or the core thing [11:00:09] PROBLEM - RAID on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:00:09] because code on git != actual data on the cluster [11:00:19] PROBLEM - Disk space on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:00:23] the feature was disabled on the custer [11:00:30] whoever did it, did not properly clean up things [11:00:38] PROBLEM - SSH on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:00:39] PROBLEM - configured eth on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
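One way to follow up on the sys-schema idea mentioned above (sys was imported onto db1075 but not followed up on) is to ask it whether anything has actually touched these tables. A hedged sketch — the view and column names are from the upstream sys schema and may differ slightly by version, and the counters only cover activity since the last server restart:

```python
#!/usr/bin/env python
# Sketch: check whether hitcounter/_counters see any reads or writes,
# according to the sys schema views built on performance_schema.
import pymysql

HOST = 'db1075.eqiad.wmnet'   # the intermediate master where sys was imported

QUERY = """
    SELECT table_schema, table_name,
           rows_fetched, rows_inserted, rows_updated, rows_deleted
      FROM sys.schema_table_statistics
     WHERE table_name IN ('hitcounter', '_counters')
"""

conn = pymysql.connect(host=HOST, read_default_file='/root/.my.cnf')
try:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```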
[11:01:01] I do not discuss that, I discuss developer's responsability on maintaining the cluster up to date [11:01:29] and with responsability I do not mean them actually doing it, I should [11:01:48] but they should make sure it is done/create tickets when corresponds [11:02:19] if develpers are not responsible for anything WMF-related, then releng should enforce it [11:05:04] jynus: from the fact that the DB for which it breaks are alphabetically ordered, or was some sort of internal mysql event that did it on all, or was some sort of crontab/script [11:05:43] in both cases I hope that this means also that all were done and there will be no surprise left [11:06:07] although they were done at different times, thing that I cannot explain [11:06:09] oh, wait for better surprises on tuesday! [11:06:13] eheheh [11:06:24] will yo be there in the evening? [11:07:19] as I told you, I can be there whenever you want next week [11:07:20] :) [11:09:20] RECOVERY - nutcracker port on mw1132 is OK: TCP OK - 0.000 second response time on port 11212 [11:11:39] RECOVERY - HHVM processes on mw1132 is OK: PROCS OK: 6 processes with command name hhvm [11:12:29] RECOVERY - Disk space on mw1132 is OK: DISK OK [11:12:38] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table zh_cnwiki.hitcounter doesnt exist on query. Default database: information_schema. Query: DELETE FROM zh_cnwiki.hitcounter [11:12:39] RECOVERY - SSH on mw1132 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [11:12:48] RECOVERY - configured eth on mw1132 is OK: OK - interfaces up [11:12:53] 06Operations, 10DBA, 06Labs, 07Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#2211899 (10jcrespo) [11:13:00] already fixed dbstore1002 ^^^ [11:13:09] RECOVERY - dhclient process on mw1132 is OK: PROCS OK: 0 processes with command name dhclient [11:13:22] could you copy your list on a subtask of T50930 ? [11:13:23] T50930: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930 [11:13:28] RECOVERY - DPKG on mw1132 is OK: All packages OK [11:13:38] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [11:13:44] ok, now we are at 27 now [11:13:48] RECOVERY - nutcracker process on mw1132 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:13:50] 27? [11:13:50] RECOVERY - salt-minion processes on mw1132 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:13:59] 27 databases? [11:13:59] RECOVERY - Check size of conntrack table on mw1132 is OK: OK: nf_conntrack is 0 % full [11:14:19] RECOVERY - RAID on mw1132 is OK: OK: no RAID installed [11:14:25] yep.. 
had to skip some more around the end of the alphabet (p, r, s, t) [11:14:39] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:15:31] last one was with zh so hopefully we are done [11:15:50] it has very little lag, so it should be post-script [11:16:28] true but last one was at 1000 seconds of lag so I dunno the relationship between those events and the lag [11:16:37] mmm [11:16:49] I don't get it why they are sparse and were not all together [11:17:24] what I mean is that if it hasn't failed already, with the current lag, then it should not fail again [11:18:13] I agree with you, maybe the last events were from your script, they happened all together, but were following the alphabetic order [11:18:18] zero lag [11:19:08] to be fair, just paste the list and let's investigate further next week [11:20:16] 06Operations, 10DBA, 06Labs, 07Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#2211918 (10Volans) [11:20:31] done and added to the task you give me before [11:20:48] I'll keep an eye on sanitarium and fix the ones missing, it's a bit behind [11:21:55] what we want to do for dbstore1001 that will probably break tomorrow morning? [11:22:19] pfff, fix it on monday [11:22:31] 1 day or 2 days behind will not make a difference [11:22:38] is safer with 2 :D [11:23:10] 06Operations, 10Traffic: cronspam from cpXXXX hosts due to update-ocsp-all and zero_fetch - https://phabricator.wikimedia.org/T132835#2211794 (10BBlack) OCSP: ------------ We have an icinga alert for when things go really-wrong (multiple serial failures), and the commentary in `modules/tlsproxy/manifests/ocsp.... [11:23:26] (03PS1) 10BBlack: OCSP Stapling: make icinga alerts more aggressive [puppet] - 10https://gerrit.wikimedia.org/r/283767 (https://phabricator.wikimedia.org/T132835) [11:23:50] sorry to bother you on your first saturday back but looked tricky from the start [11:24:33] we do have a codfw-switchover test starting up next week. I think Tuesday for the MW part? [11:24:37] 06Operations, 10DBA: Email spam from some MariaDB's logrotate - https://phabricator.wikimedia.org/T127638#2211932 (10jcrespo) Technically, most of these will disappear when we deprecate 5.5 on the cluster. I would not do much until then, because we will reimage the old servers. [11:24:45] you didn't bother me [11:24:53] the database did [11:25:15] and certainly I do not blame you for this [11:25:43] I think I had the auto-update on icinga disabled [11:25:47] so I did not catch this [11:26:15] to be fair "s3-master replication issue" would have sound scary [11:27:01] yeah [11:27:30] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 5 failures [11:27:31] btw, did you finish yesterday the checks? [11:27:41] it is ok to say no [11:28:05] I want to repool db2019 before tuesday [11:28:19] es2019? [11:28:27] yes, sorry [11:28:45] no didn't finish I wanted to work a bit on it this morning :( [11:28:50] nope [11:29:02] reserve yourself for tuesday [11:29:21] so you want me on Tue, Thu, Fri? don't you need help on Wed? 
[11:29:24] but send me an email with some scripts/path on your home if you already had something [11:29:51] if I have to choose, I prefer these 3 days, ues [11:30:27] if something is semi-broken on wednesday, it is not as important as if something is broken on failback [11:30:50] ok then, those will be, and I check if I can join on Wed too, more to do maintenance stuff [11:31:18] In any case I'll send a CR with the puppet changes for TLS to be merged after eqiad is passive and we are not rolling back [11:31:44] but we can check it on Tue, I'll be there [11:32:04] yep, no rush [11:32:13] but I should be able do those too [11:32:27] it should be only a parameter change, right? [11:32:45] I wonder if 48 hours will be enough to failover properly all masters [11:33:07] not only the buffer pool and caches warming [11:33:34] potential incompatibilities, etc. [11:33:56] I suppose we can not touch the current masters until later [11:34:02] and replicate from 10 -> 5.5 [11:35:01] I will also install pt-heartbeat-wikimedia on all servers [11:36:22] yes is a paramter change in site.pp [11:36:25] logging off [11:36:37] ok sanitarium just hit the last one, only one more [11:36:44] in comparison to dbstore [11:36:48] updating the task [11:43:17] sanitarium in sync too, logging off [11:58:10] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:24:07] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:56] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:17] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:17] PROBLEM - nutcracker port on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:28] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:38] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:57] PROBLEM - puppet last run on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:27] PROBLEM - RAID on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:37] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:37] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:46] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:26:57] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:26:58] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:28:46] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [12:28:46] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [12:34:36] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:57] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:34:57] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
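On the pt-heartbeat-wikimedia note a few lines back (before the mw1138 noise): pt-heartbeat measures lag by having the master update a timestamp row every second and comparing that row against the clock on each replica, which keeps working through intermediate masters where Seconds_Behind_Master only reflects the immediate master. A rough sketch of that style of check — the heartbeat.heartbeat layout (ts, server_id) is the stock Percona Toolkit one and is an assumption here, as is the replica host:

```python
#!/usr/bin/env python
# Hedged sketch of a pt-heartbeat style lag check; assumes the standard
# heartbeat.heartbeat table is being updated on the master by
# pt-heartbeat --update and that clocks are NTP-synced.
import pymysql

REPLICA = 'db2018.codfw.wmnet'   # example replica to check
MASTER_SERVER_ID = 0             # fill in: server_id of the master being tracked

conn = pymysql.connect(host=REPLICA, read_default_file='/root/.my.cnf')
try:
    with conn.cursor() as cur:
        cur.execute(
            'SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) '
            'FROM heartbeat.heartbeat WHERE server_id = %s',
            (MASTER_SERVER_ID,))
        (lag,) = cur.fetchone()
    print('replication lag: %s seconds' % lag)
finally:
    conn.close()
```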
[12:49:57] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 53 minutes ago with 0 failures [12:50:08] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:50:18] RECOVERY - RAID on mw1138 is OK: OK: no RAID installed [12:50:36] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [12:50:36] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [12:50:36] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [12:50:47] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 66372 bytes in 0.105 second response time [12:50:47] RECOVERY - DPKG on mw1138 is OK: All packages OK [12:50:56] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up [12:51:08] RECOVERY - nutcracker port on mw1138 is OK: TCP OK - 0.000 second response time on port 11212 [12:51:08] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 12 % full [12:51:26] RECOVERY - Disk space on mw1138 is OK: DISK OK [12:51:36] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:51:56] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.070 second response time [13:22:37] PROBLEM - HHVM rendering on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:37] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:36] PROBLEM - DPKG on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:37] PROBLEM - configured eth on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:56] PROBLEM - nutcracker port on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:24:56] PROBLEM - Check size of conntrack table on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:07] PROBLEM - Disk space on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:17] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:36] PROBLEM - puppet last run on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:25:58] PROBLEM - RAID on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:16] PROBLEM - HHVM processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:16] PROBLEM - dhclient process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:26:16] PROBLEM - SSH on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:47] PROBLEM - salt-minion processes on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:27:57] RECOVERY - HHVM processes on mw1138 is OK: PROCS OK: 6 processes with command name hhvm [13:27:57] RECOVERY - dhclient process on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [13:28:06] RECOVERY - SSH on mw1138 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [13:28:17] RECOVERY - DPKG on mw1138 is OK: All packages OK [13:28:26] RECOVERY - configured eth on mw1138 is OK: OK - interfaces up [13:28:37] RECOVERY - nutcracker port on mw1138 is OK: TCP OK - 0.000 second response time on port 11212 [13:28:38] RECOVERY - Check size of conntrack table on mw1138 is OK: OK: nf_conntrack is 0 % full [13:28:56] RECOVERY - Disk space on mw1138 is OK: DISK OK [13:29:06] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [13:29:37] RECOVERY - salt-minion processes on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:29:47] RECOVERY - RAID on mw1138 is OK: OK: no RAID installed [13:29:54] (03PS3) 10Urbanecm: Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) [13:34:25] 06Operations, 10DBA: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#2211998 (10matmarex) [13:43:22] (03PS2) 10Nicko: Improve robustness of es-tool [puppet] - 10https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: 10Adedommelin) [13:59:02] (03CR) 10Luke081515: [C: 031] Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [14:01:07] PROBLEM - puppet last run on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:16] PROBLEM - DPKG on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:18] PROBLEM - HHVM rendering on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50407 bytes in 0.626 second response time [14:02:27] PROBLEM - Apache HTTP on mw1134 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50407 bytes in 0.344 second response time [14:02:46] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211383 (10Platonides) >>! In T132822#2211838, @akosiaris wrote: > Code seems to be looping between lines 1365 -> goto (YES goto restart) 847 -> and down again there calling a... [14:03:16] RECOVERY - DPKG on mw1134 is OK: All packages OK [14:04:07] PROBLEM - SSH on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:38] PROBLEM - nutcracker port on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:48] PROBLEM - Check size of conntrack table on mw1134 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:08:27] RECOVERY - nutcracker port on mw1134 is OK: TCP OK - 0.000 second response time on port 11212 [14:08:37] RECOVERY - Check size of conntrack table on mw1134 is OK: OK: nf_conntrack is 0 % full [14:09:46] RECOVERY - SSH on mw1134 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [14:23:27] If no one is investigating those OOMs, I'm going to reboot most of them [14:24:00] !log rebooting mw1134 — OOM [14:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:27] !log rebooting mw1132 — OOM [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:47] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.315 second response time [14:28:41] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2212062 (10Platonides) 2016041510010245 does indeed contain invalid utf-8 sequences: «=D9=85=A0 =D8=A7…» and 12 lines below: «=D8=A8.=A0…» Those standalone A0 bytes are not... [14:28:46] RECOVERY - HHVM rendering on mw1134 is OK: HTTP OK: HTTP/1.1 200 OK - 66362 bytes in 5.894 second response time [14:28:57] RECOVERY - HHVM rendering on mw1132 is OK: HTTP OK: HTTP/1.1 200 OK - 66362 bytes in 1.233 second response time [14:29:57] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.118 second response time [14:31:26] 06Operations: apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2212063 (10Andrew) [14:34:33] 06Operations: apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2212075 (10Andrew) [14:35:19] ACKNOWLEDGEMENT - Apache HTTP on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.023 second response time andrew bogott OOM -- T132845 [14:35:19] ACKNOWLEDGEMENT - HHVM rendering on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time andrew bogott OOM -- T132845 [14:35:19] ACKNOWLEDGEMENT - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail andrew bogott OOM -- T132845 [14:36:50] 06Operations: API apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2212077 (10Andrew) [14:37:43] 06Operations: API apache servers OOMing: mw1134 mw1132 mw1139 mw1138 - https://phabricator.wikimedia.org/T132845#2212080 (10Andrew) p:05Triage>03Unbreak! [14:49:44] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [15:01:35] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [15:20:24] (03CR) 10Nicko: [C: 031] Improve robustness of es-tool [puppet] - 10https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: 10Adedommelin) [15:47:00] (03PS1) 10Volans: [codfw-rollout] MariaDB: use Puppet cert for all core DBs [puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) [15:48:24] (03CR) 10Volans: "!!! TO BE MERGED ONLY AFTER codfw IS ACTIVE AND STABLE !!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/283771 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [15:49:17] <_joe_> volans: the commit title is not in line with our standard style [15:49:39] RECOVERY - HHVM rendering on mw1138 is OK: HTTP OK: HTTP/1.1 200 OK - 66370 bytes in 8.551 second response time [15:50:06] _joe_: the commit line is temporary, should be changed before merging with an amend [15:50:30] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.076 second response time [15:50:49] I didn't send it to the branch you created because this has to be merged after we swtich to codfw [15:50:52] not before [15:51:58] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:07:12] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2212146 (10jcrespo) As a note, m3 (miscellaneous database services - shard number 3) is entirely dedicated to phabricator database needs, and as you can see h... [16:48:40] (03PS4) 10Dereckson: Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [16:52:50] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2212184 (10akosiaris) >>! In T132822#2211841, @pajz wrote: > Would it make sense to try to trace this down to the affected ticket? I just googled the text bit ("Saudi Arabia w... [17:02:06] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2212200 (10BBlack) >>! In T131775#2209885, @mark wrote: > I think a backup Phabricator host in codfw would make a lot of sense, and is something we strive for... [17:22:29] Dereckson: https://phabricator.wikimedia.org/T128371#2212233 [17:22:41] Do you want to join too? [17:24:39] Luke081515: I can probably help as well with puppet / ops-yish changes [17:24:52] YuviPanda: Great [17:24:57] Currently the channel is a bit empty :D [17:28:18] Luke081515: send a mail to wikitech-l explaining code review is @ #wikimedia-codereview [17:28:35] ok, I will do [17:28:41] after signing up that channel at meta [17:29:14] mukunda already did [17:29:20] ah, ok [17:31:34] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2212254 (10akosiaris) Got a perl stacktrace as well ``` [Sat Apr 16 17:24:47 2016] (eval 1714): Hi at (eval 1714) line 1. [Sat Apr 16 17:24:47 2016] (eval 1714): eval 'requi... [17:38:03] 06Operations, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2212261 (10Volans) I've run @jcrespo: I've run a script to check the tables (took me only few minutes to actually complete it) Please review it's logic (I'm new to mediawiki table structures) and **double check the t... [17:46:11] YuviPanda, you know if you're looking for puppet stuff to code-review I have 3 labs-related patches sitting around :) [17:51:28] (03CR) 10Yuvipanda: [C: 04-1] "Besides the inline comments, Where is the code that's actually sending the stats to labmon?" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:51:58] (03PS2) 10Yuvipanda: deployment-prep shinken: fix check_command for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/283669 (https://phabricator.wikimedia.org/T132733) (owner: 10Alex Monk) [17:52:12] (03CR) 10Yuvipanda: [C: 032 V: 032] deployment-prep shinken: fix check_command for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/283669 (https://phabricator.wikimedia.org/T132733) (owner: 10Alex Monk) [17:52:40] (03CR) 10Alex Monk: "modules/keyholder/files/check_keyholder contains the code that's actually sending the stats to labmon" [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:53:09] (03CR) 10Yuvipanda: "ah, I missed that. It should most definitely be on a different file then, yes." [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:53:46] (03CR) 10Yuvipanda: "In fact, I think the right thing to do for things like this is to write a diamond plugin, rather than setup our own cronjobs." [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:54:19] (03CR) 10Alex Monk: "https://wikitech.wikimedia.org/wiki/Diamond -> 404 :(" [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:56:06] (03CR) 10Yuvipanda: "See modules/diamond/files/collector/sshsessions.py and modules/role/manifests/labs/instance.pp as examples." [puppet] - 10https://gerrit.wikimedia.org/r/283227 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [17:56:23] thanks YuviPanda :) [17:58:11] 06Operations, 07Documentation: Write documentation on how / when to use custom Diamond metrics collectors - https://phabricator.wikimedia.org/T132856#2212276 (10yuvipanda) [17:58:14] Krenair: np. I filed ^ [18:01:49] 06Operations, 07Documentation: Write documentation on how / when to use custom Diamond metrics collectors - https://phabricator.wikimedia.org/T132856#2212294 (10yuvipanda) I think a more general 'I want to monitor X in Y environment, what do I do?' would be nice as well. [18:08:42] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [18:11:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [18:24:30] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:26:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:30:44] YuviPanda, I've been thinking about https://phabricator.wikimedia.org/T111540 [18:31:04] customconfig/basic-instance-checks.cfg: check_command check_graphite_series_threshold!http://labmon1001.eqiad.wmnet!10!$HOSTNOTES$.$HOSTNAME$.diskspace.*.byte_percentfree!15!10!10min!1!--under [18:31:07] it's that *... [18:31:13] right. [18:31:52] kind of problematic because it assumes that any datapoint ever used is still appropriate to check [18:32:33] the reason I made it * was: 1. some instances are really old and had a fucking tiny /var (2G) that caused them to fill up, 2. Some aren't, and only need root, 3. some have /srv, 4. some have custom setups (a bunch have a mount point for /var/lib/docker, for example) 5. 
we have no way of dynamically finding out from shinken which instances have which mountpoints turned on atm [18:32:40] there's no external way to check what mounts exist on a host right? [18:32:56] that's not possible with shinkengen? [18:33:12] yup, there's no way to figure out which instance has which mountpoints mounted. [18:33:33] so to fix this properly, we'd need some way for instances to publish their list of mounts [18:33:47] and modify shinkengen to process that and spit out the appropriate config? [18:34:04] kinda. at that point it is starting to re-invent prometheus a little bit (prometheus.io) [18:34:42] for example, if there is a way for instances to expose current mount points why shouldn't they also just expose current free space? [18:35:02] and if we do that, why not have them just do the check themselves, and just have a central setup for reporting? [18:35:20] mm, we should generalize this to not be just for disk space. and then boom, reinvention :D [18:41:54] YuviPanda, and the improper solution... an extra check_graphite command for such wildcard checks that ignores ones without any data? [18:44:30] Krenair: yeah, that would be a reasonable bandaid. [18:44:57] well, for whatever definition of 'reasonable' the whole setup can survive. monitoring disk space via graphite is a bit insane, but that's what we got... [18:45:37] krenair@shinken-01:/etc/shinken$ /usr/lib/nagios/plugins/check_graphite -U http://labmon1001.eqiad.wmnet -T 10 check_series_threshold 'deployment-prep.deployment-elastic05.diskspace.*.byte_percentfree' -W 15 -C 10 --from 10min --perc 1 --under [18:45:37] UNKNOWN: No valid datapoints found for deployment-prep.deployment-elastic05.diskspace._var_log.byte_percentfree [18:45:38] :/ [18:46:05] right. so maybe an extra flag that makes it treats no valid data points as 'ignored'? [18:53:57] YuviPanda, I'm looking through SeriesThreshold.check_data [18:54:41] (I'll have to go really soon tho) [18:54:45] can't we just set up messages['UNKNOWN'] for this? [18:54:45] ok [18:57:01] maybe. I haven't looked at that script in years tho [18:57:24] I think the proper long term solution is to setup a multitenant monitoring solution that doesn't suck nor rely on puppet resource collection. [18:57:40] and also to fix the puppet module to not be such a shitshow on labs [19:32:55] (03PS1) 10Alex Monk: shinken: Allow undefined data in graphite for disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/283779 (https://phabricator.wikimedia.org/T111540) [20:08:58] (03PS1) 10Alex Monk: deployment-prep shinken: deployment-salt is no longer the puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/283783 [20:15:45] (03CR) 10Gehel: "Good job! Thanks for the time you spent! See comments inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: 10Adedommelin) [21:02:54] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2212399 (10Southparkfan) [21:47:06] (03CR) 10Dereckson: [C: 031] Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [21:47:29] (03CR) 10Dereckson: "PS4: optipng" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [22:40:59] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 36 failures [23:08:38] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:59] PROBLEM - HHVM rendering on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:38] PROBLEM - RAID on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:09:38] PROBLEM - SSH on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:09:48] PROBLEM - Check size of conntrack table on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:09:49] PROBLEM - dhclient process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:10:07] PROBLEM - DPKG on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:11:28] RECOVERY - SSH on mw1117 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [23:11:37] RECOVERY - dhclient process on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [23:11:48] RECOVERY - DPKG on mw1117 is OK: All packages OK [23:17:17] PROBLEM - nutcracker port on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:17] PROBLEM - nutcracker process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:17] PROBLEM - SSH on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:17:18] PROBLEM - configured eth on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:28] PROBLEM - dhclient process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:47] PROBLEM - DPKG on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:57] PROBLEM - Disk space on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:18] PROBLEM - HHVM processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:18:37] PROBLEM - salt-minion processes on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:51] need for powercycle? [23:21:13] or is it fine? 
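Circling back to the earlier keyholder/Shinken thread: the Diamond-collector route suggested there (instead of bespoke cronjobs pushing stats to labmon) is just a small Python class that Diamond loads and polls, along the lines of the sshsessions.py collector referenced above. A hypothetical sketch — the collector name, metric paths, and the assumed "agent: armed|locked" output of `keyholder status` are all illustrative, not an existing module:

```python
# Hypothetical Diamond collector sketch: publish one gauge per keyholder agent
# saying whether its key is armed. Would be shipped via puppet next to the
# other collectors under modules/diamond/files/collector/.
import subprocess

import diamond.collector


class KeyholderStatusCollector(diamond.collector.Collector):

    def collect(self):
        # Assumed output format: one "agent: armed|locked" line per agent.
        try:
            output = subprocess.check_output(['/usr/local/bin/keyholder', 'status'])
        except (OSError, subprocess.CalledProcessError):
            self.publish('keyholder.check_failed', 1)
            return
        self.publish('keyholder.check_failed', 0)
        for line in output.decode('utf-8', 'replace').splitlines():
            if ':' not in line:
                continue
            agent, state = [part.strip() for part in line.split(':', 1)]
            self.publish('keyholder.%s.armed' % agent,
                         1 if state == 'armed' else 0)
```

Diamond's Graphite handler adds the configured host/instance prefix itself, so the Shinken side can stay a plain check_graphite threshold rather than another one-off cron script.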
[23:28:47] RECOVERY - nutcracker port on mw1117 is OK: TCP OK - 0.000 second response time on port 11212 [23:28:47] RECOVERY - nutcracker process on mw1117 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:28:47] RECOVERY - SSH on mw1117 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [23:28:49] RECOVERY - configured eth on mw1117 is OK: OK - interfaces up [23:28:57] RECOVERY - dhclient process on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [23:29:18] RECOVERY - DPKG on mw1117 is OK: All packages OK [23:29:28] RECOVERY - Disk space on mw1117 is OK: DISK OK [23:29:49] RECOVERY - HHVM processes on mw1117 is OK: PROCS OK: 6 processes with command name hhvm [23:30:07] RECOVERY - salt-minion processes on mw1117 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:34:39] RECOVERY - RAID on mw1117 is OK: OK: no RAID installed [23:34:50] RECOVERY - Check size of conntrack table on mw1117 is OK: OK: nf_conntrack is 0 % full [23:59:19] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.026 second response time on port 9042