[00:33:05] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:43:39] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:02:05] (PS1) Ammarpad: Add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/538408 (https://phabricator.wikimedia.org/T233104)
[04:02:54] (CR) jerkins-bot: [V: -1] Add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/538408 (https://phabricator.wikimedia.org/T233104) (owner: Ammarpad)
[04:06:51] (PS2) Ammarpad: Add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/538408 (https://phabricator.wikimedia.org/T233104)
[09:11:57] Operations, MediaWiki-Releasing, Parsoid: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository - https://phabricator.wikimedia.org/T225601 (Misterms735) @fgiunchedi I still have the problem. When I try to install nginx the following error...
[09:15:07] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[09:16:39] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir2001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 431002 seconds left:Certificate wikipedia.com valid until 2019-10-29 08:00:32 +0000 (expires in 36 days) https://wikitech.wikimedia.org/wiki/Ncredir
[10:29:42] (CR) Urbanecm: [C: -1] "> Patch Set 4:" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[11:55:57] Operations, Traffic: puppet restarts nginx instead of reloading it on ncredir servers - https://phabricator.wikimedia.org/T233518 (Vgutierrez)
[13:57:45] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:08:19] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:57:40] (PS1) 4nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480)
[16:58:14] (CR) Zoranzoki21: [C: -1] "Abandon this, as you made new https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/538427/" [mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[17:04:34] (PS2) 4nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480)
[17:07:39] (Abandoned) 4nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
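[Editor's note] The "Check the last execution of netbox_ganeti_codfw_sync" alerts above (00:33 and 13:57, and again late in this log) report a failed systemd unit behind a timer. A minimal sketch of how such an alert is usually investigated on the host itself, assuming shell access to netbox1001; the commands are standard systemd tooling, wrapped in Python only to keep one language across the sketches in this log:

    # Minimal sketch: inspect why a timer-driven systemd unit last failed.
    # Assumes shell access to the alerting host (e.g. netbox1001); the unit
    # name is taken from the Icinga alert text above.
    import subprocess

    UNIT = "netbox_ganeti_codfw_sync"

    def run(cmd):
        """Run a command and return its stdout, ignoring the exit status."""
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    # Current state of the service and of the timer that drives it.
    print(run(["systemctl", "status", f"{UNIT}.service", "--no-pager"]))
    print(run(["systemctl", "list-timers", f"{UNIT}.timer", "--no-pager"]))

    # Log output from the most recent invocations, which usually shows the real error.
    print(run(["journalctl", "-u", f"{UNIT}.service", "-n", "50", "--no-pager"]))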
[17:30:58] Operations, Toolforge: When user create tool via toolsadmin, it doesn't create replica.my.cnf - https://phabricator.wikimedia.org/T233530 (Zoranzoki21) I don't know which team is for this, I think to this is for #operations
[17:31:53] Operations, Toolforge: When user create tool via toolsadmin, it doesn't create replica.my.cnf - https://phabricator.wikimedia.org/T233530 (Zoranzoki21) I reported this to IRC yesterday and I talked with @Krenair (I think).
[17:55:37] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[17:56:41] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[18:08:44] (PS19) Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - https://gerrit.wikimedia.org/r/537703
[18:37:48] (CR) Andrew Bogott: [C: +2] codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - https://gerrit.wikimedia.org/r/537703 (owner: Andrew Bogott)
[18:42:45] PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100%
[18:42:47] PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[18:43:23] PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:44:35] (PS1) Andrew Bogott: Openstack: add some missing files for Newton [puppet] - https://gerrit.wikimedia.org/r/538430 (https://phabricator.wikimedia.org/T212302)
[18:45:54] (CR) Andrew Bogott: [C: +2] Openstack: add some missing files for Newton [puppet] - https://gerrit.wikimedia.org/r/538430 (https://phabricator.wikimedia.org/T212302) (owner: Andrew Bogott)
[18:47:02] I'm getting a warning on otrs-wiki about a high replag database lock
[18:47:39] It's also slower than usual
[18:48:51] PROBLEM - MariaDB Slave IO: s3 on db2105 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:49:21] PROBLEM - MariaDB Slave IO: s3 on dbstore1004 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:49:57] PROBLEM - MariaDB Slave IO: s3 on db1095 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:51:02] how much do you want to bet db1075 is in s3?
[18:51:10] ohhhh noo
[18:51:23] db1075 *is the s3 master*
[18:51:37] andrewbogott, around?
[18:52:15] Krenair: I am but in the middle of a few things — what's up? (I haven't read the backscroll yet but I could)
[18:52:24] s3 master is down andrewbogott
[18:52:46] andrewbogott, wondering if ops DBA should be paged but I'm not sure if the monitoring system has done so
[18:53:10] backscroll to 18:42:45, host db1075 is down
[18:53:11] hm, looks like not
[18:53:28] I will see if I can contact marostegui
[18:53:31] (If that didn't do it)
[18:53:36] I've just poked him on telegram
[18:53:51] cool
[18:54:06] Is the host actually down? Or just mysql?
[18:54:17] (PS1) Andrew Bogott: nova: add a few more Newton files [puppet] - https://gerrit.wikimedia.org/r/538431 (https://phabricator.wikimedia.org/T212302)
[18:54:25] you are likely capable of digging further than I, but: PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:34] ssh is timing out for me
[18:55:38] hm, and my mgmt password doesn't work — what's that about?
[18:55:44] hey
[18:55:45] PROBLEM - MariaDB Slave Lag: s3 on dbstore1004 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 923.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:55:47] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 923.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:55:47] Are you using the new one andrewbogott?
[18:55:49] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={udp_localhost-err,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dataso
[18:55:49] heus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[18:55:50] andrewbogott, see ops list
[18:55:50] I'm connecting
[18:55:57] PROBLEM - MariaDB Slave Lag: s3 on db2127 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 935.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:07] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:15] oh crap, ok
[18:56:18] * andrewbogott digs up new password
[18:56:19] heh
[18:56:23] PROBLEM - MariaDB Slave Lag: s3 on db2109 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 960.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:29] PROBLEM - MariaDB Slave Lag: s3 #page on db1078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:29] PROBLEM - MariaDB Slave Lag: s3 on db2105 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 964.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:37] We were going to fail over that host on tuesday
[18:56:38] PROBLEM - MariaDB Slave Lag: s3 #page on db1112 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 974.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:44] Let's see what's going on on the mgmt
[18:56:45] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 983.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:59] PROBLEM - MariaDB Slave Lag: s3 #page on db1123 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 994.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:57:12] Can someone silence those alerts?
[18:57:19] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1015.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:57:20] I am going to see what's the issue with the master
[18:57:33] marostegui: I'll work on silencing and acking
[18:57:50] <_joe_> uhm
[18:57:54] thanks andrewbogott
[18:57:55] anything up?
[18:58:00] s3 master is unhappy
[18:58:11] <_joe_> just got paged
[18:58:19] I am connecting to the idrac
[18:58:25] <_joe_> sigh
[18:58:27] marostegui: I'm here if I can help
[18:58:34] volans|off: thanks!
[18:58:53] BBU failed
[18:58:57] so likely a storage crash
[18:58:59] :(
[18:59:00] we have seen that before
[18:59:04] going to reboot the master
[18:59:13] bleh
[18:59:31] PROBLEM - MariaDB Slave IO: s3 #page on db1123 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:59:41] Server coming back
[18:59:44] If it hasn't been reported yet, s3 doesn't appear to be updating
[18:59:49] Danny_B, known
[18:59:49] DannyS712: we are on it
[18:59:55] I'm here too if needed
[19:00:21] PROBLEM - MariaDB Slave IO: s3 #page on db1112 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:32] PROBLEM - MariaDB Slave IO: s3 #page on db1078 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:53] * andrewbogott not sure how to prevent pages that span so many different servers
[19:01:12] * jbond42 here
[19:02:17] Server is back, I am doing some checks before starting mysql
[19:02:18] andrewbogott: what I've been doing is search for #page in icinga and then silence/acknowledge as needed
[19:02:24] ACKNOWLEDGEMENT - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott Manuel + others working on this
[19:02:33] RECOVERY - Host db1075 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[19:02:57] godog: once they show up as #page they've already paged haven't they?
[19:03:46] Starting mysql
[19:04:00] andrewbogott: yes
[19:04:13] andrewbogott: the search will show all alerts that are pages, some will have fired already yeah
[19:04:36] godog: ok, I see. At this point I think everything that's going to fire has already fired so I'm going to let them be for now
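[Editor's note] "Can someone silence those alerts?" (18:57:12) is handled here by acknowledging the #page services in the Icinga UI. A rough sketch of the other common approach, scheduling downtime through Icinga's classic external command file; the command-file path, author, and host/service names below are illustrative assumptions, not the exact production configuration:

    # Sketch only: schedule Icinga downtime for a batch of noisy services.
    # Assumes the Icinga 1.x external command file interface; the path and
    # the host/service names are illustrative, not production values.
    import time

    CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed command-file location
    HOSTS = ["db2105", "dbstore1004", "db1095"]  # example replicas from the alerts above
    SERVICE = "MariaDB Slave Lag: s3"
    DURATION = 2 * 3600                          # two hours, in seconds

    now = int(time.time())
    with open(CMD_FILE, "w") as f:
        for host in HOSTS:
            # SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
            f.write(
                f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{SERVICE};{now};{now + DURATION};"
                f"1;0;{DURATION};oncall;s3 master db1075 crashed, silencing replication alerts\n"
            )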
[19:04:43] I'm around if needed, 10min from my laptop
[19:05:00] andrewbogott: sorry to be more clear, once you see them in IRC they have paged, on Icinga UI only if they are in hard and critical state, in soft not yet
[19:05:00] <_joe_> XioNoX: I don't think you are
[19:05:53] Interesting, this is the same batch of hosts where we have seen BBU failures lately
[19:06:00] PROBLEM - MariaDB Slave IO: s3 #page on db1075 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:09] PROBLEM - Check systemd state on db1075 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:16] <_joe_> I guess mysql is still starting?
[19:06:30] done now
[19:06:33] (CR) Andrew Bogott: [C: +2] nova: add a few more Newton files [puppet] - https://gerrit.wikimedia.org/r/538431 (https://phabricator.wikimedia.org/T212302) (owner: Andrew Bogott)
[19:06:35] doing one last check before removing read only
[19:06:45] RECOVERY - MariaDB Slave IO: s3 #page on db1112 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:56] RECOVERY - MariaDB Slave IO: s3 #page on db1078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:57] <_joe_> great
[19:07:02] <_joe_> any idea what happened?
[19:07:03] RECOVERY - MariaDB Slave IO: s3 on dbstore1004 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:05] read only OFF
[19:07:08] _joe_: BBU failure
[19:07:15] PROBLEM - Check whether ferm is active by checking the default input chain on db1075 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:07:18] recoveries should start coming
[19:07:37] RECOVERY - MariaDB Slave IO: s3 #page on db1123 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:41] RECOVERY - MariaDB Slave IO: s3 #page on db1075 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:43] RECOVERY - MariaDB Slave IO: s3 on db1095 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:44] !log marostegui set s3 master RW
[19:07:44] <_joe_> so just a failure causing reduced iops
[19:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:49] The good thing is that this host will be failed over on tuesday
[19:07:51] RECOVERY - Check systemd state on db1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:00] assuming incident docs will get written for this: I think paging should've happened far sooner for a DB master to go offline, and we lack a good way to alert ops manually (just happened to see and.rew active, not clear who if anyone would be keeping an eye on IRC)
[19:08:08] _joe_: no, a bbu failure causing the host to fail (we have seen this before)
[19:08:13] RECOVERY - MariaDB Slave IO: s3 on db2105 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
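[Editor's note] Between 19:02 and 19:07 the master is rebooted, MariaDB is started by hand, a last check is run, and only then are writes re-enabled ("read only OFF", "!log marostegui set s3 master RW"). A rough sketch of what those post-restart checks can look like on a recovered master, assuming the PyMySQL client and placeholder connection details; the exact checks a DBA runs will vary:

    # Rough sketch of post-crash checks on a recovered master before re-enabling writes.
    # Assumes PyMySQL; host/user/password are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="root", password="...",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # 1. Confirm the server came back read-only (the safe default after a crash).
        cur.execute("SELECT @@global.read_only AS read_only")
        print(cur.fetchone())

        # 2. Confirm InnoDB finished crash recovery cleanly.
        cur.execute("SHOW ENGINE INNODB STATUS")
        print(cur.fetchone()["Status"][:2000])

        # 3. Confirm the replicas have reconnected before opening the floodgates.
        cur.execute("SHOW SLAVE HOSTS")
        print(cur.fetchall())

        # 4. Only after the checks pass, make the master writable again
        #    (the "read only OFF" step in the log above).
        cur.execute("SET GLOBAL read_only = OFF")
    conn.close()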
[19:08:38] Yes, we are up
[19:08:44] Or we should be :)
[19:08:45] Krenair: what were the early signs of this?
[19:08:53] RECOVERY - Check whether ferm is active by checking the default input chain on db1075 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:09:21] Are we supposed to be able to edit at this point? Because I still cannot on s3 wikis
[19:09:39] cdanis: host down IRC alert that is not paging by default
[19:09:47] <_joe_> DannyS712: yes you should be able to by now
[19:09:48] cdanis, first thing visible was that icinga noticed the host (which after checking dbtree is the s3 master) was offline i.e. not responding to ping, it was not for another 9-10 minutes that I happened to look at this channel and realise something was wrong
[19:10:27] <_joe_> something is wrong with the alerts then, but that's for later please
[19:10:42] <_joe_> DannyS712: what wiki, for example?
[19:10:52] _joe_: bnwiki says "The database is read-only until replication lag decreases" to me
[19:11:07] https://bn.wikipedia.org/w/index.php?title=%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%AC%E0%A6%B9%E0%A6%BE%E0%A6%B0%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80:Martin_Urbanec/%E0%A6%96%E0%A7%87%E0%A6%B2%E0%A6%BE%E0%A6%98%E0%A6%B0&action=edit&redlink=1 in particular
[19:11:19] otrs-wiki is back to normal speed but still has the message as well
[19:11:21] <_joe_> ok, marostegui do we still have replication lag?
[19:11:24] For me mswiki and enwikinews still don't work
[19:11:25] I am checking
[19:11:41] tendril reports zero lag
[19:11:42] <_joe_> yeah quite a bit according to icinga
[19:11:50] <_joe_> but tendril says zero yeah
[19:11:54] https://tools.wmflabs.org/replag/ reports 31 minute lag on s3
[19:12:14] DannyS712: that's cloud replica lag
[19:12:42] Oh. How can I see the actual wiki replica lags?
[19:12:55] https://dbtree.wikimedia.org/ lists lag for each host but I'm not sure how much I trust it
[19:13:28] I've checked a couple of random hosts and they seems to be in sync
[19:13:35] There is no lag
[19:13:47] logstash is a little backlogged atm, likely from mw logs, should recover soon
[19:13:48] <_joe_> probably some caching that is persisting longer than it should?
[19:14:06] https://bar.wikipedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=1 - lag appears to be increasing here
[19:14:18] read_only | OFF
[19:14:41] I don't see errors on mediawiki
[19:14:50] Krenair: not sure what that comes from, the lag on the hosts is 0
[19:15:04] interesting
[19:15:11] <_joe_> yeah this isn't great
[19:15:41] am no expert, could it be a heartbeat vs. normal mysql thing?
[19:15:45] I am running puppet on master
[19:15:56] marostegui: anything else to restart on the host?
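[Editor's note] The replag tool, dbtree and tendril each measure lag from a different vantage point; the api.php query pasted at 19:14:06 is what shows lag as MediaWiki itself sees it. A small sketch of the same query in Python: the action/meta/siprop/sishowalldb parameters are taken straight from the URL in the log, and the requests library and JSON output format are the only additions:

    # Query MediaWiki's own view of replication lag, mirroring the api.php URL above.
    import requests

    API = "https://bar.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "dbrepllag",
        "sishowalldb": 1,   # list every DB server, not just the most lagged one
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    for server in data["query"]["dbrepllag"]:
        # Each entry carries the DB host name and the lag MediaWiki attributes to it.
        print(server["host"], server["lag"])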
[19:16:01] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:03] jynus I'm still seeing errors on mediawikiwiki - https://www.mediawiki.org/w/index.php?title=Category:MediaWiki_database_tables&action=edit
[19:16:07] Notice: /Stage[main]/Mariadb::Heartbeat/Exec[pt-heartbeat]/returns: executed successfully
[19:16:13] RECOVERY - MariaDB Slave Lag: s3 #page on db1123 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:23] jynus: I did that already
[19:16:24] <_joe_> the value is back to normal on barwiki
[19:16:34] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:37] looks better now?
[19:16:38] but puppet runs @reboot
[19:16:41] RECOVERY - MariaDB Slave Lag: s3 on dbstore1004 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:41] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:49] it executed, which meanst it worked now
[19:16:49] volans|off: I ran it after mysql started
[19:16:51] Yeah, just got an edit through on plwikisource which also was affected
[19:16:54] I appear to be able to edit barwiki now
[19:16:55] RECOVERY - MariaDB Slave Lag: s3 on db2127 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:57] idem
[19:16:59] AntiComposite?
[19:17:00] working for me again on mswiki
[19:17:05] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:09] I can't edit the steward wiki... I get this warning. How much more do I need to wait?
[19:17:09] 9:16 PM
[19:17:09] 9:16 PM Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later.
[19:17:09] 9:16 PM
[19:17:09] 9:16 PM The system administrator who locked it offered this explanation: The database is read-only until replication lag decreases.
[19:17:16] OTRS-wiki is OK
[19:17:17] Trijnstel: we are on it
[19:17:18] <_joe_> so the point is, the cache is probably lasting more than it should
[19:17:19] RECOVERY - MariaDB Slave Lag: s3 on db2109 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:23] RECOVERY - MariaDB Slave Lag: s3 #page on db1078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:23] Trijnstel, can you try again now?
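[Editor's note] The "heartbeat vs. normal mysql thing" question at 19:15:41 and the pt-heartbeat Exec at 19:16:07 point at two different lag measurements: Seconds_Behind_Master from the replication threads, and the age of the newest row pt-heartbeat writes on the master. If the lag checks read the heartbeat table, lag keeps "increasing" after a master crash until pt-heartbeat is restarted there (here, by the puppet run), even though the replication threads already show zero. A sketch comparing the two on a replica, assuming pt-heartbeat's usual heartbeat.heartbeat table and placeholder credentials; exact schema, table name, and timezone handling depend on the local pt-heartbeat configuration:

    # Sketch: compare the two lag measurements discussed above on one replica.
    # Assumes PyMySQL, placeholder connection details, and pt-heartbeat writing
    # to a heartbeat.heartbeat table (its usual layout) on the master.
    import pymysql

    conn = pymysql.connect(host="db-replica.example", user="check", password="...",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # 1. Lag as the replication SQL thread reports it.
        cur.execute("SHOW SLAVE STATUS")
        print("Seconds_Behind_Master:", cur.fetchone()["Seconds_Behind_Master"])

        # 2. Lag derived from pt-heartbeat: age of the newest heartbeat row that
        #    replicated from the master. If pt-heartbeat is not running on the
        #    master, this number keeps growing even while (1) reads zero.
        cur.execute("SELECT TIMESTAMPDIFF(SECOND, MAX(ts), NOW()) AS hb_lag "
                    "FROM heartbeat.heartbeat")
        print("pt-heartbeat lag:", cur.fetchone()["hb_lag"])
    conn.close()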
[19:17:25] RECOVERY - MariaDB Slave Lag: s3 on db2105 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:25] <_joe_> the *mediawiki* cache
[19:17:33] RECOVERY - MariaDB Slave Lag: s3 #page on db1112 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:35] yeah!
[19:17:40] I waited for more than 15 minutes...
[19:17:44] and suddenly it's solved
[19:17:48] marostegui: your run failed
[19:17:59] I guess
[19:18:03] Trijnstel: A few seconds before you joined the channel in fact ;)
[19:18:03] race condition in the puppet run at reboot?
[19:18:04] https://puppetboard.wikimedia.org/report/db1075.eqiad.wmnet/d897cb07f699f0e514b3ed4e1b2822317878554c
[19:18:09] marostegui: ^^^
[19:18:13] maybe because MySQL needs to be started manually??
[19:18:19] volans|off: I guess I missed it
[19:18:28] cdanis: yes, that is on purpose
[19:18:31] I'm trying to see why
[19:18:37] Too many things at the same time
[19:18:49] jynus: i know, but perhaps there's an unintended consequence here re: heartbeat
[19:18:52] <_joe_> ok things are in place, I am off again.
[19:19:02] marostegui: it said the socket was not there
[19:19:04] I will create a ticket and I am out
[19:19:21] <_joe_> yeah the right way to do this is to manage things via systemd probably
[19:19:26] <_joe_> we can talk tomorrow marostegui
[19:19:28] so maybe mysql was still starting
[19:19:30] it probably was run while mysql was recovering
[19:19:33] did anyone start an incident report ?
[19:20:13] I imagine someone will during EU working hours tomorrow
[19:20:19] cdanis: I will tomorrow
[19:20:39] thanks marostegui I'll work on it as well once I'm awake tomorrow
[19:20:54] (Which might be early, I was up and awake 5:30am local today)
[19:21:19] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:21:30] I am going offline
[19:21:31] Bye
[19:22:30] same here
[19:24:07] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:26:29] PROBLEM - HP RAID on db1075 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:26:31] ACKNOWLEDGEMENT - HP RAID on db1075 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T233535 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:26:34] Operations, ops-eqiad: Degraded RAID on db1075 - https://phabricator.wikimedia.org/T233535 (ops-monitoring-bot)
[19:27:09] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:27:38] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:27:40] Operations, ops-eqiad: Degraded RAID on db1075 - https://phabricator.wikimedia.org/T233535 (Marostegui)
[19:32:14] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:32:49] RECOVERY - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
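[Editor's note] The 19:26:29 HP RAID alert ends with "Battery count: 0", the visible symptom of the BBU (battery-backed write cache) failure called out at 18:58:53. A sketch of how controller and battery status can be pulled on an HP Smart Array host; the hpssacli/ssacli utilities and their output wording are assumptions about the hardware tooling, not taken from the log:

    # Sketch: check an HP Smart Array controller's battery/capacitor status,
    # which is what "Battery count: 0" in the RAID alert above reflects.
    # Assumes hpssacli (or the newer ssacli) is installed; run as root.
    import subprocess

    for tool in ("hpssacli", "ssacli"):
        try:
            out = subprocess.run([tool, "controller", "all", "show", "status"],
                                 capture_output=True, text=True, check=True).stdout
        except (FileNotFoundError, subprocess.CalledProcessError):
            continue
        print(out)
        # A failed or missing "Battery/Capacitor Status" line here confirms the
        # BBU problem reported by the Icinga check.
        break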
[20:00:47] PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[20:02:23] RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[20:04:27] Hi, Hashar merged patch https://gerrit.wikimedia.org/r/#/c/integration/config/+/538419/ and it made problem with my repository. Can anyone merge fix https://gerrit.wikimedia.org/r/#/c/integration/config/+/538442/
[20:07:29] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[20:11:36] Operations, Mail, Wikimedia-Mailing-lists: mass AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides) Yesterday, we had the same issue on biblio-es-l with all subscribers using a yahoo email address being automatically disabled delivery, as the max retry timeout for emails from...
[20:11:48] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides)
[20:12:44] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides) See T22507 for the same issue in 2009
[20:19:42] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides) Similarly, on Wednesday ...@yahoo.com subscriptions were disabled for wikitech-l due to this same issue, bouncing https://lists.wikimedia.org/pipermail/wikitech-l/2019-...
[21:27:45] (PS1) Andrew Bogott: keystone: update wmtotop.py for Newton [puppet] - https://gerrit.wikimedia.org/r/538443
[21:27:47] (PS1) Andrew Bogott: Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444
[21:27:49] (PS1) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[21:28:40] (CR) jerkins-bot: [V: -1] Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444 (owner: Andrew Bogott)
[21:28:42] (CR) Andrew Bogott: [C: +2] keystone: update wmtotop.py for Newton [puppet] - https://gerrit.wikimedia.org/r/538443 (owner: Andrew Bogott)
[21:31:15] (PS2) Andrew Bogott: Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444
[21:31:17] (PS2) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[21:45:25] (PS3) Andrew Bogott: Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444
[21:45:27] (PS3) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[21:45:29] (PS1) Andrew Bogott: Keystone/newton: explicitly exclude the python-ldap package [puppet] - https://gerrit.wikimedia.org/r/538449
[21:46:45] (CR) Andrew Bogott: [C: +2] Keystone/newton: explicitly exclude the python-ldap package [puppet] - https://gerrit.wikimedia.org/r/538449 (owner: Andrew Bogott)
[21:54:59] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:16:07] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:22:35] (PS4) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[22:22:37] (PS1) Andrew Bogott: Keystone/Newton: update the password_whitelist plugin [puppet] - https://gerrit.wikimedia.org/r/538450
[22:22:39] (PS1) Andrew Bogott: wmfkeystonehooks: update to work with keystone Newton [puppet] - https://gerrit.wikimedia.org/r/538451
[22:51:20] (PS2) Andrew Bogott: wmfkeystonehooks: update to work with keystone Newton [puppet] - https://gerrit.wikimedia.org/r/538451
[22:51:22] (PS5) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[22:52:33] (CR) Andrew Bogott: [C: +2] wmfkeystonehooks: update to work with keystone Newton [puppet] - https://gerrit.wikimedia.org/r/538451 (owner: Andrew Bogott)
[22:52:42] (CR) Andrew Bogott: [C: +2] Keystone/Newton: update the password_whitelist plugin [puppet] - https://gerrit.wikimedia.org/r/538450 (owner: Andrew Bogott)
[22:56:57] (PS6) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[22:56:59] (PS1) Andrew Bogott: keystone.conf: remove 'verbose' config option [puppet] - https://gerrit.wikimedia.org/r/538453
[22:58:57] (CR) Andrew Bogott: [C: +2] keystone.conf: remove 'verbose' config option [puppet] - https://gerrit.wikimedia.org/r/538453 (owner: Andrew Bogott)