[00:33:05] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:43:39] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:02:05] (PS1) Ammarpad: Add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/538408 (https://phabricator.wikimedia.org/T233104)
[04:02:54] (CR) jerkins-bot: [V: -1] Add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/538408 (https://phabricator.wikimedia.org/T233104) (owner: Ammarpad)
[04:06:51] (PS2) Ammarpad: Add localized Wikipedia wordmark for szlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/538408 (https://phabricator.wikimedia.org/T233104)
[09:11:57] Operations, MediaWiki-Releasing, Parsoid: signatures were invalid: EXPKEYSIG 90E9F83F22250DD7 MediaWiki releases repository - https://phabricator.wikimedia.org/T225601 (Misterms735) @fgiunchedi I still have the problem. When I try to install nginx the following error...
[09:15:07] PROBLEM - HTTPS non-canonical-redirect-1 on ncredir2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Ncredir
[09:16:39] RECOVERY - HTTPS non-canonical-redirect-1 on ncredir2001 is OK: SSL OK - OCSP staple validity for wikipedia.com has 431002 seconds left:Certificate wikipedia.com valid until 2019-10-29 08:00:32 +0000 (expires in 36 days) https://wikitech.wikimedia.org/wiki/Ncredir
[10:29:42] (CR) Urbanecm: [C: -1] "> Patch Set 4:" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[11:55:57] Operations, Traffic: puppet restarts nginx instead of reloading it on ncredir servers - https://phabricator.wikimedia.org/T233518 (Vgutierrez)
[13:57:45] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:08:19] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:57:40] (PS1) 4nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480)
[16:58:14] (CR) Zoranzoki21: [C: -1] "Abandon this, as you made new https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/538427/" [mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[17:04:34] (PS2) 4nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480)
[17:07:39] (Abandoned) 4nn1l2: Add support for some languages on Commons and stop support for nys on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/536764 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
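[Editor's note] The "Check the last execution of netbox_ganeti_codfw_sync" alerts above (00:33 and 13:57, and again late in this log) report a failed systemd unit behind a timer. A minimal sketch of how such an alert is usually investigated on the host itself, assuming shell access to netbox1001; the commands are standard systemd tooling, wrapped in Python only to keep one language across the sketches in this log:

    # Minimal sketch: inspect why a timer-driven systemd unit last failed.
    # Assumes shell access to the alerting host (e.g. netbox1001); the unit
    # name is taken from the Icinga alert text above.
    import subprocess

    UNIT = "netbox_ganeti_codfw_sync"

    def run(cmd):
        """Run a command and return its stdout, ignoring the exit status."""
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    # Current state of the service and of the timer that drives it.
    print(run(["systemctl", "status", f"{UNIT}.service", "--no-pager"]))
    print(run(["systemctl", "list-timers", f"{UNIT}.timer", "--no-pager"]))

    # Log output from the most recent invocations, which usually shows the real error.
    print(run(["journalctl", "-u", f"{UNIT}.service", "-n", "50", "--no-pager"]))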
[17:30:58] Operations, Toolforge: When user create tool via toolsadmin, it doesn't create replica.my.cnf - https://phabricator.wikimedia.org/T233530 (Zoranzoki21) I don't know which team is for this, I think to this is for #operations
[17:31:53] Operations, Toolforge: When user create tool via toolsadmin, it doesn't create replica.my.cnf - https://phabricator.wikimedia.org/T233530 (Zoranzoki21) I reported this to IRC yesterday and I talked with @Krenair (I think).
[17:55:37] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[17:56:41] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/538427 (https://phabricator.wikimedia.org/T230480) (owner: 4nn1l2)
[18:08:44] (PS19) Andrew Bogott: codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - https://gerrit.wikimedia.org/r/537703
[18:37:48] (CR) Andrew Bogott: [C: +2] codfw1dev: move nova/keystone/glance hosts to Newton [puppet] - https://gerrit.wikimedia.org/r/537703 (owner: Andrew Bogott)
[18:42:45] PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100%
[18:42:47] PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[18:43:23] PROBLEM - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:44:35] (PS1) Andrew Bogott: Openstack: add some missing files for Newton [puppet] - https://gerrit.wikimedia.org/r/538430 (https://phabricator.wikimedia.org/T212302)
[18:45:54] (CR) Andrew Bogott: [C: +2] Openstack: add some missing files for Newton [puppet] - https://gerrit.wikimedia.org/r/538430 (https://phabricator.wikimedia.org/T212302) (owner: Andrew Bogott)
[18:47:02] I'm getting a warning on otrs-wiki about a high replag database lock
[18:47:39] It's also slower than usual
[18:48:51] PROBLEM - MariaDB Slave IO: s3 on db2105 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:49:21] PROBLEM - MariaDB Slave IO: s3 on dbstore1004 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:49:57] PROBLEM - MariaDB Slave IO: s3 on db1095 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:51:02] how much do you want to bet db1075 is in s3?
[18:51:10] ohhhh noo
[18:51:23] db1075 *is the s3 master*
[18:51:37] andrewbogott, around?
[18:52:15] Krenair: I am but in the middle of a few things — what's up? (I haven't read the backscroll yet but I could)
[18:52:24] s3 master is down andrewbogott
[18:52:46] andrewbogott, wondering if ops DBA should be paged but I'm not sure if the monitoring system has done so
[18:53:10] backscroll to 18:42:45, host db1075 is down
[18:53:11] hm, looks like not
[18:53:28] I will see if I can contact marostegui
[18:53:31] (If that didn't do it)
[18:53:36] I've just poked him on telegram
[18:53:51] cool
[18:54:06] Is the host actually down? Or just mysql?
[18:54:17] (PS1) Andrew Bogott: nova: add a few more Newton files [puppet] - https://gerrit.wikimedia.org/r/538431 (https://phabricator.wikimedia.org/T212302)
[18:54:25] you are likely capable of digging further than I, but: PROBLEM - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100%
[18:54:34] ssh is timing out for me
[18:55:38] hm, and my mgmt password doesn't work — what's that about?
[18:55:44] hey
[18:55:45] PROBLEM - MariaDB Slave Lag: s3 on dbstore1004 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 923.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:55:47] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 923.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:55:47] Are you using the new one andrewbogott?
[18:55:49] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic={udp_localhost-err,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dataso
[18:55:49] heus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[18:55:50] andrewbogott, see ops list
[18:55:50] I'm connecting
[18:55:57] PROBLEM - MariaDB Slave Lag: s3 on db2127 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 935.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:07] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 944.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:15] oh crap, ok
[18:56:18] * andrewbogott digs up new password
[18:56:19] heh
[18:56:23] PROBLEM - MariaDB Slave Lag: s3 on db2109 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 960.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:29] PROBLEM - MariaDB Slave Lag: s3 #page on db1078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 963.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:29] PROBLEM - MariaDB Slave Lag: s3 on db2105 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 964.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:37] We were going to fail over that host on tuesday
[18:56:38] PROBLEM - MariaDB Slave Lag: s3 #page on db1112 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 974.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:44] Let's see what's going on on the mgmt
[18:56:45] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 983.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:56:59] PROBLEM - MariaDB Slave Lag: s3 #page on db1123 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 994.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:57:12] Can someone silence those alerts?
[18:57:19] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1015.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:57:20] I am going to see what's the issue with the master
[18:57:33] marostegui: I'll work on silencing and acking
[18:57:50] <_joe_> uhm
[18:57:54] thanks andrewbogott
[18:57:55] anything up?
[18:58:00] s3 master is unhappy
[18:58:11] <_joe_> just got paged
[18:58:19] I am connecting to the idrac
[18:58:25] <_joe_> sigh
[18:58:27] marostegui: I'm here if I can help
[18:58:34] volans|off: thanks!
[18:58:53] BBU failed
[18:58:57] so likely a storage crash
[18:58:59] :(
[18:59:00] we have seen that before
[18:59:04] going to reboot the master
[18:59:13] bleh
[18:59:31] PROBLEM - MariaDB Slave IO: s3 #page on db1123 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:59:41] Server coming back
[18:59:44] If it hasn't been reported yet, s3 doesn't appear to be updating
[18:59:49] Danny_B, known
[18:59:49] DannyS712: we are on it
[18:59:55] I'm here too if needed
[19:00:21] PROBLEM - MariaDB Slave IO: s3 #page on db1112 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:32] PROBLEM - MariaDB Slave IO: s3 #page on db1078 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1075.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1075.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:53] * andrewbogott not sure how to prevent pages that span so many different servers
[19:01:12] * jbond42 here
[19:02:17] Server is back, I am doing some checks before starting mysql
[19:02:18] andrewbogott: what I've been doing is search for #page in icinga and then silence/acknowledge as needed
[19:02:24] ACKNOWLEDGEMENT - Host db1075 is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott Manuel + others working on this
[19:02:33] RECOVERY - Host db1075 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[19:02:57] godog: once they show up as #page they've already paged haven't they?
[19:03:46] Starting mysql
[19:04:00] andrewbogott: yes
[19:04:13] andrewbogott: the search will show all alerts that are pages, some will have fired already yeah
[19:04:36] godog: ok, I see. At this point I think everything that's going to fire has already fired so I'm going to let them be for now
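[Editor's note] "Can someone silence those alerts?" (18:57:12) is handled here by acknowledging the #page services in the Icinga UI. A rough sketch of the other common approach, scheduling downtime through Icinga's classic external command file; the command-file path, author, and host/service names below are illustrative assumptions, not the exact production configuration:

    # Sketch only: schedule Icinga downtime for a batch of noisy services.
    # Assumes the Icinga 1.x external command file interface; the path and
    # the host/service names are illustrative, not production values.
    import time

    CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed command-file location
    HOSTS = ["db2105", "dbstore1004", "db1095"]  # example replicas from the alerts above
    SERVICE = "MariaDB Slave Lag: s3"
    DURATION = 2 * 3600                          # two hours, in seconds

    now = int(time.time())
    with open(CMD_FILE, "w") as f:
        for host in HOSTS:
            # SCHEDULE_SVC_DOWNTIME;<host>;<service>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
            f.write(
                f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{SERVICE};{now};{now + DURATION};"
                f"1;0;{DURATION};oncall;s3 master db1075 crashed, silencing replication alerts\n"
            )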
[19:04:43] I'm around if needed, 10min from my laptop
[19:05:00] andrewbogott: sorry to be more clear, once you see them in IRC they have paged, on Icinga UI only if they are in hard and critical state, in soft not yet
[19:05:00] <_joe_> XioNoX: I don't think you are
[19:05:53] Interesting, this is the same batch of hosts where we have seen BBU failures lately
[19:06:00] PROBLEM - MariaDB Slave IO: s3 #page on db1075 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:09] PROBLEM - Check systemd state on db1075 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:06:16] <_joe_> I guess mysql is still starting?
[19:06:30] done now
[19:06:33] (CR) Andrew Bogott: [C: +2] nova: add a few more Newton files [puppet] - https://gerrit.wikimedia.org/r/538431 (https://phabricator.wikimedia.org/T212302) (owner: Andrew Bogott)
[19:06:35] doing one last check before removing read only
[19:06:45] RECOVERY - MariaDB Slave IO: s3 #page on db1112 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:56] RECOVERY - MariaDB Slave IO: s3 #page on db1078 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:57] <_joe_> great
[19:07:02] <_joe_> any idea what happened?
[19:07:03] RECOVERY - MariaDB Slave IO: s3 on dbstore1004 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:05] read only OFF
[19:07:08] _joe_: BBU failure
[19:07:15] PROBLEM - Check whether ferm is active by checking the default input chain on db1075 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:07:18] recoveries should start coming
[19:07:37] RECOVERY - MariaDB Slave IO: s3 #page on db1123 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:41] RECOVERY - MariaDB Slave IO: s3 #page on db1075 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:43] RECOVERY - MariaDB Slave IO: s3 on db1095 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:07:44] !log marostegui set s3 master RW
[19:07:44] <_joe_> so just a failure causing reduced iops
[19:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:49] The good thing is that this host will be failed over on tuesday
[19:07:51] RECOVERY - Check systemd state on db1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:08:00] assuming incident docs will get written for this: I think paging should've happened far sooner for a DB master to go offline, and we lack a good way to alert ops manually (just happened to see and.rew active, not clear who if anyone would be keeping an eye on IRC)
[19:08:08] _joe_: no, a bbu failure causing the host to fail (we have seen this before)
[19:08:13] RECOVERY - MariaDB Slave IO: s3 on db2105 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
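[Editor's note] Between 19:02 and 19:07 the master is rebooted, MariaDB is started by hand, a last check is run, and only then are writes re-enabled ("read only OFF", "!log marostegui set s3 master RW"). A rough sketch of what those post-restart checks can look like on a recovered master, assuming the PyMySQL client and placeholder connection details; the exact checks a DBA runs will vary:

    # Rough sketch of post-crash checks on a recovered master before re-enabling writes.
    # Assumes PyMySQL; host/user/password are placeholders.
    import pymysql

    conn = pymysql.connect(host="localhost", user="root", password="...",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # 1. Confirm the server came back read-only (the safe default after a crash).
        cur.execute("SELECT @@global.read_only AS read_only")
        print(cur.fetchone())

        # 2. Confirm InnoDB finished crash recovery cleanly.
        cur.execute("SHOW ENGINE INNODB STATUS")
        print(cur.fetchone()["Status"][:2000])

        # 3. Confirm the replicas have reconnected before opening the floodgates.
        cur.execute("SHOW SLAVE HOSTS")
        print(cur.fetchall())

        # 4. Only after the checks pass, make the master writable again
        #    (the "read only OFF" step in the log above).
        cur.execute("SET GLOBAL read_only = OFF")
    conn.close()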
[19:08:38] Yes, we are up
[19:08:44] Or we should be :)
[19:08:45] Krenair: what were the early signs of this?
[19:08:53] RECOVERY - Check whether ferm is active by checking the default input chain on db1075 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:09:21] Are we supposed to be able to edit at this point? Because I still cannot on s3 wikis
[19:09:39] cdanis: host down IRC alert that is not paging by default
[19:09:47] <_joe_> DannyS712: yes you should be able to by now
[19:09:48] cdanis, first thing visible was that icinga noticed the host (which after checking dbtree is the s3 master) was offline i.e. not responding to ping, it was not for another 9-10 minutes that I happened to look at this channel and realise something was wrong
[19:10:27] <_joe_> something is wrong with the alerts then, but that's for later please
[19:10:42] <_joe_> DannyS712: what wiki, for example?
[19:10:52] _joe_: bnwiki says "The database is read-only until replication lag decreases" to me
[19:11:07] https://bn.wikipedia.org/w/index.php?title=%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%AC%E0%A6%B9%E0%A6%BE%E0%A6%B0%E0%A6%95%E0%A6%BE%E0%A6%B0%E0%A7%80:Martin_Urbanec/%E0%A6%96%E0%A7%87%E0%A6%B2%E0%A6%BE%E0%A6%98%E0%A6%B0&action=edit&redlink=1 in particular
[19:11:19] otrs-wiki is back to normal speed but still has the message as well
[19:11:21] <_joe_> ok, marostegui do we still have replication lag?
[19:11:24] For me mswiki and enwikinews still don't work
[19:11:25] I am checking
[19:11:41] tendril reports zero lag
[19:11:42] <_joe_> yeah quite a bit according to icinga
[19:11:50] <_joe_> but tendril says zero yeah
[19:11:54] https://tools.wmflabs.org/replag/ reports 31 minute lag on s3
[19:12:14] DannyS712: that's cloud replica lag
[19:12:42] Oh. How can I see the actual wiki replica lags?
[19:12:55] https://dbtree.wikimedia.org/ lists lag for each host but I'm not sure how much I trust it
[19:13:28] I've checked a couple of random hosts and they seems to be in sync
[19:13:35] There is no lag
[19:13:47] logstash is a little backlogged atm, likely from mw logs, should recover soon
[19:13:48] <_joe_> probably some caching that is persisting longer than it should?
[19:14:06] https://bar.wikipedia.org/w/api.php?format=xml&action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=1 - lag appears to be increasing here
[19:14:18] read_only | OFF
[19:14:41] I don't see errors on mediawiki
[19:14:50] Krenair: not sure what that comes from, the lag on the hosts is 0
[19:15:04] interesting
[19:15:11] <_joe_> yeah this isn't great
[19:15:41] am no expert, could it be a heartbeat vs. normal mysql thing?
[19:15:45] I am running puppet on master
[19:15:56] marostegui: anything else to restart on the host?
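[Editor's note] The replag tool, dbtree and tendril each measure lag from a different vantage point; the api.php query pasted at 19:14:06 is what shows lag as MediaWiki itself sees it. A small sketch of the same query in Python: the action/meta/siprop/sishowalldb parameters are taken straight from the URL in the log, and the requests library and JSON output format are the only additions:

    # Query MediaWiki's own view of replication lag, mirroring the api.php URL above.
    import requests

    API = "https://bar.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "dbrepllag",
        "sishowalldb": 1,   # list every DB server, not just the most lagged one
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    for server in data["query"]["dbrepllag"]:
        # Each entry carries the DB host name and the lag MediaWiki attributes to it.
        print(server["host"], server["lag"])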
[19:16:01] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:03] jynus I'm still seeing errors on mediawikiwiki - https://www.mediawiki.org/w/index.php?title=Category:MediaWiki_database_tables&action=edit
[19:16:07] Notice: /Stage[main]/Mariadb::Heartbeat/Exec[pt-heartbeat]/returns: executed successfully
[19:16:13] RECOVERY - MariaDB Slave Lag: s3 #page on db1123 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:23] jynus: I did that already
[19:16:24] <_joe_> the value is back to normal on barwiki
[19:16:34] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:37] looks better now?
[19:16:38] but puppet runs @reboot
[19:16:41] RECOVERY - MariaDB Slave Lag: s3 on dbstore1004 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:41] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:49] it executed, which meanst it worked now
[19:16:49] volans|off: I ran it after mysql started
[19:16:51] Yeah, just got an edit through on plwikisource which also was affected
[19:16:54] I appear to be able to edit barwiki now
[19:16:55] RECOVERY - MariaDB Slave Lag: s3 on db2127 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:16:57] idem
[19:16:59] AntiComposite?
[19:17:00] working for me again on mswiki
[19:17:05] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:09] I can't edit the steward wiki... I get this warning. How much more do I need to wait?
[19:17:09] 9:16 PM
[19:17:09] 9:16 PM Warning: The database has been locked for maintenance, so you will not be able to save your edits right now. You may wish to copy and paste your text into a text file and save it for later.
[19:17:09] 9:16 PM
[19:17:09] 9:16 PM The system administrator who locked it offered this explanation: The database is read-only until replication lag decreases.
[19:17:16] OTRS-wiki is OK
[19:17:17] Trijnstel: we are on it
[19:17:18] <_joe_> so the point is, the cache is probably lasting more than it should
[19:17:19] RECOVERY - MariaDB Slave Lag: s3 on db2109 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:23] RECOVERY - MariaDB Slave Lag: s3 #page on db1078 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:23] Trijnstel, can you try again now?
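[Editor's note] The "heartbeat vs. normal mysql thing" question at 19:15:41 and the pt-heartbeat Exec at 19:16:07 point at two different lag measurements: Seconds_Behind_Master from the replication threads, and the age of the newest row pt-heartbeat writes on the master. If the lag checks read the heartbeat table, lag keeps "increasing" after a master crash until pt-heartbeat is restarted there (here, by the puppet run), even though the replication threads already show zero. A sketch comparing the two on a replica, assuming pt-heartbeat's usual heartbeat.heartbeat table and placeholder credentials; exact schema, table name, and timezone handling depend on the local pt-heartbeat configuration:

    # Sketch: compare the two lag measurements discussed above on one replica.
    # Assumes PyMySQL, placeholder connection details, and pt-heartbeat writing
    # to a heartbeat.heartbeat table (its usual layout) on the master.
    import pymysql

    conn = pymysql.connect(host="db-replica.example", user="check", password="...",
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        # 1. Lag as the replication SQL thread reports it.
        cur.execute("SHOW SLAVE STATUS")
        print("Seconds_Behind_Master:", cur.fetchone()["Seconds_Behind_Master"])

        # 2. Lag derived from pt-heartbeat: age of the newest heartbeat row that
        #    replicated from the master. If pt-heartbeat is not running on the
        #    master, this number keeps growing even while (1) reads zero.
        cur.execute("SELECT TIMESTAMPDIFF(SECOND, MAX(ts), NOW()) AS hb_lag "
                    "FROM heartbeat.heartbeat")
        print("pt-heartbeat lag:", cur.fetchone()["hb_lag"])
    conn.close()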
[19:17:25] RECOVERY - MariaDB Slave Lag: s3 on db2105 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:25] <_joe_> the *mediawiki* cache
[19:17:33] RECOVERY - MariaDB Slave Lag: s3 #page on db1112 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:17:35] yeah!
[19:17:40] I waited for more than 15 minutes...
[19:17:44] and suddenly it's solved
[19:17:48] marostegui: your run failed
[19:17:59] I guess
[19:18:03] Trijnstel: A few seconds before you joined the channel in fact ;)
[19:18:03] race condition in the puppet run at reboot?
[19:18:04] https://puppetboard.wikimedia.org/report/db1075.eqiad.wmnet/d897cb07f699f0e514b3ed4e1b2822317878554c
[19:18:09] marostegui: ^^^
[19:18:13] maybe because MySQL needs to be started manually??
[19:18:19] volans|off: I guess I missed it
[19:18:28] cdanis: yes, that is on purpose
[19:18:31] I'm trying to see why
[19:18:37] Too many things at the same time
[19:18:49] jynus: i know, but perhaps there's an unintended consequence here re: heartbeat
[19:18:52] <_joe_> ok things are in place, I am off again.
[19:19:02] marostegui: it said the socket was not there
[19:19:04] I will create a ticket and I am out
[19:19:21] <_joe_> yeah the right way to do this is to manage things via systemd probably
[19:19:26] <_joe_> we can talk tomorrow marostegui
[19:19:28] so maybe mysql was still starting
[19:19:30] it probably was run while mysql was recovering
[19:19:33] did anyone start an incident report ?
[19:20:13] I imagine someone will during EU working hours tomorrow
[19:20:19] cdanis: I will tomorrow
[19:20:39] thanks marostegui I'll work on it as well once I'm awake tomorrow
[19:20:54] (Which might be early, I was up and awake 5:30am local today)
[19:21:19] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:21:30] I am going offline
[19:21:31] Bye
[19:22:30] same here
[19:24:07] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:26:29] PROBLEM - HP RAID on db1075 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:26:31] ACKNOWLEDGEMENT - HP RAID on db1075 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T233535 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:26:34] Operations, ops-eqiad: Degraded RAID on db1075 - https://phabricator.wikimedia.org/T233535 (ops-monitoring-bot)
[19:27:09] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:27:38] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:27:40] Operations, ops-eqiad: Degraded RAID on db1075 - https://phabricator.wikimedia.org/T233535 (Marostegui)
[19:32:14] Operations, DBA: db1075 (s3 master) crashed - https://phabricator.wikimedia.org/T233534 (Marostegui)
[19:32:49] RECOVERY - MediaWiki eqiad exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
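[Editor's note] The 19:26:29 HP RAID alert ends with "Battery count: 0", the visible symptom of the BBU (battery-backed write cache) failure called out at 18:58:53. A sketch of how controller and battery status can be pulled on an HP Smart Array host; the hpssacli/ssacli utilities and their output wording are assumptions about the hardware tooling, not taken from the log:

    # Sketch: check an HP Smart Array controller's battery/capacitor status,
    # which is what "Battery count: 0" in the RAID alert above reflects.
    # Assumes hpssacli (or the newer ssacli) is installed; run as root.
    import subprocess

    for tool in ("hpssacli", "ssacli"):
        try:
            out = subprocess.run([tool, "controller", "all", "show", "status"],
                                 capture_output=True, text=True, check=True).stdout
        except (FileNotFoundError, subprocess.CalledProcessError):
            continue
        print(out)
        # A failed or missing "Battery/Capacitor Status" line here confirms the
        # BBU problem reported by the Icinga check.
        break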
[20:00:47] PROBLEM - MediaWiki codfw exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=codfw https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[20:02:23] RECOVERY - MediaWiki codfw exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops
[20:04:27] Hi, Hashar merged patch https://gerrit.wikimedia.org/r/#/c/integration/config/+/538419/ and it made problem with my repository. Can anyone merge fix https://gerrit.wikimedia.org/r/#/c/integration/config/+/538442/
[20:07:29] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[20:11:36] Operations, Mail, Wikimedia-Mailing-lists: mass AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides) Yesterday, we had the same issue on biblio-es-l with all subscribers using a yahoo email address being automatically disabled delivery, as the max retry timeout for emails from...
[20:11:48] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides)
[20:12:44] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides) See T22507 for the same issue in 2009
[20:19:42] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Platonides) Similarly, on Wednesday ...@yahoo.com subscriptions were disabled for wikitech-l due to this same issue, bouncing https://lists.wikimedia.org/pipermail/wikitech-l/2019-...
[21:27:45] (PS1) Andrew Bogott: keystone: update wmtotop.py for Newton [puppet] - https://gerrit.wikimedia.org/r/538443
[21:27:47] (PS1) Andrew Bogott: Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444
[21:27:49] (PS1) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[21:28:40] (CR) jerkins-bot: [V: -1] Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444 (owner: Andrew Bogott)
[21:28:42] (CR) Andrew Bogott: [C: +2] keystone: update wmtotop.py for Newton [puppet] - https://gerrit.wikimedia.org/r/538443 (owner: Andrew Bogott)
[21:31:15] (PS2) Andrew Bogott: Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444
[21:31:17] (PS2) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[21:45:25] (PS3) Andrew Bogott: Keystone/Newton: create /var/lib/keystone [puppet] - https://gerrit.wikimedia.org/r/538444
[21:45:27] (PS3) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[21:45:29] (PS1) Andrew Bogott: Keystone/newton: explicitly exclude the python-ldap package [puppet] - https://gerrit.wikimedia.org/r/538449
[21:46:45] (CR) Andrew Bogott: [C: +2] Keystone/newton: explicitly exclude the python-ldap package [puppet] - https://gerrit.wikimedia.org/r/538449 (owner: Andrew Bogott)
[21:54:59] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:16:07] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:22:35] (PS4) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[22:22:37] (PS1) Andrew Bogott: Keystone/Newton: update the password_whitelist plugin [puppet] - https://gerrit.wikimedia.org/r/538450
[22:22:39] (PS1) Andrew Bogott: wmfkeystonehooks: update to work with keystone Newton [puppet] - https://gerrit.wikimedia.org/r/538451
[22:51:20] (PS2) Andrew Bogott: wmfkeystonehooks: update to work with keystone Newton [puppet] - https://gerrit.wikimedia.org/r/538451
[22:51:22] (PS5) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[22:52:33] (CR) Andrew Bogott: [C: +2] wmfkeystonehooks: update to work with keystone Newton [puppet] - https://gerrit.wikimedia.org/r/538451 (owner: Andrew Bogott)
[22:52:42] (CR) Andrew Bogott: [C: +2] Keystone/Newton: update the password_whitelist plugin [puppet] - https://gerrit.wikimedia.org/r/538450 (owner: Andrew Bogott)
[22:56:57] (PS6) Andrew Bogott: Keystone/newton: install python-keystone [puppet] - https://gerrit.wikimedia.org/r/538445
[22:56:59] (PS1) Andrew Bogott: keystone.conf: remove 'verbose' config option [puppet] - https://gerrit.wikimedia.org/r/538453
[22:58:57] (CR) Andrew Bogott: [C: +2] keystone.conf: remove 'verbose' config option [puppet] - https://gerrit.wikimedia.org/r/538453 (owner: Andrew Bogott)