[00:18:06] PROBLEM - puppet last run on db1103 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:30:04] 06Operations: puppet mechanism updating motd is broken - https://phabricator.wikimedia.org/T80998#882211 (10faidon) More information please? In any case, if it is, please file a new task, don't revive all these old ones.
[00:31:36] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:46:06] RECOVERY - puppet last run on db1103 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[01:00:36] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[01:11:56] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:40:56] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[02:11:56] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[02:13:56] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[02:20:13] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 14s)
[02:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 18 02:26:11 UTC 2017 (duration 5m 59s)
[02:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:56] (03PS3) 10Aaron Schulz: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210)
[05:01:01] !log insert decryption key for WMF Board Election
[05:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:14] 06Operations, 10ops-codfw, 10DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3272571 (10Marostegui) And the disk finally failed: T165629 ``` physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, Failed) ``` Let's follow up on the failed task.
[05:54:56] 06Operations, 10ops-codfw, 10DBA: db2058: Predictive RAID failure - https://phabricator.wikimedia.org/T165498#3272573 (10Marostegui) 05Open>03Invalid
[05:55:57] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3272144 (10Marostegui) p:05Triage>03Normal a:03Papaul Please @Papaul proceed and change this disk when you have time. Thanks!
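The `physicaldrive 1I:1:3 ... Failed` line quoted in T165498 is the output format of HP's RAID CLI. A minimal sketch of how such a failed disk is usually confirmed on the host itself, assuming the installed HP utility is `hpssacli` (older hosts may ship `hpacucli` instead) and that the controller sits in slot 0:

```
# Full controller configuration, including per-drive status
sudo hpssacli ctrl all show config

# Just the physical drive states on the controller in slot 0 (slot number is illustrative)
sudo hpssacli ctrl slot=0 pd all show status
```

The same drive states are what the "HP RAID" Icinga check that appears later in this log reports on.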
[05:58:46] !log Deploy alter table s2.revision table - dbstore1001
[05:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:56] (03CR) 10Marostegui: [C: 032] MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo)
[06:02:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611)
[06:04:31] (03Merged) 10jenkins-bot: MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo)
[06:05:14] (03PS2) 10Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611)
[06:05:43] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2062 - T116557 (duration: 00m 39s)
[06:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:51] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557
[06:05:52] (03CR) 10jenkins-bot: MariaDB: Repool db2062 after maintenanace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354092 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo)
[06:07:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[06:08:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[06:08:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354179 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[06:09:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 - T162611 (duration: 00m 38s)
[06:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:42] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:10:14] !log Deploy alter table s2.revision table - db1090 - T162611
[06:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:23] !log Deploy alter table s2.revision table - dbstore1001 - T162611
[06:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:55] (03CR) 10Marostegui: [C: 032] mariadb: clean up duplicate GRANTs for phstats user [puppet] - 10https://gerrit.wikimedia.org/r/348779 (owner: 10Dzahn)
[06:21:26] !log Deploy alter table s2.revision table - labsdb1001 - T162611
[06:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:34] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:24:27] !log Deploy alter table on s2.ptwiki directly on codfw master (db2017) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[06:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:44] !log installing tiff security updates
[06:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:54] (03CR) 10Jcrespo: [C: 031] "OK, but not sure if it will work (e.g. if some text has been already sent)." [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn)
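Most of the day's DBA work follows one pattern: push a one-line change to wmf-config/db-eqiad.php (or db-codfw.php), merge it, sync it from tin so MediaWiki stops sending read traffic to the replica about to be altered, then revert to repool. As a rough sketch of what the depool edit itself looks like, assuming the usual sectionLoads layout of that file; the master placeholder and the weights are illustrative, not copied from the actual change:

```
// wmf-config/db-eqiad.php (sketch): take db1090 out of the s2 rotation
'sectionLoads' => [
    's2' => [
        'dbXXXX' => 0,      // current s2 master, weight 0 (placeholder name)
        'db1074' => 100,    // weights are illustrative
        // 'db1090' => 100, // depooled for the s2.revision alter (T162611)
    ],
    // ... other sections ...
],
```

The `!log marostegui@tin Synchronized wmf-config/db-eqiad.php: ...` entries are what the sync step from tin logs when it ships the merged change to the application servers.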
[07:00:34] (03PS1) 10Ema: Bump version number in setup.py [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/354180
[07:01:41] !log Deploy alter table on s2.plwiki directly on codfw master (db2017) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[07:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:39] (03CR) 10Marostegui: "I merged this. Thanks for picking this up Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/348779 (owner: 10Dzahn)
[07:14:32] 06Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272649 (10ema)
[07:14:39] 06Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3272664 (10ema) p:05Triage>03Normal
[07:23:26] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:31] PROBLEM - mysqld processes on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:36] PROBLEM - MariaDB Slave SQL: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:36] PROBLEM - Disk space on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: s4 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: s1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave SQL: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: s3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - configured eth on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:46] PROBLEM - MariaDB Slave IO: m2 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:47] PROBLEM - MariaDB Slave IO: s2 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:49] PROBLEM - MariaDB Slave IO: m3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB disk space on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - dhclient process on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave Lag: x1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave IO: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave SQL: m3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - DPKG on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:52] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - MariaDB Slave SQL: x1 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - MariaDB Slave IO: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - MariaDB Slave IO: s7 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:23:56] PROBLEM - Check size of conntrack table on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:01] :|
[07:24:08] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:08] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - Check systemd state on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - MariaDB Slave SQL: s5 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:16] PROBLEM - MariaDB Slave Lag: m3 on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:24:21] why is it complaining?
[07:24:22] I guess dbstore2001 died ?
[07:24:27] it didn't
[07:24:27] nope
[07:24:28] it is up
[07:25:14] ok then, it's icinga .. I am looking
[07:25:22] icinga checks' duration is 6d 23h, maybe some expired downtime?
[07:26:55] but it is complaining about everything
[07:26:56] PROBLEM - Check the NTP synchronisation status of timesyncd on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:26:57] connect to address 10.192.0.32 port 5666: Connection refused
[07:27:00] disk space, puppet, etc
[07:27:06] PROBLEM - HP RAID on dbstore2001 is CRITICAL: Return code of 255 is out of bounds
[07:27:13] npr issue?
[07:27:16] RECOVERY - MariaDB Slave Lag: m3 on dbstore2001 is OK: OK slave_sql_lag not a slave
[07:27:22] npre?
[07:27:25] !log restart nagios-nrpe-server on dbstore2001
[07:27:26] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86638.68 seconds
[07:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:32] RECOVERY - mysqld processes on dbstore2001 is OK: PROCS OK: 1 process with command name mysqld
[07:27:34] now.. why did that happen ?
[07:27:36] RECOVERY - Disk space on dbstore2001 is OK: DISK OK
[07:27:36] RECOVERY - MariaDB Slave SQL: s6 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:27:46] RECOVERY - MariaDB Slave Lag: s4 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86684.35 seconds
[07:27:46] RECOVERY - MariaDB Slave IO: s4 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:46] RECOVERY - MariaDB Slave IO: s1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:46] RECOVERY - MariaDB Slave SQL: s7 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:27:46] RECOVERY - MariaDB Slave IO: s3 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:46] RECOVERY - configured eth on dbstore2001 is OK: OK - interfaces up
[07:27:46] RECOVERY - MariaDB Slave IO: m2 on dbstore2001 is OK: OK slave_io_state not a slave
[07:27:47] RECOVERY - MariaDB Slave IO: s2 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:52] RECOVERY - MariaDB disk space on dbstore2001 is OK: DISK OK
[07:27:52] RECOVERY - MariaDB Slave IO: m3 on dbstore2001 is OK: OK slave_io_state not a slave
[07:27:52] RECOVERY - dhclient process on dbstore2001 is OK: PROCS OK: 0 processes with command name dhclient
[07:27:56] RECOVERY - DPKG on dbstore2001 is OK: All packages OK
[07:27:56] RECOVERY - MariaDB Slave SQL: x1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[07:27:56] RECOVERY - MariaDB Slave SQL: m3 on dbstore2001 is OK: OK slave_sql_state not a slave
[07:27:56] RECOVERY - MariaDB Slave SQL: s1 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:27:56] RECOVERY - Check size of conntrack table on dbstore2001 is OK: OK: nf_conntrack is 0 % full
[07:27:56] RECOVERY - MariaDB Slave IO: s7 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:56] RECOVERY - MariaDB Slave IO: s5 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:27:57] RECOVERY - MariaDB Slave Lag: m2 on dbstore2001 is OK: OK slave_sql_lag not a slave
[07:28:06] RECOVERY - MariaDB Slave Lag: x1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 77036.97 seconds
[07:28:06] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86521.97 seconds
[07:28:06] RECOVERY - MariaDB Slave IO: x1 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:28:06] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 15 minutes ago with 0 failures
[07:28:07] RECOVERY - MariaDB Slave SQL: m2 on dbstore2001 is OK: OK slave_sql_state not a slave
[07:28:07] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86512.68 seconds
[07:28:07] RECOVERY - MariaDB Slave IO: s6 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[07:28:07] RECOVERY - salt-minion processes on dbstore2001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:28:07] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:28:08] RECOVERY - Check whether ferm is active by checking the default input chain on dbstore2001 is OK: OK ferm input default policy is set
[07:28:08] RECOVERY - MariaDB Slave SQL: s2 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:28:09] RECOVERY - MariaDB Slave SQL: s4 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[07:28:34] marostegui, 6 days ago was when that server had problems, could that have caused a problem on monitoring or something?
[07:28:48] jynus: Don't think so, because it was only MySQL
[07:28:50] it is a question, I do not know when that had problems
[07:28:56] The rest was perfectly fine
[07:29:09] May 18 07:27:12 dbstore2001 systemd[1]: Stopping LSB: Start/Stop the Nagios remote plugin execution daemon...
[07:29:09] May 11 07:47:32 dbstore2001 sudo[42906]: pam_unix(sudo:session): session closed for user root
[07:29:09] May 11 07:47:32 dbstore2001 sudo[42906]: pam_unix(sudo:session): session opened for user root by (uid=0)
[07:29:09] May 11 07:47:32 dbstore2001 sudo[42906]: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
[07:29:16] that's in reverse chronological order
[07:29:17] or maybe the script got confused?
[07:29:24] why are 7 days missing from nagios-nrpe-server logs
[07:29:25] ?
[07:29:43] or overloaded and made monitoring fail?
[07:30:01] I would restart that server
[07:30:11] maybe there are host problems
[07:30:29] akosiaris: that's also the icinga checks duration (7d)
[07:30:44] ema: ?
[07:30:50] During the problems last week, the server had the disk utilization 100% for a few hours, that was all (during mysql start)
[07:30:55] I don't follow
[07:31:03] https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=dbstore2001&var-network=eth0&from=now-7d&to=now
[07:31:24] maybe too early for me, but what is the icinga checks duration ?
[07:31:40] akosiaris: in icinga the checks were marked as failing since 7d
[07:31:44] eg: dbstore2001 Check the NTP synchronisation status of timesyncd CRITICAL 2017-05-18 07:26:47 6d 23h 34m 50s
[07:31:45] ah
[07:32:14] That does match more or less the server issues
[07:32:17] server = mysql issues
[07:32:32] so after mysql got overloaded, maybe monitoring got in a bad state
[07:32:32] ok so nagios-nrpe-server was most probably not running for the last 7 days
[07:32:37] right
[07:32:41] something like that
[07:32:49] was the server in scheduled downtime or something ?
[07:32:54] and maybe manuel had acked the problems due to the mysql problems?
[07:33:00] akosiaris: I probably downtimed it while debugging the issues
[07:33:03] acked/downtime/etc
[07:33:06] which is ok
[07:33:17] if it happens on friday and it is not a critical host like this
[07:33:29] But how can MySQL overload break monitoring? Because if you guys check the graph I posted earlier, the server wasn't really overloaded
[07:33:30] "just don't page me for a week"
[07:33:37] in theory no
[07:33:41] but who knows
[07:33:44] it has so many checks
[07:33:49] about replication
[07:33:55] that maybe it got overloaded too
[07:34:25] Yes, could be
[07:34:27] it will probably have >21 replication checks
[07:34:36] Some sort of cascade effect or something
[07:34:59] which all break because the timeout on the check is not handled appropriately (that is what I am fixing in the new check script)
[07:35:17] or
[07:35:22] alternatively
[07:35:35] nrpe could have caused the mysql issues in the first place
[07:36:08] that is an interesting thought indeed
[07:36:11] (indirectly, because the check does not respond well to the timeout)
[07:36:36] so there are 2 actionables here- fix the check script, which I was going to do anyway
[07:36:43] !log installing freetype security updates on trusty (jessie already fixed)
[07:36:48] and investigate why nrpe can die
[07:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:16] RECOVERY - HP RAID on dbstore2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor
[07:37:31] it is not only the mysql check
[07:37:55] the HP RAID is also prone to timeout^
[07:38:04] Normally I have seen graphs not graphing when the server is super overloaded, and in this case, there are no graph gaps
[07:38:08] so we have a perfect storm there
[07:38:26] no, mysql was working well, at least in the last 7 days
[07:38:42] No, I mean all graphs stopping to draw, disk space, load, everything
[07:39:01] yes, I am agreeing with you :-)
[07:39:19] But if we check the last 7 day load graph, we can see that when the mysql issues were happening, the load was nothing too outstanding
[07:39:23] very strange
[07:39:54] the checks get stuck, they usually won't create more processes
[07:40:40] we need to get rid of the blocking checks both on nagios and prometheus
[07:40:43] there's nothing in the logs about how nagios-nrpe-server died
[07:40:51] I can guess why it was not restarted by puppet
[07:41:01] why?
[07:41:07] the init script relies on the pid file
[07:41:08] not fully killed?
[07:41:13] and the pid file was still around
[07:41:18] it's a crappy initscript put simply
[07:41:36] but the pid is there precisely to read it! :-)
[07:41:56] status_of_proc -p $PIDDIR/nrpe.pid "$DAEMON" "$NAME" && exit 0 || exit $?
[07:42:00] that's what it does
[07:42:10] I would expect that to work fine
[07:43:00] yes, it should
[07:43:12] so maybe the process was somehow alive and dead?
[07:43:21] zombie ?
[07:43:29] no it would have shown in my ps
[07:43:35] there was nothing running
[07:43:40] not unix-zombie
[07:44:23] maybe the replacement process was overwritten but could not lock its state or something
[07:44:31] I do not know
[07:44:39] this is one of a kind
[07:44:40] I just killed it btw just to test
[07:45:01] I am trying to check puppet logs to see if it was trying to bring it up or complaining or something, and no, nothing
[07:45:03] why are we still using init for nagios, is it what jessie has?
[07:45:12] yes
[07:45:15] ok
[07:45:25] ah there we go
[07:45:30] to be fair, the checks were failing
[07:45:35] so I've just done a kill
[07:45:37] so it is not like we hadn't noticed
[07:45:49] I do not think there is much actionable there
[07:45:55] and service nagios-nrpe-server status reports everything ok
[07:46:01] nice
[07:46:10] so it is only that check
[07:46:11] * nagios-nrpe-server.service - LSB: Start/Stop the Nagios remote plugin execution daemon
[07:46:11] Loaded: loaded (/etc/init.d/nagios-nrpe-server)
[07:46:11] Active: active (exited) since Thu 2017-05-18 07:43:06 UTC; 2min 13s ago
[07:46:11] Process: 31235 ExecStop=/etc/init.d/nagios-nrpe-server stop (code=exited, status=0/SUCCESS)
[07:46:12] Process: 31310 ExecStart=/etc/init.d/nagios-nrpe-server start (code=exited, status=0/SUCCESS)
[07:46:16] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3739.30 Read Requests/Sec=1828.80 Write Requests/Sec=1.60 KBytes Read/Sec=36236.00 KBytes_Written/Sec=445.20
[07:46:19] so... lemme check what puppet now does
[07:46:32] I am really really surprised this has not bitten us worse
[07:46:53] yup puppet DID not start nagios-nrpe-server on a test run
[07:47:08] so my guess is that somehow nagios-nrpe-server stopped running
[07:47:26] sorry, I do not fully understand, but maybe you want to write a ticket about that, and of course I can help?
[07:47:50] I'd prefer to spend the time to ship a systemd unit and fix the problem for good ;-)
[07:48:02] I agree :-)
[07:48:35] and as I said, checks were failing, so it was not going as badly
[07:48:49] the stretch package already has a systemd unit
[07:49:31] moritzm: with a sane Restart= value? :)
[07:49:57] moritzm: nice.. I guess I'll just steal it and ship that :-)
[07:50:51] ema: Restart=on-abort, yep
[07:51:47] oh, it paged, sorry
[07:52:16] icinga normally doesn't fail
[07:53:28] I like how we have systemd unit checks being done by nrpe
[07:53:34] but nrpe itself is not run via systemd
[07:53:48] can't help but laugh
[07:54:24] :-)
[07:54:34] it cannot check itself
[07:54:46] ah that would be a nice check
[07:55:00] :D
[07:56:16] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.30 Read Requests/Sec=3.40 Write Requests/Sec=0.40 KBytes Read/Sec=15.58 KBytes_Written/Sec=4.00
[07:56:56] RECOVERY - Check the NTP synchronisation status of timesyncd on dbstore2001 is OK: OK: synced at Thu 2017-05-18 07:56:47 UTC.
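The fix akosiaris uploads shortly afterwards (https://gerrit.wikimedia.org/r/354183, "nrpe: Ship a systemd unit file") replaces the LSB script whose status line is quoted above with a native unit. A minimal sketch of what such a unit looks like, modeled on the Debian stretch packaging moritzm mentions; the unit actually merged into puppet may differ in paths and options:

```
# /lib/systemd/system/nagios-nrpe-server.service (sketch)
[Unit]
Description=Nagios Remote Plugin Executor
After=network.target

[Service]
# Run nrpe in the foreground so systemd supervises the real process
# instead of trusting a pid file left behind by an init script.
ExecStart=/usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f
# The Restart= value confirmed above: restart only on abnormal exit.
Restart=on-abort

[Install]
WantedBy=multi-user.target
```

With a unit like this, a crashed nrpe comes back on its own instead of sitting dead for a week behind a wall of "Return code of 255" alerts.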
[08:01:58] !log reboot rhenium for update to Linux 4.9
[08:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:20] (03PS1) 10Giuseppe Lavagetto: restbase: convert deployment-prep to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/354182
[08:09:26] !log Deploy alter table on s1.enwiki directly on codfw master (db2016) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[08:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:42] (03CR) 10jerkins-bot: [V: 04-1] restbase: convert deployment-prep to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/354182 (owner: 10Giuseppe Lavagetto)
[08:09:53] (03PS1) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183
[08:10:09] <_joe_> meh
[08:12:24] (03PS2) 10Giuseppe Lavagetto: restbase: convert deployment-prep to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/354182
[08:15:46] (03CR) 10Giuseppe Lavagetto: [C: 032] "noop in production" [puppet] - 10https://gerrit.wikimedia.org/r/354182 (owner: 10Giuseppe Lavagetto)
[08:16:51] !log reboot dataset1001 for kernel update
[08:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:01] (03PS1) 10Elukey: Fix MediaWiki centralauth errors graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/354184
[08:18:41] (03CR) 10Elukey: [C: 032] Fix MediaWiki centralauth errors graphite alarm [puppet] - 10https://gerrit.wikimedia.org/r/354184 (owner: 10Elukey)
[08:20:10] _joe_ you can merge --^ whenever you are ready
[08:20:32] (03CR) 10Muehlenhoff: [C: 031] "One nit, but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354183 (owner: 10Alexandros Kosiaris)
[08:21:36] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/mnt/data]
[08:22:05] <_joe_> elukey: oh yeah gimme a min sorry
[08:22:20] <_joe_> also thanks, I planned on fixing that
[08:22:48] I left the other two because the syntax seems fine but no datapoints :(
[08:24:08] <_joe_> it wasn't "no datapoints"
[08:24:16] <_joe_> it's "not enough datapoints"
[08:24:18] <_joe_> and I had a fix
[08:24:30] <_joe_> laters :P
[08:24:41] <_joe_> deployment-prep is sucking my soul out of me
[08:25:20] ok nice (for the metrics, not the soul :)
[08:25:26] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:25:32] fixed --^
[08:31:20] _joe_ the aqs refactoring seems really good, thanks a lot for doing it
[08:32:09] !log upgrading mw1180-mw1188, mw1200-mw1208 to new hhvm-luasandbox/hhvm-luasandbox-dbg packages
[08:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:45] (03PS1) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186
[08:35:47] _joe_: <3
[08:36:12] (03PS2) 10Thcipriani: Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148)
[08:37:16] (03PS1) 10Giuseppe Lavagetto: profile::cassandra: remove useless pick() [puppet] - 10https://gerrit.wikimedia.org/r/354187
[08:38:11] <_joe_> greg-g: it's totally not your team's fault, I think we do have a path to a solution I outlined in T161675
[08:38:11] T161675: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675
[08:38:49] (03CR) 10Alexandros Kosiaris: "Hm so it looks like out of 215 system::role stanzas, 154 use the role:: prefix and 61 don't. It's clearly not consistent and a sign we wil" [puppet] - 10https://gerrit.wikimedia.org/r/354172 (owner: 10Dzahn)
[08:38:55] (03CR) 10Greg Grossmeier: [C: 031] Scap3: deploy jobrunner with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani)
[08:39:17] (03CR) 10Greg Grossmeier: [C: 031] "That was Chad btw ^" [puppet] - 10https://gerrit.wikimedia.org/r/354186 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani)
[08:39:58] _joe_: yeah, we were just talking about that. And the <3 is true love, we feel your pain :)
[08:40:18] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::cassandra: remove useless pick() [puppet] - 10https://gerrit.wikimedia.org/r/354187 (owner: 10Giuseppe Lavagetto)
[08:42:04] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753)
[08:43:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[08:45:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[08:45:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354188 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[08:46:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 - T159753 T164530 (duration: 00m 39s)
[08:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:43] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[08:46:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[08:48:53] (03CR) 10Elukey: [C: 04-1] "Thanks a lot, looks great! Just a couple of questions/notes for Eric before proceeding:" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto)
[08:49:12] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189
[08:50:25] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189 (owner: 10Marostegui)
[08:51:26] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189 (owner: 10Marostegui)
[08:51:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354189 (owner: 10Marostegui)
[08:52:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 - T159753 T164530 (duration: 00m 39s)
[08:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:26] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[08:52:26] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[08:57:00] (03PS1) 10Muehlenhoff: Also strip rpcbind/nfs-common deps on jessie installs [puppet] - 10https://gerrit.wikimedia.org/r/354190 (https://phabricator.wikimedia.org/T106477)
[09:01:08] (03PS1) 10Marostegui: db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611)
[09:02:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[09:03:49] (03PS1) 10Jcrespo: mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192
[09:03:51] (03CR) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354183 (owner: 10Alexandros Kosiaris)
[09:03:58] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[09:05:43] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1090, depool db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354191 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui)
[09:06:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1090, depool db1076 - T162611 (duration: 00m 39s)
[09:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:18] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[09:06:52] !log Deploy alter table s2.revision table - db1076 - https://phabricator.wikimedia.org/T162611
[09:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:11] !log upgrading image scalers mw1294/mw1295 to Linux 4.9 and HHVM 3.18
[09:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:58] (03PS2) 10Alexandros Kosiaris: nrpe: Ship a systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/354183
[09:08:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753)
[09:10:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:11:17] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:11:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354195 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:14:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 - T159753 T164530 (duration: 00m 39s)
[09:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:45] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[09:14:45] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[09:15:32] (03PS1) 10Giuseppe Lavagetto: deployment-prep: additional fixes to restbase hiera [puppet] - 10https://gerrit.wikimedia.org/r/354196
[09:16:17] (03CR) 10Giuseppe Lavagetto: [C: 032] deployment-prep: additional fixes to restbase hiera [puppet] - 10https://gerrit.wikimedia.org/r/354196 (owner: 10Giuseppe Lavagetto)
[09:16:20] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] deployment-prep: additional fixes to restbase hiera [puppet] - 10https://gerrit.wikimedia.org/r/354196 (owner: 10Giuseppe Lavagetto)
[09:16:43] (03PS2) 10Jcrespo: mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192
[09:17:58] (03CR) 10Marostegui: [C: 031] mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192 (owner: 10Jcrespo)
[09:21:04] 06Operations, 07Zuul: Add a stretch debian package for zuul - https://phabricator.wikimedia.org/T165621#3272870 (10hashar) 05Open>03declined This is premature. Will come to it when it is time :-}
[09:23:52] (03PS1) 10Marostegui: db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753)
[09:25:20] (03PS1) 10Giuseppe Lavagetto: restbase: remove legacy classes, roles [puppet] - 10https://gerrit.wikimedia.org/r/354201
[09:25:22] (03PS1) 10Giuseppe Lavagetto: cassandra: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/354202
[09:28:25] (03CR) 10Giuseppe Lavagetto: [C: 032] restbase: remove legacy classes, roles [puppet] - 10https://gerrit.wikimedia.org/r/354201 (owner: 10Giuseppe Lavagetto)
[09:28:30] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:28:41] (03CR) 10Giuseppe Lavagetto: [C: 032] cassandra: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/354202 (owner: 10Giuseppe Lavagetto)
[09:31:08] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:31:20] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1083, depool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354200 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:33:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083, depool db1080 - T159753 T164530 (duration: 00m 38s)
[09:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:36] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[09:33:37] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[09:37:32] (03PS3) 10Jcrespo: mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192
[09:43:33] (03PS1) 10Marostegui: db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753)
[09:45:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:45:57] (03CR) 10Jcrespo: [C: 032] mariadb: set db2051 as enabled for full reimage [puppet] - 10https://gerrit.wikimedia.org/r/354192 (owner: 10Jcrespo)
[09:46:14] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:46:23] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1080, depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354203 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[09:47:07] !log upgrading image scalers in codfw to Linux 4.9 and HHVM 3.18
[09:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:00] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080, depool db1073 - T159753 T164530 (duration: 00m 39s)
[09:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:09] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[09:49:09] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[09:59:08] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[10:00:08] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 74668 bytes in 0.236 second response time
[10:07:51] (03PS7) 10Mforns: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[10:08:35] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[10:08:44] 06Operations, 10Gerrit: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631#3272935 (10demon) We can't move them behind LVS. Unlike Phabricator, which uses a separate hostname for the SSH service, Gerrit exposes them over the same domain. Last time we...
[10:10:49] (03PS1) 10Jcrespo: [WIP]Initial commit of existent python scripts [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/354206
[10:21:33] (03PS8) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:22:24] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[10:26:36] (03PS9) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:37:28] (03PS1) 10Alexandros Kosiaris: Document servermon optimization [puppet] - 10https://gerrit.wikimedia.org/r/354207 (https://phabricator.wikimedia.org/T164604)
[10:54:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] New upstream version 1.8.3 [calico-cni] - 10https://gerrit.wikimedia.org/r/353867 (owner: 10Giuseppe Lavagetto)
[10:55:55] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Enable Kafka TLS and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#3273006 (10Ottomata)
[10:56:18] PROBLEM - swift-object-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-container-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-object-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-account-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-container-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:19] PROBLEM - swift-account-reaper on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:28] PROBLEM - swift-account-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:28] PROBLEM - swift-account-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:56:28] PROBLEM - salt-minion processes on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
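This burst on ms-be1019, like the dbstore2001 flood earlier in the morning, is every NRPE-backed check on one host failing at once rather than the monitored services themselves breaking. A quick way to tell the two cases apart from the Icinga side is to query the NRPE daemon directly; a sketch, assuming the stock Debian plugin path, with `check_swift_object_replicator` standing in for whatever remote command name is actually configured:

```
# With no -c argument check_nrpe just asks the daemon for its version.
# "Connection refused" or a timeout here means the daemon itself is the
# problem and every check on the host will alert, as seen above.
/usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet

# Run one specific remote command once the daemon answers
/usr/lib/nagios/plugins/check_nrpe -H ms-be1019.eqiad.wmnet -c check_swift_object_replicator
echo $?   # 0/1/2 = OK/WARNING/CRITICAL from the remote plugin, 3 = UNKNOWN
```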
[10:56:30] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Updating debian version [calico-cni] - 10https://gerrit.wikimedia.org/r/353868 (owner: 10Giuseppe Lavagetto)
[10:57:08] RECOVERY - swift-object-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:57:08] RECOVERY - swift-object-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[10:57:09] RECOVERY - swift-account-server on ms-be1019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:57:09] RECOVERY - swift-container-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:57:09] RECOVERY - swift-container-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:57:09] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] package name change [calico-cni] - 10https://gerrit.wikimedia.org/r/353869 (owner: 10Giuseppe Lavagetto)
[10:57:18] RECOVERY - swift-account-reaper on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[10:57:18] RECOVERY - swift-account-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:57:18] RECOVERY - swift-account-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:57:18] RECOVERY - salt-minion processes on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:01:42] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Encrypt Kafka traffic, and restrict access via ACLs - https://phabricator.wikimedia.org/T121561#3273015 (10Ottomata)
[11:10:26] !log Run pt-table-checksum on s7.metawiki - T163190
[11:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:35] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190
[11:11:59] (03PS3) 10Hashar: interface: add rspec boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/340420
[11:13:12] (03PS10) 10Mforns: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[11:19:27] (03PS5) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840
[11:20:12] (03PS1) 10Marostegui: db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753)
[11:20:26] (03PS1) 10Alexandros Kosiaris: Update docker-host.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/354209
[11:20:28] (03CR) 10jerkins-bot: [V: 04-1] interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[11:20:31] (03CR) 10Hashar: "Rebased. Interestingly the add_ip6_mapped no more use the $::interfaces but $facts['interfaces'] so I had to slightly update the pre cond" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[11:20:40] 06Operations, 10Salt, 06Services, 10Trebuchet: `git deploy service restart` asked for sudo password - https://phabricator.wikimedia.org/T126359#3273037 (10demon) 05Open>03declined Nobody cares about Trebuchet anymore.
[11:20:50] 10Blocked-on-Operations, 06Operations, 10Parsoid, 10Salt: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#3273060 (10demon)
[11:20:56] 06Operations, 10Salt, 10Trebuchet, 13Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#3273058 (10demon) 05Open>03declined Nobody cares about Trebuchet anymore.
[11:21:23] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[11:22:53] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[11:23:01] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1073, depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354208 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[11:23:17] (03PS2) 10Alexandros Kosiaris: Document servermon optimization [puppet] - 10https://gerrit.wikimedia.org/r/354207 (https://phabricator.wikimedia.org/T164604)
[11:23:25] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Document servermon optimization [puppet] - 10https://gerrit.wikimedia.org/r/354207 (https://phabricator.wikimedia.org/T164604) (owner: 10Alexandros Kosiaris)
[11:23:36] (03PS2) 10Alexandros Kosiaris: Update docker-host.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/354209
[11:23:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Update docker-host.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/354209 (owner: 10Alexandros Kosiaris)
[11:23:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1073, depool db1072 - T159753 T164530 (duration: 00m 39s)
[11:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:03] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[11:24:03] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[11:24:33] (03PS6) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840
[11:30:05] (03CR) 10Hashar: [C: 031] "I have cherry picked the patch on deployment-prep again :-}" [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar)
[11:34:16] (03PS1) 10Alexandros Kosiaris: Remove non-ascii character from servermon.rb [puppet] - 10https://gerrit.wikimedia.org/r/354212
[11:34:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove non-ascii character from servermon.rb [puppet] - 10https://gerrit.wikimedia.org/r/354212 (owner: 10Alexandros Kosiaris)
[11:37:24] (03PS1) 10Giuseppe Lavagetto: Workaround: use locally installed glide binary [calico-cni] - 10https://gerrit.wikimedia.org/r/354213
[11:38:55] (03CR) 10Giuseppe Lavagetto: [C: 032] "I will find a better solution later when we use stretch as well." [calico-cni] - 10https://gerrit.wikimedia.org/r/354213 (owner: 10Giuseppe Lavagetto)
[11:38:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Workaround: use locally installed glide binary [calico-cni] - 10https://gerrit.wikimedia.org/r/354213 (owner: 10Giuseppe Lavagetto)
[11:53:41] (03CR) 10ArielGlenn: [C: 031] "Better than nothing; at least it will cover some failure cases." [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn)
[12:06:37] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3273162 (10Cmjohnson) Servers are racked 3 per rack and 6 per row.
[12:31:31] 06Operations, 07HHVM, 07Upstream: HHVM: Crash in server worker - https://phabricator.wikimedia.org/T165669#3273206 (10MoritzMuehlenhoff)
[12:40:42] (03PS1) 10Marostegui: db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753)
[12:42:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:42:51] !log upgrading mw1161 (job runner) to HHVM 3.18
[12:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:16] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:43:28] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1076, depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354218 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:44:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076, depool db1074 - T159753 T164530 (duration: 00m 39s)
[12:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:24] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[12:44:24] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[12:44:29] !log Deploy alter table s2.revision table - db1074 - https://phabricator.wikimedia.org/T162611
[12:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:08] (03PS1) 10Marostegui: db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753)
[12:48:25] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:49:22] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:49:31] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1072, depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354220 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:50:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1072, depool db1066 - T159753 T164530 (duration: 00m 38s)
[12:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:29] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[12:50:30] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[12:51:27] !log upgrading mw1209-mw1219 to Linux 4.9 and HHVM 3.18
[12:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:42] (03PS1) 10Marostegui: db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753)
[12:55:45] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:56:40] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:56:49] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354221 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[12:57:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T159753 T164530 (duration: 00m 38s)
[12:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:44] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[12:57:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1300).
[13:00:25] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3273290 (10akosiaris) The patch above does document everything in the servermon.rb reporter (which is the applicatio...
[13:03:36] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3273297 (10jcrespo) Let's close this as the scope is for me done, and let's open a new ticket with lower priority wi...
[13:14:08] 06Operations, 10DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3273365 (10akosiaris) p:05Triage>03Lowest
[13:14:19] !log reloaded kafkatee to test T151748
[13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:28] T151748: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748
[13:14:56] !log AMEND prev: reloaded kafkatee on oxygen
[13:15:00] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3239514 (10akosiaris) 05Open>03Resolved a:03akosiaris Agreed. Relevant stuff copied over to T165674 (marked lo...
[13:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:48] 06Operations, 10DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3273350 (10akosiaris) Note that this is occuring seldomly and not causing any issues whatsoever. It's mostly out of personal interest that we are looking into this, hence the very low priority.
[13:16:51] (03PS3) 10Muehlenhoff: Add initial class for ferm rules shared by all labstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/353508 (https://phabricator.wikimedia.org/T165136)
[13:20:14] (03PS1) 10Muehlenhoff: Add ferm service for rpc.statd on labstore [puppet] - 10https://gerrit.wikimedia.org/r/354226 (https://phabricator.wikimedia.org/T165136)
[13:20:37] 07Puppet, 10DBA, 10Monitoring, 07Documentation, 13Patch-For-Review: Document performance optimization of servermon and/or puppet reporting tools - https://phabricator.wikimedia.org/T164604#3273391 (10jcrespo) The "separate task" I usually suggest in these cases has a double reason- it makes clear the sco...
[13:32:21] (03PS1) 10Marostegui: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753)
[13:33:09] !log stopping mariadb and preparing for reimage at db2051
[13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[13:35:16] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[13:35:49] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354228 (https://phabricator.wikimedia.org/T159753) (owner: 10Marostegui)
[13:40:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1055 - T159753 T164530 (duration: 01m 03s)
[13:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:40] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[13:40:40] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[13:42:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229
[13:44:44] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229 (owner: 10Marostegui)
[13:45:11] PROBLEM - Host kubernetes2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229 (owner: 10Marostegui)
[13:45:50] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1055" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354229 (owner: 10Marostegui)
[13:45:53] PROBLEM - Host kubernetes2004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:53] PROBLEM - Host kubernetes2002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:01] RECOVERY - Host kubernetes2001 is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms
[13:46:01] RECOVERY - Host kubernetes2004 is UP: PING OK - Packet loss = 0%, RTA = 1.09 ms
[13:46:01] RECOVERY - Host kubernetes2002 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[13:46:18] those are not production, right?
[13:47:00] <_joe_> right [13:47:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1055 - T159753 T164530 (duration: 00m 39s) [13:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:10] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753 [13:47:10] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530 [13:48:01] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [13:49:10] mmmm [13:49:26] 1210/11 complaining about redis? [13:49:29] <_joe_> elukey: you taking a look? [13:49:48] yep, seems a brief spike [13:49:50] from https://logstash.wikimedia.org/app/kibana#/dashboard/memcached [13:49:53] 06Operations, 10Traffic, 10netops: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256#3273427 (10BBlack) I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the port... [13:50:19] that's fine, these are depooled for a reboot and the delayed nutcracker leads to false positives [13:50:26] I was about to ask [13:50:27] super [13:50:41] <_joe_> moritzm: you didn't merge your change? [13:50:41] I'd appreciate a followup review of https://gerrit.wikimedia.org/r/#/c/353556/ [13:50:42] cool, then [13:51:01] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [13:51:03] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273428 (10akosiaris) @Papaul the mistake was clearly in the partman recipe. Fixed in https://gerrit.wikimedia.org/r/#/c/354209/ and the boxes are up and running... [13:51:04] no, revised my patch after reading up on the systemd.unit docs [13:51:22] will add it to deployment-prep for some tests later on or tomorrow [13:51:24] <_joe_> moritzm: also, I'm thinking [13:51:38] <_joe_> we might want to have hhvm and not nutcracker [13:52:17] <_joe_> so in terms of puppet code, we might want to do it a bit differently? [13:52:34] <_joe_> I have to think about it [13:52:37] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273431 (10akosiaris) [13:52:39] hhvm is enabled for service startup at boot, nutcracker is the problem [13:52:54] <_joe_> should we just enable nutcracker as well? [13:53:02] see https://phabricator.wikimedia.org/T163795#3254215 [13:53:12] we should do both, enable nutcracker for startup [13:53:26] (03PS1) 10Alexandros Kosiaris: Assign roles to kubernetes200X hosts [puppet] - 10https://gerrit.wikimedia.org/r/354230 (https://phabricator.wikimedia.org/T164851) [13:53:33] and my service dependency patch [13:53:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Assign roles to kubernetes200X hosts [puppet] - 10https://gerrit.wikimedia.org/r/354230 (https://phabricator.wikimedia.org/T164851) (owner: 10Alexandros Kosiaris) [13:53:56] <_joe_> yeah, I'm just saying we should find a better way to add the dependency [13:54:37] if the dependency is declared, systemd will also sort it correctly during boot startup [13:55:51] I'll test this with with various scenarios in deployment-prep, but I'm fairly sure it's the correct way to declare those. 
but further comments/review appreciated [13:57:11] PROBLEM - DPKG on kubernetes2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:57:12] <_joe_> yeah my point is that we might have cases where we install hhvm and not nutcracker [13:58:11] RECOVERY - DPKG on kubernetes2004 is OK: All packages OK [13:59:51] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [13:59:51] PROBLEM - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:00:02] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:00:11] PROBLEM - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 25 seconds ago with 3 failures. Failed resources (up to 3 shown): Package[darmstadtium.eqiad.wmnet/calico/node],Logical_volume[data],Logical_volume[metadata] [14:00:19] expected ^ [14:03:04] _joe_: hmm, good point, we in fact have such a case (osmium) [14:03:49] (03CR) 10Muehlenhoff: [C: 04-1] "Needs to be revised, we have at least one server running HHVM which doesn't use nutcracker (osmium)" [puppet] - 10https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) (owner: 10Muehlenhoff) [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:08:19] ACKNOWLEDGEMENT - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cni] alexandros kosiaris Still fixing the CNI package issue [14:12:31] !log perform a final reboot on kubernetes200X [14:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:16] PROBLEM - Host kubernetes2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:26] RECOVERY - Host kubernetes2003 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:16:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:16:46] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:16] PROBLEM - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
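Editor's note: on the HHVM/nutcracker boot-ordering thread above, the systemd mechanics being discussed are a Wants=/After= dependency: once declared, systemd both pulls nutcracker in and orders it before HHVM at boot. The sketch below only illustrates that mechanism with a local drop-in; the change actually under review (353556) is a puppet patch, and per the follow-up it has to stay conditional because hosts like osmium run HHVM without nutcracker.
```bash
# Sketch: express "start nutcracker before HHVM" with a systemd drop-in,
# without touching the packaged hhvm.service unit (unit names assumed).
sudo systemctl edit hhvm.service
# in the editor, add:
#   [Unit]
#   Wants=nutcracker.service      # pull nutcracker in whenever hhvm starts (incl. at boot)
#   After=nutcracker.service      # and order hhvm's start after nutcracker's
sudo systemctl daemon-reload      # usually done by systemctl edit already; harmless to repeat
systemctl list-dependencies hhvm.service | grep nutcracker   # verify the dependency is in place
```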
[14:17:16] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:17] damned docker [14:17:32] "Error starting daemon: error initializing graphdriver: devmapper: Base Device UUID and Filesystem verification failed: devicemapper: Error running deviceCreate (ActivateDevice) dm_task_run failed [14:20:36] ACKNOWLEDGEMENT - Check systemd state on kubernetes2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:20:37] ACKNOWLEDGEMENT - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:20:37] ACKNOWLEDGEMENT - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:20:37] ACKNOWLEDGEMENT - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. alexandros kosiaris still bringing the service online [14:23:32] 06Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3273472 (10MoritzMuehlenhoff) None of the packages removed for 8.8 were present in our environment. These are fully rolled out: logback irqbalance libdatetime-timezone-perl wget vim groovy [14:32:05] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273486 (10Papaul) @akosiaris Thanks will resume the install. [14:32:36] !log rebooting mr1-ulsfo for software upgrade - T164970 [14:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:44] T164970: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970 [14:35:36] ACKNOWLEDGEMENT - Host asw-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi upgrading mr1-ulsfo [14:45:56] 06Operations, 10Traffic: Define 3-host infra cluster for traffic pops - https://phabricator.wikimedia.org/T96852#3273494 (10BBlack) The tentative and limited plan for now is to deploy 3x misc/infra hosts (meaning all the hosts other than lvs and cp) at each cache site and not use virtualization. We might revi... [14:46:09] 06Operations, 10ops-ulsfo, 06DC-Ops, 10netops: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3273495 (10ayounsi) 05Open>03Resolved [14:46:53] !log rebooting restbase1008 for update to Linux 4.9 and to pick up OpenJDK security updates [14:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:26] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:27] PROBLEM - puppet last run on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:36] PROBLEM - Check systemd state on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:45] ^downtime exprited, fixing [14:58:46] PROBLEM - nutcracker port on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:46] PROBLEM - Disk space on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:58:46] PROBLEM - Check size of conntrack table on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] PROBLEM - HHVM processes on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] PROBLEM - configured eth on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] PROBLEM - nutcracker process on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:56] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:56] PROBLEM - DPKG on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:56] PROBLEM - dhclient process on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:59:06] PROBLEM - SSH on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:16] PROBLEM - Nginx local proxy to apache on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:59:16] PROBLEM - salt-minion processes on mw1219 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:00:07] RECOVERY - salt-minion processes on mw1219 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:00:16] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:00:26] RECOVERY - Check systemd state on mw1219 is OK: OK - running: The system is fully operational [15:00:36] RECOVERY - Disk space on mw1219 is OK: DISK OK [15:00:36] RECOVERY - Check size of conntrack table on mw1219 is OK: OK: nf_conntrack is 0 % full [15:00:36] RECOVERY - Check whether ferm is active by checking the default input chain on mw1219 is OK: OK ferm input default policy is set [15:00:36] RECOVERY - configured eth on mw1219 is OK: OK - interfaces up [15:00:37] RECOVERY - HHVM processes on mw1219 is OK: PROCS OK: 6 processes with command name hhvm [15:00:46] RECOVERY - dhclient process on mw1219 is OK: PROCS OK: 0 processes with command name dhclient [15:00:46] RECOVERY - DPKG on mw1219 is OK: All packages OK [15:00:56] RECOVERY - SSH on mw1219 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [15:01:06] RECOVERY - Nginx local proxy to apache on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 5.838 second response time [15:01:16] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.916 second response time [15:02:46] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 74782 bytes in 0.299 second response time [15:07:01] (03PS1) 10Marostegui: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) [15:07:15] (03PS2) 10Marostegui: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) [15:07:34] (03CR) 10Marostegui: [C: 04-2] "Wait for maintenance on db1074 finish" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:18:00] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3273518 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reque... 
[15:18:26] 06Operations, 10ops-codfw, 10DBA: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T165629#3273519 (10Marostegui) Thanks! [15:28:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:31:31] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:31:43] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1074, depool db1060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354237 (https://phabricator.wikimedia.org/T162611) (owner: 10Marostegui) [15:32:24] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw: kubernetes200[1-4] racking and onsite setup task - https://phabricator.wikimedia.org/T164851#3273537 (10Papaul) a:05Papaul>03akosiaris [15:34:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1074, depool db1060 - T162611 (duration: 00m 39s) [15:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:16] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611 [15:34:36] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [15:34:48] !log Deploy alter table s2.revision table - db1060 - https://phabricator.wikimedia.org/T162611 [15:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:04] (03CR) 10Eevans: "> Thanks a lot, looks great! Just a couple of questions/notes for" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [15:37:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:47:06] (03CR) 10Elukey: "> Sorry, I'm still trying to construct the mental model of how" [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [15:48:16] 06Operations, 13Patch-For-Review, 15User-Elukey: Cron conflict for kafkatee logrotate on oxygen - https://phabricator.wikimedia.org/T151748#3273551 (10elukey) [15:54:19] <_joe_> !log uploaded package cni to jessie-wikimedia [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:26] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:55:55] <_joe_> paravoid: ^^ as promised :P [15:59:00] heh [15:59:16] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1600). Please do the needful. [16:00:44] there seems to be no patches to merge afaics [16:02:16] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:04:16] (03CR) 10Eevans: "> > It is setting it to 25165824 (in the cassandra profile), no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [16:04:26] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:04:36] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [16:07:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:07:52] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3273597 (10elukey) Apache's mod_ssl seems to default to not expecting any response after sending a close notify: ``` # SSL Protocol Adjustments: # The safe and defau... [16:11:25] !log upgraded cassandra-tools-wmf on aqs hosts [16:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:25] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:18:25] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:24:33] 06Operations, 10Phabricator, 13Patch-For-Review, 06Release-Engineering-Team (Watching / External): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273662 (10RobH) This was already installed and has puppet/salt accepted, seems the ticket just got neglected. @Paladox: You had... [16:24:47] 06Operations, 10Phabricator, 06Release-Engineering-Team (Watching / External): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273663 (10RobH) a:05RobH>03None [16:25:28] 06Operations, 10Phabricator, 06Release-Engineering-Team (Watching / External): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273668 (10Paladox) @RobH hi, that would be releng (@mmodell) for service implementation. [16:27:08] 06Operations, 10Phabricator, 06Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3273670 (10RobH) a:03mmodell @Paladox: Thanks! @mmodell: It looks like we got this system spun up and installed awhile ago. I've assigned this task to you for s... [16:28:47] !log restarting cassandra on restbase1010, restbase1011, restbase1016, restbase1018 to pick up OpenJDK security updates [16:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:08] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3273679 (10RobH) [16:33:31] 06Operations: Audit / document reasons for not enabling HT? - https://phabricator.wikimedia.org/T165618#3273689 (10jcrespo) BBlack - I am not disagreeing with you, in fact we already do throughput limitation at application side by limiting thread concurrency to 64 on our servers, which is more than the number of... [16:44:56] 06Operations, 07Performance, 15User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3273706 (10elukey) @Krinkle, @aaron - any opinion? There are a couple of hosts that might be better to decom since the hw is really hold (like rdb1007),... 
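Editor's note: for the Cassandra restarts logged above (picking up the OpenJDK security updates on the restbase hosts), a generic one-node rolling-restart sketch is below. It is a sketch only: these hosts run multiple Cassandra instances, so the real unit names, pre-checks, and any wrapper scripts in use are not taken from this log.
```bash
# Sketch: restart a single Cassandra node politely, then wait before moving on.
nodetool drain                        # flush memtables and stop accepting writes
sudo systemctl restart cassandra      # unit name assumed; multi-instance hosts use e.g. cassandra-a
until nodetool status 2>/dev/null | grep -q '^UN'; do sleep 5; done   # wait for Up/Normal before the next host
```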
[16:49:54] (03PS1) 10Hoo man: Log "api-readonly" errors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) [16:53:12] (03Draft1) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [16:53:15] (03PS2) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1700). [17:00:17] no parsoid deployment today [17:02:15] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:02:41] (03PS3) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [17:03:35] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 36, down: 1, dormant: 0, excluded: 0, unused: 0BRfe-0/0/0: down - OOB-transit: UnitedLayer OOB connection (UL CID: 0502) [100Mbps Cu]BR [17:09:15] !log upgrading mw2130-mw2139 to Linux 4.9 and HHVM 3.18 [17:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:45] mr1 outage is expected. [17:10:11] !log mr1-ulsfo having oob connection re-routed at ulsfo, will flap a bit from 1700-1730 gmt [17:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:35] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 38, down: 0, dormant: 0, excluded: 0, unused: 0 [17:20:38] 06Operations, 10DBA, 10Pybal, 07Availability: Create a backend check for pybal to monitor the MySQL protocol being up - https://phabricator.wikimedia.org/T165677#3273775 (10jcrespo) [17:23:15] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 55.71 ms [17:27:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [17:28:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [17:32:20] (03Abandoned) 10Brion VIBBER: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: 10Alex Monk) [17:32:28] (03Restored) 10Brion VIBBER: Make MediaWiki treat $lang of be_x_oldwiki as be-tarask, just don't change the real DB name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236966 (https://phabricator.wikimedia.org/T111853) (owner: 10Alex Monk) [17:33:57] (03Abandoned) 10Brion VIBBER: Disable mp3 uploads for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349733 (https://phabricator.wikimedia.org/T115170) (owner: 10TheDJ) [17:39:15] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] [17:41:15] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [17:57:25] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:57:45] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:58:35] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures [18:00:00] jynus: https://gerrit.wikimedia.org/r/#/c/354138/ [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T1800). [18:00:05] Jamesofur: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:52] AaronSchulz, please try an op in a better timezone, I was finishing and about to leave [18:01:28] (03CR) 10Jcrespo: [C: 031] Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [18:01:43] it is 20h here [18:02:00] jynus: well, I wanted you to sign off before pinging anyone else at least [18:02:09] I just did [18:03:13] please keep an eye on graphite after deploying [18:04:19] * Jamesofur is here [18:19:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [18:22:45] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:25:26] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:26:10] Hello [18:26:34] Jamesofur: I can deploy [18:27:18] I've just reached the hackathon hotel, and I were going to go to metalab, but I can deploy this first. [18:29:58] Dereckson: \o/ [18:30:14] shouldn't be long [18:30:28] famous last words [18:30:48] true story [18:32:18] Jamesofur: live on mwdebug1002 [18:36:43] Dereckson: looks good [18:36:47] ok [18:37:37] !log dereckson@tin Synchronized php-1.30.0-wmf.1/extensions/SecurePoll/includes/pages/DumpPage.php: Revert "Dump should return decrypted votes" (T145695) (duration: 00m 48s) [18:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:47] T145695: Dump should return decrypted votes - https://phabricator.wikimedia.org/T145695 [18:38:07] here you are. [18:38:39] and works on live [18:38:40] thanks :) [18:47:35] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:25] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [19:04:36] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [19:06:31] !log T164865: configure RESTBase tables for size-tiered compaction (dev env only) [19:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:39] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [19:07:35] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
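Editor's note: the SecurePoll revert deployed above follows the standard SWAT pattern visible in the log: stage the change, verify it on mwdebug1002 ("live on mwdebug1002"), then sync the single file. A rough sketch of those steps from the deployment host; paths and exact commands are assumed, not taken from this log:
```bash
# Sketch of the SWAT flow used above (staging path assumed).
cd /srv/mediawiki-staging
git -C php-1.30.0-wmf.1 pull              # bring the cherry-picked revert into the wmf branch checkout
ssh mwdebug1002.eqiad.wmnet 'scap pull'   # stage it on the debug host for testing
# ...tester verifies via the X-Wikimedia-Debug header pointed at mwdebug1002...
scap sync-file php-1.30.0-wmf.1/extensions/SecurePoll/includes/pages/DumpPage.php \
    'Revert "Dump should return decrypted votes" (T145695)'
```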
[19:35:12] (03PS2) 10Dzahn: wikistats: grant db permissions on first run (labs) [puppet] - 10https://gerrit.wikimedia.org/r/353944 [19:41:11] (03PS3) 10Dzahn: wikistats: grant db permissions on first run (labs) [puppet] - 10https://gerrit.wikimedia.org/r/353944 [19:52:38] !log T164865: restarting RESTBase-dev to apply range delete-based render retention [19:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:46] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [20:03:52] !log T164865: restarting RESTBase-dev, range delete-based render retention [20:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:00] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [20:11:20] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3274118 (10Cmjohnson) HP is sending me a new battery and wants me to upgrade the f/w. Part/s shipped: 871264-001 Part description: SPS-BATT PACK MC 96W V3 Carrier... [20:18:52] (03CR) 10Dzahn: [C: 032] wikistats: grant db permissions on first run (labs) [puppet] - 10https://gerrit.wikimedia.org/r/353944 (owner: 10Dzahn) [20:19:13] (03CR) 10Dzahn: [C: 032] "labs-only" [puppet] - 10https://gerrit.wikimedia.org/r/353944 (owner: 10Dzahn) [20:39:19] (03CR) 10Dzahn: [C: 032] return HTTP 503 if database connection fails [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn) [20:39:59] (03CR) 10Dzahn: [V: 032 C: 032] return HTTP 503 if database connection fails [software/dbtree] - 10https://gerrit.wikimedia.org/r/353388 (https://phabricator.wikimedia.org/T163143) (owner: 10Dzahn) [20:44:24] !log terbium / dbtree - deploying gerrit:353388 (sudo -u mwdeploy git pull origin in /srv/dbtree) (T163143) [20:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:32] T163143: dbtree: don't return 200 on error pages - https://phabricator.wikimedia.org/T163143 [20:47:03] !log wasat - git pull - bring to latest, the last changed had never been deployed here like on terbium, but it's also not a backend for dbtree yet (T163141) [20:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:11] T163141: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141 [21:04:29] hey addshore are you busy? [21:13:44] (03CR) 10Dzahn: [C: 032] gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 (owner: 10Dzahn) [21:15:24] (03PS2) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [21:19:08] (03PS3) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [21:20:03] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3274227 (10BBlack) So from the above, apache really has 3 different modes of operation: 1) default - sends close notify, but does not wait for a matching close notify fr... 
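Editor's note: the T163674 comment above is enumerating mod_ssl's shutdown behaviours. The names below are the real per-connection variables from the stock ssl.conf "SSL Protocol Adjustments" section, but how (or whether) they end up toggled in the Wikimedia apache config is not shown in this log, so treat the snippet as illustration only.
```bash
# mod_ssl shutdown modes (per-connection environment variables):
#   (default)              send the TLS close_notify alert, close without waiting for the client's
#   ssl-unclean-shutdown   close the socket without sending close_notify at all
#   ssl-accurate-shutdown  send close_notify and wait for the client's close_notify before closing
# Hypothetical example of toggling one of them per client, mirroring the stock config:
sudo tee /etc/apache2/conf-available/ssl-shutdown-example.conf >/dev/null <<'EOF'
BrowserMatch "MSIE [2-6]" nokeepalive ssl-unclean-shutdown
EOF
sudo apachectl configtest     # sanity-check before enabling anything
```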
[21:29:03] (03PS4) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [21:45:14] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3274287 (10RobH) [21:48:46] 06Operations, 10ops-eqiad: rack and setup 24 parsoid servers - https://phabricator.wikimedia.org/T165520#3274294 (10RobH) p:05Triage>03Normal a:03Cmjohnson [21:49:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [21:52:45] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:18:36] 07Puppet, 06Labs, 10MediaWiki-Vagrant, 13Patch-For-Review, 15User-bd808: Make role::labs::mediawiki_vagrant work on Debian Jessie host systems - https://phabricator.wikimedia.org/T154340#3274360 (10bd808) 05Open>03Resolved This has been functional for several months. I think I just lost track of the... [22:29:15] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [22:32:15] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:36:08] (03CR) 10Dzahn: [C: 032] "i removed the "profile::" part and it also compiles as no-op now http://puppet-compiler.wmflabs.org/6485/" [puppet] - 10https://gerrit.wikimedia.org/r/354075 (owner: 10Dzahn) [22:36:16] (03PS5) 10Dzahn: gerrit: rename "server" IP to "service" IP [puppet] - 10https://gerrit.wikimedia.org/r/354075 [22:49:45] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational [22:52:45] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170518T2300). [23:04:35] RECOVERY - Check systemd state on kubernetes2004 is OK: OK - running: The system is fully operational [23:07:36] PROBLEM - Check systemd state on kubernetes2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:09:48] Nothing to SWAT.