[00:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180216T0000).
[00:00:06] <jouncebot>	 MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:02:00] <MatmaRex>	 hi
[00:05:53] <MatmaRex>	 anyone?
[00:06:55] <ebernhardson>	 i can ship that i suppose
[00:08:18] <wikibugs>	 (03PS2) 10Krinkle: [WIP] extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410109
[00:08:25] <wikibugs>	 (03PS2) 10Krinkle: [WIP] multiversion: Remove support for MW_LANG env override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410110
[00:11:28] <MatmaRex>	 ebernhardson: i would appreciate that
[00:14:05] <ebernhardson>	 MatmaRex: pulled to mwdebug1001
[00:14:52] <MatmaRex>	 ebernhardson: works as expected!
[00:16:52] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.31.0-wmf.21/extensions/ProofreadPage/modules/page/ext.proofreadpage.page.edit.js: SWAT:  T187454 fix text selection on #wpTextbox1 (duration: 00m 58s)
[00:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:07] <stashbot>	 T187454: "encapsulateSelection" broken with the proofread-page content model - https://phabricator.wikimedia.org/T187454
[00:17:16] <ebernhardson>	 MatmaRex: all synced out
[00:19:56] <MatmaRex>	 thanks ebernhardson
[00:19:59] <MatmaRex>	 works in prod
[00:28:45] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] Move all dblists on noc to dblists/ directory, rather than individually (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad)
[00:30:27] <icinga-wm>	 PROBLEM - HHVM rendering on mw2220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:17] <icinga-wm>	 RECOVERY - HHVM rendering on mw2220 is OK: HTTP OK: HTTP/1.1 200 OK - 79320 bytes in 0.281 second response time
[01:00:27] <icinga-wm>	 PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:00:47] <icinga-wm>	 PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:02:08] <icinga-wm>	 PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:02:47] <icinga-wm>	 PROBLEM - puppet last run on elastic1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:08] <icinga-wm>	 PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:28] <icinga-wm>	 PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:38] <icinga-wm>	 PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:58] <icinga-wm>	 PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:58] <icinga-wm>	 PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:07] <icinga-wm>	 PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:17] <icinga-wm>	 PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:04:17] <icinga-wm>	 PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:08:36] <paladox>	 Puppetdb?
[01:08:55] <paladox>	 mutante: herron ^^
[01:28:38] <icinga-wm>	 RECOVERY - puppet last run on wtp1026 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[01:29:07] <icinga-wm>	 RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:29:07] <icinga-wm>	 RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[01:29:07] <icinga-wm>	 RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:29:17] <icinga-wm>	 RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:29:17] <icinga-wm>	 RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:30:28] <icinga-wm>	 RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:30:38] <icinga-wm>	 RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:32:08] <icinga-wm>	 RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[01:32:47] <icinga-wm>	 RECOVERY - puppet last run on elastic1048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[01:33:07] <icinga-wm>	 RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[01:33:37] <icinga-wm>	 RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:07:47] <icinga-wm>	 PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token]
[02:37:47] <icinga-wm>	 RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:08:07] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on labpuppetmaster1001 is OK: OK ferm input default policy is set
[03:08:17] <icinga-wm>	 RECOVERY - Check systemd state on labpuppetmaster1002 is OK: OK - running: The system is fully operational
[03:11:35] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on labtestvirt2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm andrew bogott These are part of a new test that chase is working on no need to alert.
[03:11:36] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm andrew bogott These are part of a new test that chase is working on no need to alert.
[03:25:18] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 768.99 seconds
[03:59:27] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.93 seconds
[06:35:15] <marostegui>	 !log  Deploy schema change on s5 primary master db1070 - T185128 T153182
[06:35:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:32] <stashbot>	 T153182: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182
[06:35:33] <stashbot>	 T185128: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128
[06:37:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/410599 (https://phabricator.wikimedia.org/T187355) (owner: 10Bstorm)
[06:37:09] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1067 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411029
[06:39:54] <wikibugs>	 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T187419#3977674 (10Marostegui) 05Open>03Resolved All good now - thanks Papaul!  ```        logicaldrive 1 (3.3 TB, RAID 1+0, OK)        physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)       physicald...
[06:40:48] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411029 (owner: 10Marostegui)
[06:41:18] <moritzm>	 !log installing quagga security updates
[06:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:18] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1067 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411029 (owner: 10Marostegui)
[06:43:31] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 and db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411029 (owner: 10Marostegui)
[06:46:24] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1089 and db1067 - T162807 (duration: 00m 59s)
[06:46:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:36] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:22:20] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226
[07:23:34] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 (owner: 10Giuseppe Lavagetto)
[07:24:51] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226
[07:26:00] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 (owner: 10Giuseppe Lavagetto)
[07:26:15] <_joe_>	 uhm
[07:26:24] <_joe_>	 gbp works on this version, wth?
[07:30:30] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226
[07:32:38] <_joe_>	 ook, that's better
[07:32:59] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Release new version of conftool [software/conftool] - 10https://gerrit.wikimedia.org/r/410226 (owner: 10Giuseppe Lavagetto)
[07:37:45] <wikibugs>	 (03CR) 10Muehlenhoff: cassandra: enable component/cassandra33 where applicable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410252 (https://phabricator.wikimedia.org/T186619) (owner: 10Eevans)
[08:03:07] <icinga-wm>	 PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 0%, RTA = 2444.27 ms
[08:03:27] <icinga-wm>	 RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 2.96 ms
[08:04:27] <icinga-wm>	 PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:47] <icinga-wm>	 PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:47] <icinga-wm>	 PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:04:57] <icinga-wm>	 PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:05:28] <icinga-wm>	 PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:28] <icinga-wm>	 PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:28] <icinga-wm>	 PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:28] <icinga-wm>	 PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:37] <icinga-wm>	 PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:37] <icinga-wm>	 PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:48] <icinga-wm>	 PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:48] <icinga-wm>	 PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:57] <icinga-wm>	 PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:58] <icinga-wm>	 PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100%
[08:06:41] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:07:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo
[08:07:17] <icinga-wm>	 g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled
[08:07:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo
[08:07:37] <icinga-wm>	 g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled
[08:07:38] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but poo
[08:07:38] <icinga-wm>	 g-udp_10514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled
[08:07:48] <icinga-wm>	 RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 37%, RTA = 2.11 ms
[08:07:57] <icinga-wm>	 RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[08:07:57] <icinga-wm>	 RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms
[08:07:58] <icinga-wm>	 RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms
[08:07:58] <icinga-wm>	 RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 2.45 ms
[08:07:58] <icinga-wm>	 RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 2.46 ms
[08:07:58] <icinga-wm>	 RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 2.30 ms
[08:07:58] <icinga-wm>	 RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms
[08:07:58] <icinga-wm>	 RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms
[08:07:59] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Depool db1053 from s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410932 (https://phabricator.wikimedia.org/T183469)
[08:07:59] <icinga-wm>	 RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 2.46 ms
[08:07:59] <icinga-wm>	 RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 2.39 ms
[08:08:00] <icinga-wm>	 RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 3.91 ms
[08:08:07] <icinga-wm>	 RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[08:08:07] <icinga-wm>	 RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 1.91 ms
[08:08:16] <akosiaris>	 the ganeti host on which bohrium was on.. no surprise there
[08:08:21] <akosiaris>	 probably some extra IO 
[08:08:42] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on logstash.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on 10.2.2.36 port 10514
[08:08:47] <elukey>	 ah so the theory is that bohrium causes the IO spike and then the freeze?
[08:08:59] <elukey>	 :(
[08:09:07] <akosiaris>	 it's not ofc bohrium's fault
[08:09:31] <akosiaris>	 but it's the one VM where some IO spikes are expected
[08:09:41] <akosiaris>	 more like normal part of the process
[08:09:41] <jynus>	 can you put bohrium on its own vm and see what happens?
[08:09:55] <akosiaris>	 its own hardware you mean
[08:10:00] <jynus>	 yes, sorry
[08:10:05] <jynus>	 its own vm host
[08:10:22] <akosiaris>	 I've been trying to reproduce the problem with a single vm on a host and had no luck so far
[08:10:38] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[08:10:39] <akosiaris>	 which means that what you suggest would alleviate the problem
[08:10:45] <jynus>	 isn't that a win?
[08:10:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[08:10:58] <icinga-wm>	 PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:11:00] <akosiaris>	 but it would only be an rsync away
[08:11:07] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:11:23] <jynus>	 moritz mentioned drbd interactions, and that is something I would consider
[08:11:35] <_joe_>	 we're running logstash on ganeti?
[08:11:37] <_joe_>	 uhm
[08:11:56] <akosiaris>	 yeah and I violated on purpose the rules about placement yesterday
[08:11:57] <icinga-wm>	 PROBLEM - Host rutherfordium is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:57] <icinga-wm>	 PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%
[08:11:57] <icinga-wm>	 PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:06] <jynus>	 drbd is the nfs of distributed filesystems
[08:12:26] <_joe_>	 nfs is the nfs of distributed filesystems
[08:12:29] <_joe_>	 :)
[08:12:31] <akosiaris>	 normally not all logstash vms would be placed on the same host
[08:12:48] <icinga-wm>	 PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:48] <icinga-wm>	 PROBLEM - Host releases1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:52] <akosiaris>	 I think this time around ganeti1006 fully died
[08:12:58] <icinga-wm>	 PROBLEM - Host logstash1007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:58] <icinga-wm>	 PROBLEM - Host webperf1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:58] <icinga-wm>	 PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:58] <icinga-wm>	 PROBLEM - Host netmon1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:58] <icinga-wm>	 PROBLEM - Host dubnium is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:58] <icinga-wm>	 PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:59] * akosiaris powercycles
[08:13:07] <icinga-wm>	 PROBLEM - Host mwdebug1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:07] <icinga-wm>	 PROBLEM - Host hassium is DOWN: PING CRITICAL - Packet loss = 100%
[08:13:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1007.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1007.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1007.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but poo
[08:13:47] <icinga-wm>	 g-udp_10514_udp: Servers logstash1007.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1008.eqiad.wmnet are marked down but pooled
[08:13:47] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - logstash-json-tcp_11514: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-log4j_4560: Servers logstash1007.eqiad.wmnet are marked down but pooled: logstash-json-udp_11514_udp: Servers logstash1008.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1007.eqiad.wmnet are marked down but poo
[08:13:47] <icinga-wm>	 g-udp_10514_udp: Servers logstash1007.eqiad.wmnet are marked down but pooled: kibana_80: Servers logstash1007.eqiad.wmnet are marked down but pooled
[08:13:58] <akosiaris>	 !log powercycle ganeti1006
[08:14:07] <icinga-wm>	 PROBLEM - SSH on ganeti1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:14:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:17] <akosiaris>	 !log powercycle ganeti1006 T181121
[08:14:20] <jynus>	 as long as that doesn't bring down our whole infrastructure...
[08:14:21] <akosiaris>	 that's more like it
[08:14:22] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:14:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:30] <stashbot>	 T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121
[08:16:08] <icinga-wm>	 PROBLEM - ganeti-confd running on ganeti1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd
[08:16:17] <icinga-wm>	 RECOVERY - SSH on ganeti1006 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[08:17:07] <icinga-wm>	 RECOVERY - ganeti-confd running on ganeti1006 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd
[08:18:27] <icinga-wm>	 RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms
[08:18:37] <icinga-wm>	 RECOVERY - Host chlorine is UP: PING OK - Packet loss = 0%, RTA = 6.40 ms
[08:18:37] <icinga-wm>	 RECOVERY - Host rutherfordium is UP: PING OK - Packet loss = 0%, RTA = 6.77 ms
[08:18:47] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:18:47] <icinga-wm>	 RECOVERY - Host mwdebug1002 is UP: PING OK - Packet loss = 0%, RTA = 7.58 ms
[08:18:47] <icinga-wm>	 RECOVERY - Host releases1001 is UP: PING OK - Packet loss = 0%, RTA = 6.85 ms
[08:18:57] <icinga-wm>	 RECOVERY - Host logstash1007 is UP: PING OK - Packet loss = 0%, RTA = 7.47 ms
[08:18:57] <icinga-wm>	 RECOVERY - Host dubnium is UP: PING OK - Packet loss = 0%, RTA = 7.93 ms
[08:18:58] <icinga-wm>	 RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 622 bytes in 0.198 second response time
[08:18:58] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 623 bytes in 0.232 second response time
[08:19:07] <icinga-wm>	 RECOVERY - Host bohrium is UP: PING WARNING - Packet loss = 58%, RTA = 8.39 ms
[08:19:07] <icinga-wm>	 RECOVERY - Host hassium is UP: PING OK - Packet loss = 0%, RTA = 9.38 ms
[08:19:17] <icinga-wm>	 RECOVERY - Host netmon1003 is UP: PING OK - Packet loss = 0%, RTA = 8.25 ms
[08:19:22] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on kibana.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 3162 bytes in 0.026 second response time
[08:19:27] <icinga-wm>	 RECOVERY - Host webperf1001 is UP: PING OK - Packet loss = 0%, RTA = 8.56 ms
[08:19:27] <icinga-wm>	 RECOVERY - Host planet1001 is UP: PING OK - Packet loss = 0%, RTA = 8.98 ms
[08:20:57] <icinga-wm>	 RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 8.54 ms
[08:22:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1053 from s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410932 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:23:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy
[08:23:50] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db1053 from s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410932 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:23:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[08:24:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[08:26:34] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db1053 from s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410932 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[08:28:37] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Remove db1053 for mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469)
[08:30:10] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1053 (duration: 00m 57s)
[08:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:48] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[08:34:21] <akosiaris>	 !log manually allocate logstash1008 on ganeti1005 to undo the manual override of sensible allocation rules by ganeti
[08:34:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:41] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Traffic, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3977824 (10Vgutierrez) 05Open>03Resolved
[08:43:58] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Remove db1053 for mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469)
[08:44:33] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Remove db1053 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469)
[08:44:50] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Remove db1053 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469)
[08:48:16] <akosiaris>	 !log doing IO stress tests on ganeti1005. T181121
[08:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:32] <stashbot>	 T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121
[08:55:28] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: otrs: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409462 (owner: 10Dzahn)
[08:55:32] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] otrs: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/409462 (owner: 10Dzahn)
[08:56:58] <wikibugs>	 (03PS7) 10Krinkle: [DNM] Add cron job for expired userrights maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/382631 (https://phabricator.wikimedia.org/T176754) (owner: 10EddieGP)
[09:00:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Noop on mendelevium. One thing that is a bit interesting is what will happen on new installation. Those File[/etc/apache2/mods-available/s" [puppet] - 10https://gerrit.wikimedia.org/r/409462 (owner: 10Dzahn)
[09:12:39] <godog>	 herron mutante what do you think re: https://gerrit.wikimedia.org/r/c/410758/ ?
[09:22:36] <wikibugs>	 (03PS14) 10Muehlenhoff: Add support for selective automatic restarts of stateless services [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991)
[09:23:06] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: httpd: Make sure we purge package's status.conf [puppet] - 10https://gerrit.wikimedia.org/r/411189
[09:23:08] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: apache/httpd: Support IPv6 in status page [puppet] - 10https://gerrit.wikimedia.org/r/411190
[09:23:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "Addressed in https://gerrit.wikimedia.org/r/411189" [puppet] - 10https://gerrit.wikimedia.org/r/409462 (owner: 10Dzahn)
[09:25:27] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] httpd: Make sure we purge package's status.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/411189 (owner: 10Alexandros Kosiaris)
[09:26:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "Maybe let's do httpd first, and apache later? this will go out to all servers basically :P" [puppet] - 10https://gerrit.wikimedia.org/r/411190 (owner: 10Alexandros Kosiaris)
[09:28:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: httpd: Make sure we purge package's status.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/411189 (owner: 10Alexandros Kosiaris)
[09:37:04] <wikibugs>	 (03PS1) 10Elukey: profile::oozie::server: remove symlink installed by Oozie's package [puppet] - 10https://gerrit.wikimedia.org/r/411192 (https://phabricator.wikimedia.org/T184794)
[09:38:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: httpd: Make sure we purge package's status.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/411189 (owner: 10Alexandros Kosiaris)
[09:38:42] <_joe_>	 win 66
[09:38:51] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::oozie::server: remove symlink installed by Oozie's package [puppet] - 10https://gerrit.wikimedia.org/r/411192 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey)
[09:39:00] <vgutierrez>	 _joe_: E_TOO_MANY_WINDOWS
[09:39:19] <_joe_>	 vgutierrez: oh I restarted irssi a couple weeks ago, or I'd be in the 100s
[09:39:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "that's ok then :)" [puppet] - 10https://gerrit.wikimedia.org/r/411189 (owner: 10Alexandros Kosiaris)
[09:41:11] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: httpd: Make sure we purge package's status.conf [puppet] - 10https://gerrit.wikimedia.org/r/411189
[09:41:13] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: httpd: Support IPv6 in status page [puppet] - 10https://gerrit.wikimedia.org/r/411190
[09:41:15] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: apache: Support IPv6 in status [puppet] - 10https://gerrit.wikimedia.org/r/411193
[09:41:27] <wikibugs>	 10Operations: Create 2 VMs in codfw for mwdebug20001 and 2002 - https://phabricator.wikimedia.org/T187468#3977922 (10fgiunchedi) p:05Triage>03Normal
[09:41:38] <wikibugs>	 10Operations, 10ops-eqiad: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#3977923 (10fgiunchedi) p:05Triage>03Normal
[09:41:44] <wikibugs>	 10Operations, 10ops-eqiad: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#3977924 (10fgiunchedi) p:05Triage>03Normal
[09:42:20] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "done" [puppet] - 10https://gerrit.wikimedia.org/r/411190 (owner: 10Alexandros Kosiaris)
[09:42:40] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T187442#3977925 (10fgiunchedi) p:05Triage>03Normal
[09:42:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "https://puppet-compiler.wmflabs.org/compiler02/9995/mendelevium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/411189 (owner: 10Alexandros Kosiaris)
[09:42:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] httpd: Make sure we purge package's status.conf [puppet] - 10https://gerrit.wikimedia.org/r/411189 (owner: 10Alexandros Kosiaris)
[09:42:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] httpd: Support IPv6 in status page [puppet] - 10https://gerrit.wikimedia.org/r/411190 (owner: 10Alexandros Kosiaris)
[09:45:03] <wikibugs>	 (03PS15) 10Muehlenhoff: Add support for selective automatic restarts of stateless services [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991)
[09:45:27] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#3977928 (10jcrespo)
[09:46:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add support for selective automatic restarts of stateless services [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:46:53] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411194 (https://phabricator.wikimedia.org/T162807)
[09:47:19] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Remove db1053 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[09:48:54] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Remove db1053 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[09:49:03] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Remove db1053 from mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411188 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[09:49:40] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411194 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[09:49:45] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411194 (https://phabricator.wikimedia.org/T162807)
[09:53:07] <marostegui>	 jynus: have you deployed your change? I want to deploy mine, I can rebase and deploy both if you want
[09:53:09] <wikibugs>	 10Operations, 10monitoring: Many "NRPE: Unable to read output" from "long running screen/tmux" checks in icinga - https://phabricator.wikimedia.org/T187528#3977961 (10fgiunchedi)
[09:53:12] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411194 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[09:53:51] <jynus>	 it is ongoing
[09:53:54] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1053 (duration: 00m 56s)
[09:53:56] <marostegui>	 ah
[09:53:58] <marostegui>	 hehe
[09:53:59] <jynus>	 it requires 2 deploys
[09:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:24] <marostegui>	 yeah, I will wait for codfw to be finished 
[09:54:27] <wikibugs>	 (03PS1) 10Elukey: hive-env.sh: use HIVE_METASTORE_HADOOP_OPTS to configure the metastore [puppet/cdh] - 10https://gerrit.wikimedia.org/r/411195 (https://phabricator.wikimedia.org/T184794)
[09:55:07] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db1053 (duration: 00m 56s)
[09:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:23] <jynus>	 done
[09:55:29] <marostegui>	 thanks!
[09:56:26] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 - T162807 (duration: 00m 56s)
[09:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:40] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[09:57:40] <wikibugs>	 (03CR) 10Elukey: [V: 032 C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9998/analytics1003.eqiad.wmnet/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/411195 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey)
[09:57:46] <wikibugs>	 10Operations, 10monitoring: Many "NRPE: Unable to read output" from "long running screen/tmux" checks in icinga - https://phabricator.wikimedia.org/T187528#3977977 (10fgiunchedi) cc @Dzahn as original author of the check
[10:00:30] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Move db1053 from eqiad:core:s2 to eqiad:misc:m3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469)
[10:01:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1053 from eqiad:core:s2 to eqiad:misc:m3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:01:47] <wikibugs>	 (03CR) 10Jcrespo: [V: 032] "Ignoring linter" [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:01:53] <wikibugs>	 (03PS1) 10Elukey: Update the cdh module to its latest change [puppet] - 10https://gerrit.wikimedia.org/r/411197 (https://phabricator.wikimedia.org/T184794)
[10:03:07] <wikibugs>	 (03CR) 10Elukey: [C: 032] Update the cdh module to its latest change [puppet] - 10https://gerrit.wikimedia.org/r/411197 (https://phabricator.wikimedia.org/T184794) (owner: 10Elukey)
[10:18:28] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Prepare reimage of db1053 and db2042 to strech [puppet] - 10https://gerrit.wikimedia.org/r/411198 (https://phabricator.wikimedia.org/T183469)
[10:19:40] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: Prepare reimage of db1053 and db2042 to strech [puppet] - 10https://gerrit.wikimedia.org/r/411198 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:20:27] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Prepare reimage of db1053 and db2042 to strech [puppet] - 10https://gerrit.wikimedia.org/r/411198 (https://phabricator.wikimedia.org/T183469)
[10:21:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Prepare reimage of db1053 and db2042 to strech [puppet] - 10https://gerrit.wikimedia.org/r/411198 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:25:57] <icinga-wm>	 PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100%
[10:26:07] <icinga-wm>	 PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[10:26:08] <jynus>	 ?
[10:26:19] <jynus>	 is it ganeti again?
[10:26:38] <icinga-wm>	 PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:29:37] <icinga-wm>	 RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[10:30:37] <icinga-wm>	 RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[10:30:37] <icinga-wm>	 RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[10:30:48] <icinga-wm>	 PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:34:19] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3854215 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on sarin.codfw.wmnet for hosts: ``` ['db2042.codfw.wmnet'] ``` The log can...
[10:34:30] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3978016 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1053.eqiad.wmnet'] ``` The log...
[10:34:53] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move db2042 from codfw:core:s1 to misc:s3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/410794 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo)
[10:34:59] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Move db2042 from codfw:core:s1 to misc:s3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/410794 (https://phabricator.wikimedia.org/T183470)
[10:35:03] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move db2042 from codfw:core:s1 to misc:s3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/410794 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo)
[10:35:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db2042 from codfw:core:s1 to misc:s3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/410794 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo)
[10:36:16] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Move db1053 from eqiad:core:s2 to eqiad:misc:m3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469)
[10:36:22] <wikibugs>	 (03CR) 10Jcrespo: [V: 032] mariadb: Move db1053 from eqiad:core:s2 to eqiad:misc:m3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:36:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1053 from eqiad:core:s2 to eqiad:misc:m3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:36:59] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Move db1053 from eqiad:core:s2 to eqiad:misc:m3 (phabricator) [puppet] - 10https://gerrit.wikimedia.org/r/411196 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[10:52:24] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430
[10:52:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430 (owner: 10Filippo Giunchedi)
[10:53:50] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3978036 (10Marostegui) Please change this to db2093 as we have decided to rename that host from tendril2001 to db2093 (T186123#3975533) Thanks and sorry for the changes!
[10:54:37] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: tweak node_exporter ignored_devices and ignored_fs_types [puppet] - 10https://gerrit.wikimedia.org/r/404430
[10:57:21] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3978043 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1053.eqiad.wmnet'] ```  and were **ALL** successful.
[10:59:16] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3978052 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2042.codfw.wmnet'] ```  and were **ALL** successful.
[11:00:57] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411199
[11:03:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411199 (owner: 10Marostegui)
[11:03:32] <wikibugs>	 (03PS1) 10Marostegui: db1093: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/411200
[11:05:16] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411199 (owner: 10Marostegui)
[11:05:40] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Disable notifications for db1043 and db2012 [puppet] - 10https://gerrit.wikimedia.org/r/411201 (https://phabricator.wikimedia.org/T183469)
[11:06:14] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Disable notifications for db1043 and db2012 [puppet] - 10https://gerrit.wikimedia.org/r/411201 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[11:06:25] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1093 (duration: 00m 56s)
[11:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:41] <marostegui>	 !log Stop MySQL on db1093 for mariadb and kernel upgrade, also update socket path
[11:06:42] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411199 (owner: 10Marostegui)
[11:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:09] <wikibugs>	 (03PS2) 10Marostegui: db1093: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/411200
[11:07:57] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db1093: Update socket path [puppet] - 10https://gerrit.wikimedia.org/r/411200 (owner: 10Marostegui)
[11:09:08] <elukey>	 !log restart nfaccd on rhenium to see if it picks up the new kafka topic config (3 partitions)
[11:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:47] <elukey>	 not really
[11:10:30] <wikibugs>	 (03CR) 10Hashar: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad)
[11:10:33] <wikibugs>	 (03CR) 10Hashar: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410110 (owner: 10Krinkle)
[11:11:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move all dblists on noc to dblists/ directory, rather than individually [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad)
[11:14:40] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411202
[11:17:20] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411202 (owner: 10Marostegui)
[11:18:54] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411202 (owner: 10Marostegui)
[11:19:04] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411202 (owner: 10Marostegui)
[11:20:04] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1093 (duration: 00m 56s)
[11:20:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:01] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Change phabricator db config to modern defaults [puppet] - 10https://gerrit.wikimedia.org/r/411203 (https://phabricator.wikimedia.org/T183470)
[11:21:26] <wikibugs>	 (03PS1) 10Elukey: pmacct: set kafka_patition to -1 on nfacctd.conf [puppet] - 10https://gerrit.wikimedia.org/r/411204 (https://phabricator.wikimedia.org/T181036)
[11:22:24] <wikibugs>	 10Operations, 10monitoring: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530#3978115 (10fgiunchedi) p:05Triage>03Normal
[11:23:24] <wikibugs>	 (03CR) 10Hashar: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411199 (owner: 10Marostegui)
[11:23:27] <wikibugs>	 (03CR) 10Hashar: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411194 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[11:23:57] <wikibugs>	 (03CR) 10Elukey: [C: 032] pmacct: set kafka_patition to -1 on nfacctd.conf [puppet] - 10https://gerrit.wikimedia.org/r/411204 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey)
[11:23:58] <wikibugs>	 10Operations, 10monitoring: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530#3978115 (10Marostegui) This is a slave, so if we need to reboot it, it should be doable.
[11:24:38] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411194 (https://phabricator.wikimedia.org/T162807) (owner: 10Marostegui)
[11:24:40] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411199 (owner: 10Marostegui)
[11:25:08] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411205
[11:28:06] <hashar>	 !log Switching operations/mediawiki-config job for composer to Docker | https://gerrit.wikimedia.org/r/#/c/411206/
[11:28:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:33] <ema>	 !log cp3036: restart varnish-fe to clear 'child restarted' alert  
[11:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:21] <wikibugs>	 (03PS1) 10Elukey: camus: set mapreduce jobs to 3 [puppet] - 10https://gerrit.wikimedia.org/r/411207 (https://phabricator.wikimedia.org/T181036)
[11:30:21] <wikibugs>	 (03PS2) 10Elukey: camus: set netflow's mapreduce jobs to 3 [puppet] - 10https://gerrit.wikimedia.org/r/411207 (https://phabricator.wikimedia.org/T181036)
[11:30:39] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411205 (owner: 10Marostegui)
[11:31:00] <wikibugs>	 (03CR) 10Elukey: [C: 032] camus: set netflow's mapreduce jobs to 3 [puppet] - 10https://gerrit.wikimedia.org/r/411207 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey)
[11:31:50] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411205 (owner: 10Marostegui)
[11:32:03] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411205 (owner: 10Marostegui)
[11:32:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] "Looking good: https://puppet-compiler.wmflabs.org/compiler02/9999/db1059.eqiad.wmnet/ db1059 was already using a symlink on /tmp" [puppet] - 10https://gerrit.wikimedia.org/r/411203 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo)
[11:32:27] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Change phabricator db config to modern defaults [puppet] - 10https://gerrit.wikimedia.org/r/411203 (https://phabricator.wikimedia.org/T183470)
[11:33:01] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1093 (duration: 00m 56s)
[11:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:55] <jynus>	 !log changing socket location on phabricator db hosts T148507
[11:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:09] <jynus>	 !log stopping mysql on db1043, db2012 for cloning data away
[11:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:24] <wikibugs>	 10Operations, 10monitoring: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530#3978185 (10fgiunchedi) Yeah it looks like it'll need a power drain like last time in parent task. cc #ops-eqiad and @Cmjohnson for visibility
[11:36:33] <arturo>	 moritzm: would you +1 https://gerrit.wikimedia.org/r/#/c/410177/ ?
[11:37:23] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[11:38:18] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411208
[11:38:37] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[11:38:41] <moritzm>	 arturo: sure, is later in the afternoon okay? currently busy with other tasks
[11:38:47] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1008 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[11:38:52] <wikibugs>	 10Operations, 10ops-eqiad, 10monitoring: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530#3978201 (10fgiunchedi)
[11:38:55] <jynus>	 the haproxy is me, it is the replicas I am currently rebuilding
[11:39:02] <jynus>	 see last log
[11:39:18] <arturo>	 moritzm: well, I was hoping for something more in realtime
[11:39:31] <icinga-wm>	 PROBLEM - mysqld processes on db1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[11:39:40] <jynus>	 mm
[11:40:07] <godog>	 I'm assuming that's you too jynus ? ^
[11:40:20] <jynus>	 the last puppet run wiped the disable notifications
[11:40:29] <jynus>	 because it hasn't run on icinga yet
[11:42:09] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411208 (owner: 10Marostegui)
[11:42:43] <jynus>	 did you get a page? I didn't
[11:42:54] <marostegui>	 I did
[11:43:18] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411208 (owner: 10Marostegui)
[11:43:31] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411208 (owner: 10Marostegui)
[11:44:00] <wikibugs>	 10Operations, 10ops-codfw: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#3978203 (10fgiunchedi)
[11:44:30] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1093 (duration: 00m 56s)
[11:44:32] <godog>	 yeah that paged for me
[11:44:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:02] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411209
[11:46:58] <jynus>	 I merged https://gerrit.wikimedia.org/r/#/c/411201/ at 12:06 PM
[11:47:30] <volans|off>	 I got it too
[11:48:13] <jynus>	 I didn't, it showed as outside my notification hours
[11:48:29] <wikibugs>	 10Operations, 10ops-codfw: db2049 management unable to login via ssh - https://phabricator.wikimedia.org/T187534#3978218 (10Marostegui) This is a slave, so if @Papaul needs to reboot it to get it fixed, we can easily depool it.
[11:52:39] <jynus>	 now I get a page
[11:58:04] <wikibugs>	 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3978245 (10elukey) After the last round of patches nfacctd/pmacct are sending events to Kafka using three topic partitions rath...
[11:58:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[11:59:51] <wikibugs>	 (03PS9) 10Ema: icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847)
[12:00:00] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] icinga: add check_established_connections plugin [puppet] - 10https://gerrit.wikimedia.org/r/409921 (https://phabricator.wikimedia.org/T170847) (owner: 10Ema)
[12:00:19] <arturo>	 thanks moritzm !!!
[12:01:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Move python3 into standard packages, it's 2018 after all. [puppet] - 10https://gerrit.wikimedia.org/r/411211
[12:03:28] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411209 (owner: 10Marostegui)
[12:03:31] <wikibugs>	 (03CR) 10Chad: toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[12:04:39] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411209 (owner: 10Marostegui)
[12:04:55] <wikibugs>	 (03CR) 10Volans: [C: 031] "+1000 \o/ (to be on the safe side I suggest to run a full compiler, just in case)" [puppet] - 10https://gerrit.wikimedia.org/r/411211 (owner: 10Muehlenhoff)
[12:05:49] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1093 (duration: 00m 56s)
[12:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:38] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411209 (owner: 10Marostegui)
[12:08:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[12:12:40] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411212
[12:14:07] <wikibugs>	 (03PS1) 10Marostegui: db1094: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/411213
[12:14:47] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411212 (owner: 10Marostegui)
[12:15:57] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411212 (owner: 10Marostegui)
[12:16:42] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411212 (owner: 10Marostegui)
[12:16:44] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411215
[12:17:06] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411215
[12:17:15] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1094 (duration: 00m 56s)
[12:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:55] <marostegui>	 !log Stop MySQL on db1094 for mariadb upgrade, kernel upgrade and socket location upgrade
[12:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:18] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db1094: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/411213 (owner: 10Marostegui)
[12:20:46] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411215 (owner: 10Marostegui)
[12:21:54] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411215 (owner: 10Marostegui)
[12:21:59] <wikibugs>	 (03PS1) 10Jcrespo: Mariadb: Set default basedir for phabricator (Mariadb 10.1) [puppet] - 10https://gerrit.wikimedia.org/r/411216 (https://phabricator.wikimedia.org/T183469)
[12:22:04] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3311" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411215 (owner: 10Marostegui)
[12:22:38] <wikibugs>	 (03PS2) 10Jcrespo: Mariadb: Set default basedir for phabricator (Mariadb 10.1) [puppet] - 10https://gerrit.wikimedia.org/r/411216 (https://phabricator.wikimedia.org/T183469)
[12:23:07] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Mariadb: Set default basedir for phabricator (Mariadb 10.1) [puppet] - 10https://gerrit.wikimedia.org/r/411216 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[12:23:34] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1099:3311 - T162807 (duration: 00m 56s)
[12:23:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:48] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[12:31:30] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411218
[12:32:48] <icinga-wm>	 PROBLEM - Check systemd state on serpens is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:33:33] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411218 (owner: 10Marostegui)
[12:35:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991)
[12:35:40] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411218 (owner: 10Marostegui)
[12:36:20] <moritzm>	 ^ serpens is me
[12:36:42] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411218 (owner: 10Marostegui)
[12:36:57] <icinga-wm>	 RECOVERY - Check systemd state on serpens is OK: OK - running: The system is fully operational
[12:37:06] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1094 (duration: 00m 56s)
[12:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:44] <ema>	 !log cp3049: restart varnish-fe to clear 'child restarted' alert
[12:37:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:52] <wikibugs>	 (03PS1) 10Jcrespo: dbproxy: Failover db1043 phabricator replica db to db1053 [puppet] - 10https://gerrit.wikimedia.org/r/411220 (https://phabricator.wikimedia.org/T183469)
[12:40:37] <wikibugs>	 (03CR) 10Marostegui: [C: 031] dbproxy: Failover db1043 phabricator replica db to db1053 [puppet] - 10https://gerrit.wikimedia.org/r/411220 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[12:41:10] <wikibugs>	 (03PS1) 10Jcrespo: dbproxy: Failover db1043 phabricator replica db to db1053 [dns] - 10https://gerrit.wikimedia.org/r/411221 (https://phabricator.wikimedia.org/T183469)
[12:41:14] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411222
[12:41:28] <wikibugs>	 (03CR) 10Marostegui: [C: 031] dbproxy: Failover db1043 phabricator replica db to db1053 [dns] - 10https://gerrit.wikimedia.org/r/411221 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[12:41:30] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] dbproxy: Failover db1043 phabricator replica db to db1053 [puppet] - 10https://gerrit.wikimedia.org/r/411220 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[12:44:21] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411222 (owner: 10Marostegui)
[12:44:42] <jynus>	 !log reload dbproxy1003 configuration
[12:44:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:07] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0
[12:45:18] <jynus>	 db1059,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP, db1053,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP
[12:45:29] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411222 (owner: 10Marostegui)
[12:46:04] <jynus>	 !log reload dbproxy1008 configuration
[12:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:36] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1094 (duration: 00m 56s)
[12:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:48] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411222 (owner: 10Marostegui)
[12:47:17] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1008 is OK: OK check_failover servers up 2 down 0
[12:50:46] <wikibugs>	 (03CR) 10Hashar: "A side effect is all those modules now suddenly depends on "base".  At least for Nodepool, we don't use standard nor base modules.  Althou" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/411211 (owner: 10Muehlenhoff)
[12:51:59] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411223
[12:53:05] <wikibugs>	 (03CR) 10Muehlenhoff: Move python3 into standard packages, it's 2018 after all. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/411211 (owner: 10Muehlenhoff)
[12:53:43] <wikibugs>	 10Operations, 10ops-eqiad, 10hardware-requests: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3978426 (10jcrespo) p:05Triage>03Normal
[12:57:53] <wikibugs>	 (03CR) 10Chad: [C: 031] "Half of these are just plain redundant anyway. I mean python3-yaml implies python3 duh." [puppet] - 10https://gerrit.wikimedia.org/r/411211 (owner: 10Muehlenhoff)
[13:00:01] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411223 (owner: 10Marostegui)
[13:01:12] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411223 (owner: 10Marostegui)
[13:01:25] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411223 (owner: 10Marostegui)
[13:02:17] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1094 (duration: 00m 56s)
[13:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:38] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411226
[13:07:06] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067 and db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411227
[13:07:09] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1067 and db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411227
[13:09:17] <wikibugs>	 10Operations, 10ops-codfw, 10hardware-requests: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (10jcrespo) p:05Triage>03Normal
[13:09:25] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067 and db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411227 (owner: 10Marostegui)
[13:10:36] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067 and db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411227 (owner: 10Marostegui)
[13:10:50] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067 and db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411227 (owner: 10Marostegui)
[13:11:11] <wikibugs>	 (03PS1) 10Jcrespo: Mariadb: Schedule db1043 and db2012 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/411229 (https://phabricator.wikimedia.org/T187543)
[13:11:31] <wikibugs>	 10Operations, 10ops-eqiad: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466#3978490 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff
[13:11:56] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1089 and db1067 - T162807 (duration: 00m 55s)
[13:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:08] <stashbot>	 T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[13:12:47] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411226
[13:12:55] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[13:13:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[13:14:46] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411226 (owner: 10Marostegui)
[13:15:43] <wikibugs>	 (03PS1) 10Jcrespo: dblist: Update latest m3 movements [software] - 10https://gerrit.wikimedia.org/r/411230
[13:15:45] <wikibugs>	 (03PS1) 10Jcrespo: Update mariadb packages to 10.0.34 and 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/411231
[13:15:54] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411226 (owner: 10Marostegui)
[13:15:56] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[13:16:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[13:16:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[13:16:26] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Update mariadb packages to 10.0.34 and 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/411231 (owner: 10Jcrespo)
[13:16:31] <wikibugs>	 (03PS2) 10Jcrespo: Update mariadb packages to 10.0.34 and 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/411231
[13:16:33] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] Update mariadb packages to 10.0.34 and 10.1.31 [software] - 10https://gerrit.wikimedia.org/r/411231 (owner: 10Jcrespo)
[13:17:08] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] dblist: Update latest m3 movements [software] - 10https://gerrit.wikimedia.org/r/411230 (owner: 10Jcrespo)
[13:17:12] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3978507 (10EddieGP) Next steps: # Deploy https://gerrit.wikimedia.org/r...
[13:17:13] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1094 (duration: 00m 56s)
[13:17:14] <wikibugs>	 (03PS2) 10Jcrespo: dblist: Update latest m3 movements [software] - 10https://gerrit.wikimedia.org/r/411230
[13:17:16] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] dblist: Update latest m3 movements [software] - 10https://gerrit.wikimedia.org/r/411230 (owner: 10Jcrespo)
[13:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:32] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411226 (owner: 10Marostegui)
[13:21:03] <wikibugs>	 (03CR) 10Chad: [C: 031] "Also we should kill extract2 outright but hey baby steps." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410109 (owner: 10Krinkle)
[13:21:20] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: toollabs: add apt pinnings for key packages [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193)
[13:24:41] <wikibugs>	 (03PS2) 10Jcrespo: Mariadb: Schedule db1043 and db2012 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/411229 (https://phabricator.wikimedia.org/T187543)
[13:27:40] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] dbproxy: Failover db1043 phabricator replica db to db1053 [dns] - 10https://gerrit.wikimedia.org/r/411221 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo)
[13:38:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "Thanks for taking care of this!" [puppet] - 10https://gerrit.wikimedia.org/r/411211 (owner: 10Muehlenhoff)
[13:40:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Mariadb: Schedule db1043 and db2012 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/411229 (https://phabricator.wikimedia.org/T187543) (owner: 10Jcrespo)
[13:43:45] <wikibugs>	 (03PS2) 10Muehlenhoff: Move python3 into standard packages, it's 2018 after all. [puppet] - 10https://gerrit.wikimedia.org/r/411211
[13:49:44] <wikibugs>	 (03PS1) 10BBlack: Revert "Varnish: block MJ12bot" [puppet] - 10https://gerrit.wikimedia.org/r/411238
[13:53:56] <wikibugs>	 (03CR) 10BBlack: [C: 032] Revert "Varnish: block MJ12bot" [puppet] - 10https://gerrit.wikimedia.org/r/411238 (owner: 10BBlack)
[14:06:28] <chasemp>	 !log T184209 initial setup of labs-instances2-b-codfw and hosts
[14:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:36] <wikibugs>	 10Operations, 10ops-eqiad: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121#3978654 (10akosiaris) First successful and intended reproduction!. After ~2 hours we got  ``` Feb 16 10:29:06 ganeti1005 kernel: [669669.675614] qemu-system-x86: page allocation stalls...
[14:15:24] <akosiaris>	 !log doing more IO stress tests on ganeti1005. T181121. Seems like we can reproduce
[14:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:39] <stashbot>	 T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121
[14:28:43] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3978705 (10chasemp)
[14:28:55] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3978719 (10chasemp) p:05Triage>03Normal
[14:29:41] <wikibugs>	 (03PS1) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249
[14:30:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249 (owner: 10Elukey)
[14:39:15] <wikibugs>	 (03PS2) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249
[14:44:30] <wikibugs>	 (03PS3) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249
[14:56:39] <wikibugs>	 (03PS4) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249
[14:58:28] <wikibugs>	 (03CR) 10Herron: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/410758 (https://phabricator.wikimedia.org/T181519) (owner: 10Filippo Giunchedi)
[14:59:48] <icinga-wm>	 PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:59:58] <icinga-wm>	 PROBLEM - Host sca1004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:09] <icinga-wm>	 PROBLEM - SSH on ganeti1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:03:08] <icinga-wm>	 RECOVERY - SSH on ganeti1005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0)
[15:03:42] <wikibugs>	 (03PS5) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442)
[15:03:55] <wikibugs>	 (03PS1) 10Chad: robots.txt: Combine various NS_SPECIAL disallows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411255
[15:04:29] <wikibugs>	 (03PS1) 10Muehlenhoff: wmf-auto-restart: Fix restarts if multiple libraries in need of a restart [puppet] - 10https://gerrit.wikimedia.org/r/411256 (https://phabricator.wikimedia.org/T135991)
[15:05:38] <icinga-wm>	 RECOVERY - Host sca1004 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[15:05:39] <icinga-wm>	 RECOVERY - Host logstash1008 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms
[15:08:34] <akosiaris>	 ok looks like I reproduced it once more :-)
[15:09:32] <elukey>	 nice! how did you do it?
[15:09:47] <akosiaris>	 elukey: actually you did it. https://phabricator.wikimedia.org/T181121#3978654
[15:10:36] <akosiaris>	 fun part is that all this is done via the VM. I now want to see if I can reproduce it from the host
[15:10:50] <akosiaris>	 that and see if DRBD is or is not related per moritzm suggestion
[15:11:54] <elukey>	 good luck! :)
[15:14:08] <akosiaris>	 !log poweroff sca1004, switch from DRBD to plain disk template T181121
[15:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:22] <stashbot>	 T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121
[15:16:39] <akosiaris>	 !log run T181121#3978654 oneliner once more on sca1004, this time the VM has no DRBD 
[15:16:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:12] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3978838 (10Cmjohnson)
[15:23:14] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Kanban: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3978840 (10Cmjohnson)
[15:24:48] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3965895 (10Cmjohnson) @elukey the disk has been replaced. you may need to add it back   Return tracking info USPS 9202 3946 5301 2437 9877 74 FEDEX 8611918 2393026 74737795
[15:25:36] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3978851 (10Cmjohnson) a:03elukey Assigning to @elukey to add back and resolve task
[15:27:41] <godog>	 !log shut ms-be1018 for bbu swap - T186988
[15:27:51] <godog>	 cmjohnson1: ^ should be powered off shortly
[15:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:54] <stashbot>	 T186988: Degraded RAID on ms-be1018 - https://phabricator.wikimedia.org/T186988
[15:28:06] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@58d2718]: first attempt at ocata branch
[15:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:28] <wikibugs>	 (03PS1) 10Ema: etcd: Introduce reconnectTimeout [debs/pybal] - 10https://gerrit.wikimedia.org/r/411264 (https://phabricator.wikimedia.org/T169765)
[15:29:34] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@58d2718]: first attempt at ocata branch (duration: 01m 28s)
[15:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-backend - port 3128 on cp5007 is CRITICAL: connect to address 10.132.0.107 and port 3128: Connection refused
[15:33:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] wmf-auto-restart: Fix restarts if multiple libraries in need of a restart [puppet] - 10https://gerrit.wikimedia.org/r/411256 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:33:05] <ema>	 bblack: I guess this is you, right? ^
[15:33:08] <icinga-wm>	 RECOVERY - Varnish HTTP text-backend - port 3128 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.498 second response time
[15:34:03] <wikibugs>	 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3978884 (10Cmjohnson) a:05Cmjohnson>03RobH @robh the disk has been replaced. Assigning back to you  Return tracking info  USPS 9202 3946 5301 2437 9854 35 FEDEX 9611918 2393026 74735456
[15:35:23] <bblack>	 ema: yess :)
[15:37:06] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@29f9afb]: second attempt at ocata branch
[15:37:09] <icinga-wm>	 PROBLEM - Varnish HTTP text-backend - port 3128 on cp5007 is CRITICAL: connect to address 10.132.0.107 and port 3128: Connection refused
[15:37:09] <icinga-wm>	 RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy
[15:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] wmf-auto-restart: Fix restarts if multiple libraries in need of a restart [puppet] - 10https://gerrit.wikimedia.org/r/411256 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:39:08] <icinga-wm>	 PROBLEM - Host ms-be1018.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:28] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@29f9afb]: second attempt at ocata branch (duration: 03m 22s)
[15:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:58] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[15:42:12] <godog>	 and I icinga-downtime'd ms-be1018, ah
[15:43:02] <elukey>	 cmjohnson1: all good with analytics1057, thanks!
[15:44:18] <icinga-wm>	 RECOVERY - Host ms-be1018.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.01 ms
[15:44:28] <wikibugs>	 (03PS9) 10Bstorm: servers: install hp-health on all HP servers [puppet] - 10https://gerrit.wikimedia.org/r/410599 (https://phabricator.wikimedia.org/T187355)
[15:44:39] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on analytics1057 - https://phabricator.wikimedia.org/T187146#3978952 (10elukey) 05Open>03Resolved Disk configured and Hadoop worker node back serving traffic, thanks!
[15:46:58] <icinga-wm>	 RECOVERY - HP RAID on ms-be1018 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK
[15:47:12] <godog>	 cmjohnson1: ^ all good \o/ thanks
[15:47:36] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1018 - https://phabricator.wikimedia.org/T186988#3978958 (10fgiunchedi) 05Open>03Resolved Fixed!  ``` 15:46 -icinga-wm:#wikimedia-operations- RECOVERY - HP RAID on ms-be1018 is OK: OK: Slot 1: OK: 1I:1:5,            1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I...
[15:47:48] <wikibugs>	 10Operations, 10ops-codfw: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#3978970 (10Cmjohnson)
[15:47:55] <godog>	 herron: thanks for the review! mind taking a look at https://gerrit.wikimedia.org/r/c/410759/ too?
[15:48:07] <wikibugs>	 10Operations, 10ops-codfw: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#3976153 (10Cmjohnson) a:03Papaul
[15:50:18] <icinga-wm>	 RECOVERY - Varnish HTTP text-backend - port 3128 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 217 bytes in 0.482 second response time
[15:50:25] <wikibugs>	 10Operations, 10ops-eqiad, 10hardware-requests: Decommisson restbase-dev100[1-3] - https://phabricator.wikimedia.org/T171179#3979012 (10Cmjohnson) 05Open>03Resolved Fixed
[15:50:28] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3979014 (10Cmjohnson)
[15:50:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-etherpad-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411276 (https://phabricator.wikimedia.org/T135991)
[15:53:58] <wikibugs>	 (03CR) 10Bstorm: [C: 032] servers: install hp-health on all HP servers [puppet] - 10https://gerrit.wikimedia.org/r/410599 (https://phabricator.wikimedia.org/T187355) (owner: 10Bstorm)
[15:57:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: profile::kafka::burrow: add prometheus monitoring (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey)
[15:58:30] <icinga-wm>	 PROBLEM - Host labstore1006 is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:10] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-VPS: connect eth2 for labneutron2001 and 2002 - https://phabricator.wikimedia.org/T187552#3979061 (10chasemp)
[15:59:51] <icinga-wm>	 PROBLEM - Host labstore1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:00:56] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@16d0b17]: ocata branch with upper constraints
[16:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:37] <chasemp>	 madhuvishy: do we know that labstore1006 is down?
[16:02:42] <chasemp>	 I think yes
[16:03:14] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@16d0b17]: ocata branch with upper constraints (duration: 02m 18s)
[16:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:44] <cmjohnson1>	 chasemp sorry that's me moving it
[16:04:52] <cmjohnson1>	 thought it was already in maint mode
[16:04:54] <madhuvishy>	 chasemp: no. cmjohnson1 are you working on labstore1006?
[16:04:57] <madhuvishy>	 Oh cool
[16:05:06] <madhuvishy>	 I can downtime
[16:05:54] <paravoid>	 cmjohnson1: don't forget to !log :)
[16:06:21] <cmjohnson1>	 !log labstore1006 and labstore1007 down for rack relocation 
[16:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:32] <chasemp>	 cmjohnson1: no worries man I was just checking, I knew that whole move thing was coming up
[16:11:01] <wikibugs>	 (03PS6) 10Elukey: profile::kafka::burrow: add prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442)
[16:13:27] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10004/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey)
[16:13:31] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-openldap-exporter [puppet] - 10https://gerrit.wikimedia.org/r/411219 (https://phabricator.wikimedia.org/T135991)
[16:13:46] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@16f3d8e]: ocata branch with upper new requirements
[16:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:47] <wikibugs>	 (03CR) 10Elukey: "Just noticed File[/etc/default/prometheus-burrow-exporter@main-eqiad], not pretty, going to amend it.." [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey)
[16:16:27] <wikibugs>	 (03CR) 10Elukey: "> Just noticed File[/etc/default/prometheus-burrow-exporter@main-eqiad]," [puppet] - 10https://gerrit.wikimedia.org/r/411249 (https://phabricator.wikimedia.org/T180442) (owner: 10Elukey)
[16:17:37] <wikibugs>	 (03PS1) 10Chad: Remove indirection from search-redirect.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411284
[16:21:46] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@16f3d8e]: ocata branch with upper new requirements (duration: 08m 00s)
[16:21:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:20] <icinga-wm>	 RECOVERY - Host labstore1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[16:30:14] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10netops: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3979092 (10ayounsi) No worries, port description renamed!
[16:31:48] <wikibugs>	 (03CR) 10Rush: [C: 032] openstack: set up values for test and n environment [puppet] - 10https://gerrit.wikimedia.org/r/410943 (https://phabricator.wikimedia.org/T184209) (owner: 10Rush)
[16:31:52] <wikibugs>	 (03PS3) 10Rush: openstack: set up values for test and n environment [puppet] - 10https://gerrit.wikimedia.org/r/410943 (https://phabricator.wikimedia.org/T184209)
[16:33:29] <arturo>	 some day we will have proper names for these clusters ^^^
[16:36:18] <wikibugs>	 (03PS1) 10Rush: openstack: correct key lookup values for n deployment [puppet] - 10https://gerrit.wikimedia.org/r/411288
[16:37:42] <wikibugs>	 (03CR) 10Rush: [C: 032] openstack: correct key lookup values for n deployment [puppet] - 10https://gerrit.wikimedia.org/r/411288 (owner: 10Rush)
[16:43:56] <wikibugs>	 (03CR) 10Niedzielski: [C: 04-1] New: add chromium_render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski)
[16:54:03] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617)
[16:55:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[16:58:14] <wikibugs>	 (03CR) 10Ppchelko: New: add chromium_render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski)
[17:00:34] <icinga-wm>	 PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:01:33] <icinga-wm>	 PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:02:03] <icinga-wm>	 PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:02:03] <icinga-wm>	 PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:02:43] <icinga-wm>	 PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:03:53] <wikibugs>	 (03PS3) 10Niedzielski: New: add chromium_render service [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166)
[17:03:59] <wikibugs>	 (03CR) 10Niedzielski: New: add chromium_render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski)
[17:04:03] <icinga-wm>	 PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:03] <icinga-wm>	 PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:23] <icinga-wm>	 PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:33] <icinga-wm>	 PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:33] <icinga-wm>	 PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:33] <icinga-wm>	 PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:43] <icinga-wm>	 PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:44] <icinga-wm>	 PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:53] <icinga-wm>	 PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:53] <icinga-wm>	 PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:04:55] <wikibugs>	 (03CR) 10Ppchelko: New: add chromium_render service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski)
[17:05:54] <elukey>	 puppetdb --> Active: active (running) since Fri 2018-02-16 16:57:50 UTC; 7min ago
[17:06:31] <wikibugs>	 (03CR) 10Niedzielski: [C: 04-1] "Whoops! Well I will leave this voted down then until a port is decided upon." [puppet] - 10https://gerrit.wikimedia.org/r/409996 (https://phabricator.wikimedia.org/T178166) (owner: 10Niedzielski)
[17:07:03] <icinga-wm>	 PROBLEM - Host labstore1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:06] <wikibugs>	 (03CR) 10Chad: Move all dblists on noc to dblists/ directory, rather than individually (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/394199 (owner: 10Chad)
[17:10:54] <icinga-wm>	 PROBLEM - Varnish HTTP text-backend - port 3128 on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 3128: Connection refused
[17:11:54] <icinga-wm>	 RECOVERY - Varnish HTTP text-backend - port 3128 on cp4028 is OK: HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.157 second response time
[17:12:13] <icinga-wm>	 RECOVERY - Host labstore1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.71 ms
[17:18:29] <wikibugs>	 (03CR) 10Chad: toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[17:19:03] <wikibugs>	 (03CR) 10Chad: toollabs: add apt pinnings for key packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/410177 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez)
[17:29:03] <icinga-wm>	 RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[17:29:04] <icinga-wm>	 RECOVERY - puppet last run on logstash1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:23] <icinga-wm>	 RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:33] <icinga-wm>	 RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[17:29:33] <icinga-wm>	 RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:34] <icinga-wm>	 RECOVERY - puppet last run on analytics1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:43] <icinga-wm>	 RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:44] <icinga-wm>	 RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[17:29:53] <icinga-wm>	 RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:29:53] <icinga-wm>	 RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:30:43] <icinga-wm>	 RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:31:33] <icinga-wm>	 RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:32:03] <icinga-wm>	 RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:32:03] <icinga-wm>	 RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:32:43] <icinga-wm>	 RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[17:34:38] <wikibugs>	 10Operations, 10ops-eqiad, 10hardware-requests: Decommission host erbium - https://phabricator.wikimedia.org/T185226#3979284 (10RobH) p:05Triage>03Normal
[17:36:20] <wikibugs>	 10Operations, 10ops-codfw, 10hardware-requests, 10Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (10RobH)
[17:37:25] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3979295 (10RobH)
[17:37:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3979298 (10RobH)
[17:39:12] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2012 - https://phabricator.wikimedia.org/T187543#3978461 (10RobH) Since this is pending the DBA team's work on stating the new host is online, I've appended in the #DBA flag.  Once the DBA team work is done (their s...
[17:39:18] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3978426 (10RobH) Since this is pending the DBA team's work on stating the new host is online, I've appended in the #DBA flag.  Once the DBA team work is done (their s...
[17:40:04] <robh>	 back into the s/salt/decom mines
[17:57:53] <wikibugs>	 (03PS1) 10Chico Venancio: Graphite sumSeries to reduce shinken puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315
[18:02:49] <wikibugs>	 10Operations, 10Analytics-Kanban, 10monitoring, 10netops, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3979339 (10Nuria) Are we planning to use tranquility to move the data into druid or rather just kafka-> camus-> hive?
[18:07:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "I like the change as commented on IRC. This review is more about the format of the patch:" [puppet] - 10https://gerrit.wikimedia.org/r/411315 (owner: 10Chico Venancio)
[18:23:23] <wikibugs>	 (03PS1) 10RobH: decom mc201[78] [dns] - 10https://gerrit.wikimedia.org/r/411323 (https://phabricator.wikimedia.org/T187474)
[18:23:43] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3979396 (10RobH)
[18:23:48] <wikibugs>	 (03PS1) 10Cmjohnson: Updating MAC address labstore1006-7 [puppet] - 10https://gerrit.wikimedia.org/r/411324 (https://phabricator.wikimedia.org/T186756)
[18:23:59] <wikibugs>	 (03PS3) 10Dzahn: Revert "mediawiki: reduce frequency of purge_abusefilter to weekly" [puppet] - 10https://gerrit.wikimedia.org/r/411031
[18:24:32] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Updating MAC address labstore1006-7 [puppet] - 10https://gerrit.wikimedia.org/r/411324 (https://phabricator.wikimedia.org/T186756) (owner: 10Cmjohnson)
[18:24:43] <wikibugs>	 (03CR) 10RobH: [C: 032] decom mc201[78] [dns] - 10https://gerrit.wikimedia.org/r/411323 (https://phabricator.wikimedia.org/T187474) (owner: 10RobH)
[18:24:47] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "per Reedy's comment on ticket, running "en" just takes minutes now, going back to how things were before" [puppet] - 10https://gerrit.wikimedia.org/r/411031 (owner: 10Dzahn)
[18:24:52] <wikibugs>	 10Operations, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3979420 (10ayounsi)
[18:25:03] <wikibugs>	 (03PS4) 10Dzahn: Revert "mediawiki: reduce frequency of purge_abusefilter to weekly" [puppet] - 10https://gerrit.wikimedia.org/r/411031
[18:25:53] <wikibugs>	 (03PS1) 10Madhuvishy: partman: Add recipe for dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/411326
[18:26:21] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests, 10Patch-For-Review: Decommission old and unused/spare servers in codfw - https://phabricator.wikimedia.org/T187474#3979427 (10RobH) a:05RobH>03Papaul All of these systems are now ready for on-site steps, assigned to @papaul.
[18:27:52] <wikibugs>	 10Operations, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3979437 (10ayounsi) @Cmjohnson please pre-populate the following interfaces with SFP-Ts: ``` ge-2/0/9                   db1099 ge-2/0/17                  db1060 ge-2/0/18                  ms-be...
[18:31:12] <wikibugs>	 (03PS2) 10RobH: partman: Add recipe for dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/411326 (owner: 10Madhuvishy)
[18:31:20] <wikibugs>	 (03CR) 10RobH: [C: 032] partman: Add recipe for dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/411326 (owner: 10Madhuvishy)
[18:32:09] <wikibugs>	 (03CR) 10Madhuvishy: [V: 032] partman: Add recipe for dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/411326 (owner: 10Madhuvishy)
[18:34:42] <hashar>	 !log upgraded zuul
[18:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:58] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: apt: apt-upgrade: cleanup report output [puppet] - 10https://gerrit.wikimedia.org/r/411330 (https://phabricator.wikimedia.org/T181647)
[18:38:23] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: apt: apt-upgrade: cleanup report output [puppet] - 10https://gerrit.wikimedia.org/r/411330 (https://phabricator.wikimedia.org/T181647)
[18:38:23] <icinga-wm>	 RECOVERY - Host labstore1006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[18:42:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: apt-upgrade: cleanup report output [puppet] - 10https://gerrit.wikimedia.org/r/411330 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez)
[18:51:03] <wikibugs>	 (03PS2) 10Chico Venancio: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315
[18:58:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "The author is still using your gmail address. Also, if we are including the Signed-off-by line, better place at the end of the commit mess" [puppet] - 10https://gerrit.wikimedia.org/r/411315 (owner: 10Chico Venancio)
[19:00:47] <wikibugs>	 10Operations, 10ops-eqiad: Rack/cable/configure asw2-b-eqiad switch stack - https://phabricator.wikimedia.org/T183585#3979544 (10ayounsi)
[19:15:12] <wikibugs>	 (03PS1) 10Chad: mw.org: remove old keys txt file from 2009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411364
[19:17:05] <wikibugs>	 (03PS1) 10Chad: mw.org: Symlink keys.html to index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411367
[19:17:19] <wikibugs>	 (03PS3) 10Chico Venancio: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315
[19:18:32] <wikibugs>	 (03PS1) 10Chad: Move mw.org docroot to mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411368
[19:20:34] <wikibugs>	 (03PS1) 10Chad: Turn wikimedia.org docroot into symlink to standard-docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411369
[19:23:22] <no_justification>	 Krinkle: 404.html is *only* used for the old secure.wm.o redirect vhost.....is there a pressing reason we couldn't use 404.php?
[19:50:36] <wikibugs>	 (03PS3) 10Krinkle: extract2: Set wiki context directly instead of MW_LANG indirection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410109
[19:50:48] <wikibugs>	 (03PS3) 10Krinkle: multiversion: Remove support for MW_LANG env override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410110
[19:50:51] <wikibugs>	 (03PS4) 10Krinkle: multiversion: Remove support for MW_LANG env override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/410110
[19:51:39] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@bdcc12b]: ocata branch with sidebar fix
[19:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:51] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@bdcc12b]: ocata branch with sidebar fix (duration: 03m 12s)
[19:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:10] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@1fdd122]: two more small fixes
[20:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:31] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@1fdd122]: two more small fixes (duration: 01m 21s)
[20:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:11] <wikibugs>	 (03CR) 10Krinkle: [C: 031] mw.org: remove old keys txt file from 2009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411364 (owner: 10Chad)
[20:39:40] <logmsgbot>	 !log andrew@tin Started deploy [horizon/deploy@efcba2b]: sudo dashboard update
[20:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:56] <logmsgbot>	 !log andrew@tin Finished deploy [horizon/deploy@efcba2b]: sudo dashboard update (duration: 01m 16s)
[20:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:52] <wikibugs>	 (03PS1) 10Chad: Gerrit: Also set read timeout [puppet] - 10https://gerrit.wikimedia.org/r/411394
[20:53:05] <wikibugs>	 (03PS2) 10Chad: Gerrit: Also set ldap read timeout [puppet] - 10https://gerrit.wikimedia.org/r/411394
[21:01:08] <wikibugs>	 10Operations, 10ops-codfw: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#3979829 (10Papaul) a:05Papaul>03RobH
[21:06:16] <wikibugs>	 (03PS1) 10Chad: Gerrit: Tweak SSH timeout settings and such [puppet] - 10https://gerrit.wikimedia.org/r/411397
[21:09:31] <wikibugs>	 (03CR) 10Paladox: [C: 031] Gerrit: Tweak SSH timeout settings and such [puppet] - 10https://gerrit.wikimedia.org/r/411397 (owner: 10Chad)
[21:10:37] <wikibugs>	 (03CR) 10Paladox: [C: 031] Gerrit: Also set ldap read timeout [puppet] - 10https://gerrit.wikimedia.org/r/411394 (owner: 10Chad)
[21:12:15] <wikibugs>	 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3979836 (10mmodell) If it's really as simple as importi...
[21:12:37] <hashar>	 !log Upgraded Zuul to https://gerrit.wikimedia.org/r/#/c/411322/3 | T187567
[21:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:53] <stashbot>	 T187567: CI is running against parent patches, not the patches themselves for chained patches - https://phabricator.wikimedia.org/T187567
[21:26:53] <wikibugs>	 (03PS4) 10Chico Venancio: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315
[21:29:50] <wikibugs>	 (03PS1) 10EBernhardson: Deploy libhdfs0 to hadoop nodes [puppet] - 10https://gerrit.wikimedia.org/r/411464
[21:32:53] <wikibugs>	 (03CR) 10EBernhardson: "This is semi-related to T187139. In that ticket copying files from hdfs to the local machine so a C++ library could read them triggered an" [puppet] - 10https://gerrit.wikimedia.org/r/411464 (owner: 10EBernhardson)
[21:35:16] <wikibugs>	 (03CR) 10Rush: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/411315 (owner: 10Chico Venancio)
[21:35:35] <icinga-wm>	 PROBLEM - Check systemd state on labpuppetmaster1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:40:09] <twentyafterfour>	 seeing quite a lot of log noise from job queue. Stuff like "29 buffered job(s) of type(s) JobSpecification, CdnPurgeJob never inserted"
[21:40:34] <twentyafterfour>	 I'm not sure if this is worrisome or not? 
[21:44:12] <no_justification>	 Not last time I checked.
[21:49:00] <wikibugs>	 10Operations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Wikipedia-iOS-App-Backlog, and 2 others: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#3979897 (10Dbrant) 05Open>03Invalid
[22:08:05] <icinga-wm>	 PROBLEM - Host scb2005 is DOWN: PING CRITICAL - Packet loss = 100%
[22:09:55] <icinga-wm>	 RECOVERY - Host scb2005 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms
[22:17:08] <wikibugs>	 (03PS1) 10Rush: openstack: neutron l3 and service for labtestn [puppet] - 10https://gerrit.wikimedia.org/r/411488 (https://phabricator.wikimedia.org/T167293)
[22:17:18] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs encapi: preposterous erb hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489
[22:17:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs encapi: preposterous erb hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489 (owner: 10Andrew Bogott)
[22:37:29] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs encapi: preposterous erb hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489
[22:38:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs encapi: preposterous erb hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489 (owner: 10Andrew Bogott)
[22:39:36] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489
[22:40:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489 (owner: 10Andrew Bogott)
[22:41:56] <wikibugs>	 (03PS2) 10Krinkle: Enable EtcdConfig on the debug hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[22:42:00] <wikibugs>	 (03CR) 10Krinkle: "Fixed phpcs violation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411296 (https://phabricator.wikimedia.org/T149617) (owner: 10Giuseppe Lavagetto)
[22:43:35] <wikibugs>	 (03PS4) 10Andrew Bogott: wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489
[22:44:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489 (owner: 10Andrew Bogott)
[22:45:05] <wikibugs>	 (03PS5) 10Andrew Bogott: wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489
[22:55:25] <Ivy>	 mutante: Is there anywhere we actually use HTTP 418? Or do we just have the HTTP 301 at coffee.wikimedia.org ?
[22:59:46] <wikibugs>	 (03PS6) 10Andrew Bogott: wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489
[23:04:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] wmcs encapi: preposterous hack to limit POSTs to horizon hosts [puppet] - 10https://gerrit.wikimedia.org/r/411489 (owner: 10Andrew Bogott)
[23:08:59] <Krinkle>	 Ivy: Used to, for a short time, for the "Old IE insecure SSL support" page 
[23:08:59] <Krinkle>	 https://github.com/wikimedia/puppet/commit/fb7eae473a44cdeac2c9188ac388fe7fbeabf4b4
[23:13:06] <Ivy>	 Hmmm.
[23:13:20] <Ivy>	 Oh well.
[23:24:29] <wikibugs>	 (03PS5) 10Chico Venancio: shinken: WMCS: use check_graphite_series sumSeries to reduce puppet failures false positves [puppet] - 10https://gerrit.wikimedia.org/r/411315
[23:36:36] <wikibugs>	 (03PS1) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520
[23:37:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520 (owner: 10Andrew Bogott)
[23:38:22] <wikibugs>	 (03PS2) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520
[23:43:11] <wikibugs>	 (03PS3) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520
[23:47:10] <wikibugs>	 (03CR) 10Andrew Bogott: "puppet diff can be found here:  https://puppet-compiler.wmflabs.org/compiler02/10011/labpuppetmaster1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/411520 (owner: 10Andrew Bogott)
[23:51:03] <wikibugs>	 (03PS4) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520 (https://phabricator.wikimedia.org/T187499)
[23:53:51] <wikibugs>	 (03PS5) 10Andrew Bogott: labspuppetbackend: rewrite of the read-only security layer [puppet] - 10https://gerrit.wikimedia.org/r/411520 (https://phabricator.wikimedia.org/T187499)