[00:00:04] <jouncebot>	 twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T0000).
[00:11:07] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3310890 (10thcipriani) >>! In T166888#3322963, @faidon wrote: > These are all conjectures of mine, just by looking at the log, so correct me when...
[00:17:59] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 60796.64 seconds
[00:18:00] <icinga-wm>	 PROBLEM - Check systemd state on cp2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:01] <icinga-wm>	 PROBLEM - salt-minion processes on cp2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:18:08] <icinga-wm>	 PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:08] <icinga-wm>	 PROBLEM - salt-minion processes on cp4012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:18:08] <icinga-wm>	 PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:08] <icinga-wm>	 PROBLEM - Host cp1059 is DOWN: PING CRITICAL - Packet loss = 100%
[00:18:08] <icinga-wm>	 PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100%
[00:18:18] <icinga-wm>	 PROBLEM - salt-minion processes on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:18:18] <icinga-wm>	 PROBLEM - Check systemd state on cp2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:18] <icinga-wm>	 PROBLEM - Check systemd state on cp3004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:18] <icinga-wm>	 PROBLEM - Check systemd state on cp1060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:19] <icinga-wm>	 PROBLEM - salt-minion processes on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:18:19] <icinga-wm>	 PROBLEM - salt-minion processes on cp3005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:18:19] <icinga-wm>	 PROBLEM - salt-minion processes on cp3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:18:20] <icinga-wm>	 PROBLEM - Check systemd state on cp3005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:18:20] <icinga-wm>	 PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100%
[00:18:22] <icinga-wm>	 PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100%
[00:19:14] <mutante>	 uhm.. what's happening with the salt minions
[00:20:11] <mutante>	 or did Icinga just forget downtimes again.. i guess that's it
[00:20:13] <mutante>	 looking
[00:21:52] <mutante>	 The Salt Master has rejected this minion's public key!
[00:24:19] <icinga-wm>	 RECOVERY - salt-minion processes on cp3005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:25:05] <mutante>	 !log salt-master: deleted salt-key for cp3005, stopped started minion cp3005 - key got accepted again (was: Salt Master has rejected this minion's public key)
[00:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:18] <icinga-wm>	 RECOVERY - salt-minion processes on cp2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:31:21] <mutante>	 !log cp2012 - fixed salt key issue as for cp3005 (delete key, stop/start minion, accept new key)
[00:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
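The re-key steps mutante logs above (delete key, stop/start minion, accept new key) can be sketched as a dry-run command list. This is a minimal sketch, not what was actually typed. The helper names `rekey_commands` and `render`, the minion ID `cp3005.example`, and the use of `systemctl restart` for the stop/start are all assumptions. `salt-key --delete`, `--accept` and `--yes` are standard salt-key options. Nothing is executed; the code only builds and prints the commands.

```python
import shlex


def rekey_commands(minion_id):
    """Build (where, argv) steps to re-key one Salt minion. Nothing is run."""
    return [
        # On the master: drop the stale/rejected key for this minion.
        ("master", ["salt-key", "--delete", minion_id, "--yes"]),
        # On the minion: restart so it presents its key to the master again
        # (assumption: the minion is managed by systemd).
        ("minion", ["systemctl", "restart", "salt-minion"]),
        # On the master: accept the newly presented key.
        ("master", ["salt-key", "--accept", minion_id, "--yes"]),
    ]


def render(steps):
    """Format the steps as shell-like lines for review before running them."""
    return [f"{where}$ {shlex.join(argv)}" for where, argv in steps]


if __name__ == "__main__":
    # "cp3005.example" is a hypothetical minion ID used for illustration.
    for line in render(rekey_commands("cp3005.example")):
        print(line)
```

Keeping the steps as data makes it easy to review or log them for a batch of hosts before anything touches the master.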
[00:34:09] <mutante>	 !log cp4020 - powercycling (host down, console sat at initramfs)
[00:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:38] <icinga-wm>	 RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 74.94 ms
[00:36:38] <icinga-wm>	 RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:39:18] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:19] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:38] <icinga-wm>	 PROBLEM - Check systemd state on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:38] <icinga-wm>	 PROBLEM - puppet last run on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:38] <icinga-wm>	 PROBLEM - DPKG on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:38] <icinga-wm>	 PROBLEM - configured eth on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:38] <icinga-wm>	 PROBLEM - MD RAID on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:38] <icinga-wm>	 PROBLEM - dhclient process on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:39] <icinga-wm>	 PROBLEM - salt-minion processes on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:39] <icinga-wm>	 PROBLEM - Disk space on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:39:40] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:40:38] <icinga-wm>	 RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 74.00 ms
[00:41:08] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp4020 is OK: OK: nf_conntrack is 0 % full
[00:41:28] <icinga-wm>	 RECOVERY - Disk space on cp4020 is OK: DISK OK
[00:41:28] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp4020 is OK: OK ferm input default policy is set
[00:41:28] <icinga-wm>	 RECOVERY - DPKG on cp4020 is OK: All packages OK
[00:41:28] <icinga-wm>	 RECOVERY - configured eth on cp4020 is OK: OK - interfaces up
[00:41:28] <icinga-wm>	 RECOVERY - dhclient process on cp4020 is OK: PROCS OK: 0 processes with command name dhclient
[00:41:29] <icinga-wm>	 RECOVERY - MD RAID on cp4020 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[00:41:58] <mutante>	 !log cp4011 - like cp4010 - powercycling (host down, console sat at initramfs). it had the "did not detect disk by uid" issue but boots normal after powercycle
[00:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:42:28] <icinga-wm>	 PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX]
[00:43:38] <icinga-wm>	 PROBLEM - dhclient process on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:38] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:38] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:38] <icinga-wm>	 PROBLEM - Disk space on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:38] <icinga-wm>	 PROBLEM - Check systemd state on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:39] <icinga-wm>	 PROBLEM - MD RAID on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:39] <icinga-wm>	 PROBLEM - configured eth on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:40] <icinga-wm>	 PROBLEM - DPKG on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:40] <icinga-wm>	 PROBLEM - salt-minion processes on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:43:41] <icinga-wm>	 PROBLEM - puppet last run on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:44:08] <mutante>	 ok, these are failed reinstalls
[00:44:17] <mutante>	 from "ex-cache_maps, to be decommed"
[00:44:31] <mutante>	 SAL "reimaging ex-cache_maps hosts"
[00:45:07] <mutante>	 so not critical and i'll make it recover anyways
[00:45:19] <mutante>	 they are role spare now
[00:45:28] <icinga-wm>	 RECOVERY - Disk space on cp4011 is OK: DISK OK
[00:45:29] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp4011 is OK: OK ferm input default policy is set
[00:45:29] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp4011 is OK: OK: nf_conntrack is 0 % full
[00:45:29] <icinga-wm>	 RECOVERY - dhclient process on cp4011 is OK: PROCS OK: 0 processes with command name dhclient
[00:45:29] <icinga-wm>	 RECOVERY - Check systemd state on cp4011 is OK: OK - running: The system is fully operational
[00:45:29] <icinga-wm>	 RECOVERY - configured eth on cp4011 is OK: OK - interfaces up
[00:45:29] <icinga-wm>	 RECOVERY - salt-minion processes on cp4011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:45:30] <icinga-wm>	 RECOVERY - DPKG on cp4011 is OK: All packages OK
[00:45:30] <icinga-wm>	 RECOVERY - MD RAID on cp4011 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[00:46:28] <icinga-wm>	 PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX]
[00:47:28] <mutante>	 !log cp1059 - same thing - powercycle after failed boot after reimaging script
[00:47:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:38] <icinga-wm>	 RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 0%, RTA = 39.37 ms
[00:51:08] <icinga-wm>	 RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational
[00:52:48] <icinga-wm>	 ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T150256#3323647
[00:53:08] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp1059 is CRITICAL: Return code of 255 is out of bounds
[00:53:29] <icinga-wm>	 PROBLEM - Check systemd state on cp1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:53:29] <icinga-wm>	 PROBLEM - salt-minion processes on cp1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:54:09] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp1059 is OK: OK ferm input default policy is set
[00:54:09] <icinga-wm>	 PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:54:30] <icinga-wm>	 PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 53 seconds ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX]
[00:54:39] <icinga-wm>	 RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 73.83 ms
[00:54:41] <mutante>	 !log cp4019 - powercycled (same as others) | lvs1007 - sits at installer - waiting for IP to be configured (T150256)
[00:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:51] <stashbot>	 T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256
[00:57:25] <wikibugs>	 (03PS5) 10Jforrester: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731
[00:57:39] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:40] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:40] <icinga-wm>	 PROBLEM - Check systemd state on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:40] <icinga-wm>	 PROBLEM - MD RAID on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:40] <icinga-wm>	 PROBLEM - DPKG on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:40] <icinga-wm>	 PROBLEM - Disk space on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:40] <icinga-wm>	 PROBLEM - salt-minion processes on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:41] <icinga-wm>	 PROBLEM - configured eth on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:41] <icinga-wm>	 PROBLEM - dhclient process on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:42] <icinga-wm>	 PROBLEM - puppet last run on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:30] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp4019 is OK: OK ferm input default policy is set
[00:59:30] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp4019 is OK: OK: nf_conntrack is 0 % full
[00:59:30] <icinga-wm>	 RECOVERY - configured eth on cp4019 is OK: OK - interfaces up
[00:59:30] <icinga-wm>	 RECOVERY - Disk space on cp4019 is OK: DISK OK
[00:59:30] <icinga-wm>	 RECOVERY - dhclient process on cp4019 is OK: PROCS OK: 0 processes with command name dhclient
[00:59:30] <icinga-wm>	 RECOVERY - DPKG on cp4019 is OK: All packages OK
[00:59:30] <icinga-wm>	 RECOVERY - Check systemd state on cp4019 is OK: OK - running: The system is fully operational
[00:59:31] <icinga-wm>	 RECOVERY - MD RAID on cp4019 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[01:00:30] <icinga-wm>	 RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[01:00:30] <icinga-wm>	 PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 58 seconds ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX]
[01:01:54] <wikibugs>	 (03CR) 10Jforrester: [C: 031] Remove semanticness from another place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: 10MaxSem)
[01:02:19] <icinga-wm>	 RECOVERY - salt-minion processes on cp4012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:02:21] <icinga-wm>	 RECOVERY - salt-minion processes on cp1060 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:02:22] <wikibugs>	 (03CR) 10Jforrester: [C: 031] phpunit: replace deprecated strict=true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356349 (owner: 10Hashar)
[01:05:39] <icinga-wm>	 RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[01:06:09] <icinga-wm>	 RECOVERY - salt-minion processes on cp2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:09:09] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cp4020 is OK: OK: synced at Thu 2017-06-08 01:09:05 UTC.
[01:10:39] <icinga-wm>	 RECOVERY - salt-minion processes on cp4019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:11:19] <mutante>	 !Log cp1060, cp2003, cp4012, cp4019, cp4020 - delete old salt-keys, accept new salt-keys, restart minion 
[01:11:29] <icinga-wm>	 RECOVERY - salt-minion processes on cp4020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:11:39] <icinga-wm>	 RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[01:20:39] <icinga-wm>	 RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[01:22:09] <icinga-wm>	 RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational
[01:35:21] <wikibugs>	 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3331054 (10Dzahn)
[01:36:22] <wikibugs>	 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3331043 (10Dzahn) >'wikiadmin'@'10.64.32.13' (using password: YES) (208.80.153.14)  10.64.32.13 = terbium 208.80.153.14 = **labtestweb2001.wikimedia.org**  ^ lab...
[01:37:18] <logmsgbot>	 !log maxsem@tin Synchronized php-1.30.0-wmf.2/extensions/GeoData/includes/Searcher.php: Livehack to stop exceptions (duration: 00m 46s)
[01:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:25] <mutante>	 !log manually running mediawiki maintenance job "echo_mail_batch" (on terbium as www-data, just like cron). did _NOT_ get denied by DB (T167373)
[01:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:34] <stashbot>	 T167373: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373
[01:50:22] <wikibugs>	 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3331062 (10Dzahn) ^ can't reproduce it when manually running it (as the same user from the same host), but ... the emails have been arriving (almost) every day s...
[02:21:11] <icinga-wm>	 RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational
[02:24:11] <icinga-wm>	 PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:34:56] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 08m 41s)
[02:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:47] <twentyafterfour>	 !log deploying hotfix for T166958
[02:40:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:57] <stashbot>	 T166958: Unhandled exception on viewing T14974 - https://phabricator.wikimedia.org/T166958
[02:50:02] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 05m 07s)
[02:50:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:05] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3331087 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3006.esams.wmnet', 'cp1046.eqiad.wmnet', 'cp1047.e...
[02:55:21] <icinga-wm>	 PROBLEM - salt-minion processes on cp1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:55:21] <icinga-wm>	 PROBLEM - salt-minion processes on cp1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:56:27] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun  8 02:56:27 UTC 2017 (duration 6m 26s)
[02:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:01] <icinga-wm>	 PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[03:01:01] <icinga-wm>	 PROBLEM - puppet last run on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:01:31] <icinga-wm>	 RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 120.11 ms
[03:02:11] <icinga-wm>	 PROBLEM - puppet last run on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:02:11] <icinga-wm>	 PROBLEM - puppet last run on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:02:41] <icinga-wm>	 PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[03:03:41] <icinga-wm>	 PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[03:03:47] <bblack>	 ignore all of this, apparently wmf-auto-reimage doesn't downtime when other steps fail along the way :P
[03:03:51] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:51] <icinga-wm>	 PROBLEM - DPKG on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:51] <icinga-wm>	 PROBLEM - Disk space on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:51] <icinga-wm>	 PROBLEM - salt-minion processes on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:51] <icinga-wm>	 PROBLEM - dhclient process on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:52] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:53] <icinga-wm>	 PROBLEM - puppet last run on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:53] <icinga-wm>	 PROBLEM - Check systemd state on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:53] <icinga-wm>	 PROBLEM - configured eth on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:54] <icinga-wm>	 PROBLEM - MD RAID on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:03:54] <icinga-wm>	 PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100%
[03:05:11] <icinga-wm>	 RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[03:06:11] <icinga-wm>	 RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
[03:06:41] <icinga-wm>	 RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms
[03:07:01] <icinga-wm>	 PROBLEM - IPMI Temperature on cp3006 is CRITICAL: Return code of 255 is out of bounds
[03:07:21] <icinga-wm>	 PROBLEM - configured eth on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:07:21] <icinga-wm>	 PROBLEM - dhclient process on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:07:22] <icinga-wm>	 PROBLEM - DPKG on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:07:31] <icinga-wm>	 PROBLEM - Disk space on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:07:41] <icinga-wm>	 PROBLEM - Check systemd state on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:07:51] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:08:01] <icinga-wm>	 PROBLEM - MD RAID on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:08:11] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp1047 is CRITICAL: Return code of 255 is out of bounds
[03:08:11] <icinga-wm>	 PROBLEM - dhclient process on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:21] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:31] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:31] <icinga-wm>	 PROBLEM - DPKG on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:31] <icinga-wm>	 PROBLEM - salt-minion processes on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:41] <icinga-wm>	 PROBLEM - MD RAID on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:41] <icinga-wm>	 PROBLEM - configured eth on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:08:52] <icinga-wm>	 PROBLEM - Check systemd state on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:08:52] <icinga-wm>	 PROBLEM - dhclient process on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:01] <icinga-wm>	 PROBLEM - Disk space on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:09:01] <icinga-wm>	 PROBLEM - Check systemd state on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:09:02] <icinga-wm>	 PROBLEM - MD RAID on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:21] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:21] <icinga-wm>	 PROBLEM - configured eth on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:21] <icinga-wm>	 PROBLEM - DPKG on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:31] <icinga-wm>	 PROBLEM - Check size of conntrack table on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:31] <icinga-wm>	 PROBLEM - Disk space on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:32] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cp1046 is CRITICAL: Return code of 255 is out of bounds
[03:09:41] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.075 second response time
[03:09:41] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time
[03:10:01] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[03:10:04] <bblack>	 mw1261 isn't part of my spam
[03:10:41] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.118 second response time
[03:10:42] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.185 second response time
[03:11:01] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 74864 bytes in 0.234 second response time
[03:12:41] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on cp2009 is CRITICAL: Return code of 255 is out of bounds
[03:16:41] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.153 second response time
[03:16:51] <icinga-wm>	 PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[03:17:11] <icinga-wm>	 PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[03:17:31] <icinga-wm>	 RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 119.88 ms
[03:17:41] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time
[03:18:21] <icinga-wm>	 RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms
[03:18:51] <icinga-wm>	 PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100%
[03:21:11] <icinga-wm>	 RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms
[03:22:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp3006 is OK: OK ferm input default policy is set
[03:22:51] <icinga-wm>	 RECOVERY - dhclient process on cp3006 is OK: PROCS OK: 0 processes with command name dhclient
[03:22:51] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp3006 is OK: OK: nf_conntrack is 0 % full
[03:22:51] <icinga-wm>	 RECOVERY - Disk space on cp3006 is OK: DISK OK
[03:22:51] <icinga-wm>	 RECOVERY - salt-minion processes on cp3006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:22:52] <icinga-wm>	 RECOVERY - DPKG on cp3006 is OK: All packages OK
[03:22:52] <icinga-wm>	 RECOVERY - configured eth on cp3006 is OK: OK - interfaces up
[03:22:53] <icinga-wm>	 RECOVERY - MD RAID on cp3006 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[03:24:21] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp1046 is OK: OK ferm input default policy is set
[03:24:21] <icinga-wm>	 RECOVERY - DPKG on cp1046 is OK: All packages OK
[03:24:21] <icinga-wm>	 RECOVERY - configured eth on cp1046 is OK: OK - interfaces up
[03:24:22] <icinga-wm>	 RECOVERY - salt-minion processes on cp1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:24:32] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp1046 is OK: OK: nf_conntrack is 0 % full
[03:24:32] <icinga-wm>	 RECOVERY - Disk space on cp1046 is OK: DISK OK
[03:25:01] <icinga-wm>	 RECOVERY - dhclient process on cp1046 is OK: PROCS OK: 0 processes with command name dhclient
[03:25:01] <icinga-wm>	 RECOVERY - Check systemd state on cp1046 is OK: OK - running: The system is fully operational
[03:25:02] <icinga-wm>	 RECOVERY - MD RAID on cp1046 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[03:25:02] <icinga-wm>	 RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[03:25:21] <icinga-wm>	 RECOVERY - salt-minion processes on cp1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:25:22] <icinga-wm>	 RECOVERY - DPKG on cp1047 is OK: All packages OK
[03:25:22] <icinga-wm>	 RECOVERY - dhclient process on cp1047 is OK: PROCS OK: 0 processes with command name dhclient
[03:25:22] <icinga-wm>	 RECOVERY - configured eth on cp1047 is OK: OK - interfaces up
[03:25:31] <icinga-wm>	 RECOVERY - Disk space on cp1047 is OK: DISK OK
[03:25:41] <icinga-wm>	 RECOVERY - Check systemd state on cp1047 is OK: OK - running: The system is fully operational
[03:25:51] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp1047 is OK: OK ferm input default policy is set
[03:26:01] <icinga-wm>	 RECOVERY - Disk space on cp2009 is OK: DISK OK
[03:26:01] <icinga-wm>	 RECOVERY - MD RAID on cp1047 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[03:26:02] <icinga-wm>	 RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:26:02] <icinga-wm>	 RECOVERY - Check systemd state on cp2009 is OK: OK - running: The system is fully operational
[03:26:11] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp1047 is OK: OK: nf_conntrack is 0 % full
[03:26:11] <icinga-wm>	 RECOVERY - dhclient process on cp2009 is OK: PROCS OK: 0 processes with command name dhclient
[03:26:11] <icinga-wm>	 RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[03:26:21] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on cp2009 is OK: OK ferm input default policy is set
[03:26:31] <icinga-wm>	 RECOVERY - Check size of conntrack table on cp2009 is OK: OK: nf_conntrack is 0 % full
[03:26:31] <icinga-wm>	 RECOVERY - DPKG on cp2009 is OK: All packages OK
[03:26:41] <icinga-wm>	 RECOVERY - MD RAID on cp2009 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[03:26:41] <icinga-wm>	 RECOVERY - configured eth on cp2009 is OK: OK - interfaces up
[03:30:31] <icinga-wm>	 RECOVERY - salt-minion processes on cp2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:33:01] <icinga-wm>	 RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[03:37:01] <icinga-wm>	 RECOVERY - IPMI Temperature on cp3006 is OK: Sensor Type(s) Temperature Status: OK
[03:37:41] <icinga-wm>	 RECOVERY - salt-minion processes on cp1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:38:51] <icinga-wm>	 PROBLEM - HHVM rendering on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[03:38:52] <icinga-wm>	 PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time
[03:39:01] <icinga-wm>	 RECOVERY - salt-minion processes on cp3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:39:31] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cp1046 is OK: OK: synced at Thu 2017-06-08 03:39:30 UTC.
[03:39:51] <icinga-wm>	 RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 74866 bytes in 1.121 second response time
[03:39:52] <icinga-wm>	 RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.102 second response time
[03:42:01] <icinga-wm>	 RECOVERY - Check systemd state on cp3004 is OK: OK - running: The system is fully operational
[03:42:41] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on cp2009 is OK: OK: synced at Thu 2017-06-08 03:42:33 UTC.
[03:43:01] <icinga-wm>	 RECOVERY - Check systemd state on cp3006 is OK: OK - running: The system is fully operational
[03:44:01] <icinga-wm>	 RECOVERY - Check systemd state on cp3005 is OK: OK - running: The system is fully operational
[03:44:41] <icinga-wm>	 RECOVERY - Check systemd state on cp1059 is OK: OK - running: The system is fully operational
[03:44:51] <icinga-wm>	 RECOVERY - Check systemd state on cp4020 is OK: OK - running: The system is fully operational
[03:45:31] <icinga-wm>	 RECOVERY - Check systemd state on cp1060 is OK: OK - running: The system is fully operational
[03:46:21] <icinga-wm>	 RECOVERY - Check systemd state on cp2003 is OK: OK - running: The system is fully operational
[03:47:31] <icinga-wm>	 RECOVERY - Check systemd state on cp2021 is OK: OK - running: The system is fully operational
[03:51:46] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3331125 (10BBlack) 05Open>03Resolved a:03BBlack
[03:55:36] <wikibugs>	 10Operations, 10ops-esams, 10hardware-requests: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376#3331127 (10BBlack)
[03:56:55] <wikibugs>	 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp4011, cp4012, cp4019, cp4020 - https://phabricator.wikimedia.org/T167377#3331138 (10BBlack)
[03:58:26] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3331150 (10BBlack) 05Open>03declined cp3003 is decomming for good in T167376
[04:57:06] <wikibugs>	 10Operations, 10Labs: virbr0 interface present in some virt hosts - https://phabricator.wikimedia.org/T83732#917870 (10bd808) Is this still an issue that needs to be fixed?
[04:59:51] <wikibugs>	 10Operations, 10Labs, 10wikitech.wikimedia.org: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#3331225 (10bd808) @EBernhardson is this bug report still valid or just ancient cruft?
[05:03:37] <wikibugs>	 10Operations, 10Labs, 10wikitech.wikimedia.org: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#3331228 (10EBernhardson) 05Open>03Resolved a:03EBernhardson It looks like everything in deployment-prep is using either one or two replicas. Should be fine.
[05:21:11] <icinga-wm>	 PROBLEM - Host elastic1035 is DOWN: PING CRITICAL - Packet loss = 100%
[05:21:41] <icinga-wm>	 RECOVERY - Host elastic1035 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[05:29:31] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3331237 (10Marostegui) Sorry @MarcoAurelio that was almost 10:30pm our time :-(
[05:40:22] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756
[05:40:28] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756
[05:41:43] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 (owner: 10Marostegui)
[05:42:39] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 (owner: 10Marostegui)
[05:42:52] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 (owner: 10Marostegui)
[05:43:39] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 - T166205 (duration: 00m 45s)
[05:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:49] <stashbot>	 T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[05:49:45] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205)
[05:50:52] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui)
[05:52:01] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui)
[05:52:14] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui)
[05:52:59] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 - T166205 (duration: 00m 43s)
[05:53:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:09] <stashbot>	 T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205
[05:54:24] <marostegui>	 !log Deploy alter table s2 - db1074 - T166205
[05:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:34] <_joe_>	 !log uploading new service-checker version to reprepro, T167048
[05:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:43] <stashbot>	 T167048: Services need external monitoring - https://phabricator.wikimedia.org/T167048
[06:07:34] <wikibugs>	 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3331247 (10Joe) @faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppettization is mostly the same), but I t...
[06:15:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema)
[07:00:05] <marostegui>	 !log Drop table updates on s6 - T139342
[07:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:15] <stashbot>	 T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342
[07:38:39] <wikibugs>	 (03PS1) 10Elukey: Remove webrequest_maps topic from Camus configuration [puppet] - 10https://gerrit.wikimedia.org/r/357768
[07:45:35] <wikibugs>	 (03CR) 10Elukey: "Changes looks good: https://puppet-compiler.wmflabs.org/6692/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/357768 (owner: 10Elukey)
[08:02:44] <wikibugs>	 (03CR) 10Joal: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/357768 (owner: 10Elukey)
[08:03:20] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3331391 (10elukey) Based on several guides like http://download.intel.com/support/motherboards/server/sb/configuring_raid_for_opti...
[08:04:45] <wikibugs>	 (03CR) 10Elukey: [C: 032] Remove webrequest_maps topic from Camus configuration [puppet] - 10https://gerrit.wikimedia.org/r/357768 (owner: 10Elukey)
[08:17:10] <TabbyCat>	 jynus / marostegui / legoktm -- Available for supervising T166028 ?
[08:17:10] <stashbot>	 T166028: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028
[08:18:43] <marostegui>	 TabbyCat: I am around yes
[08:19:03] <TabbyCat>	 if you think it is safe to proceed marostegui I can do it
[08:19:22] <marostegui>	 TabbyCat: Yes, give me a sec to open up some extra tabs
[08:19:27] <TabbyCat>	 sure thing
[08:19:33] <TabbyCat>	 logstash, fatalerror, etc
[08:19:57] <TabbyCat>	 give me a shout when you're ready :)
[08:20:05] <marostegui>	 haha will do :)
[08:21:17] <marostegui>	 TabbyCat: Listo/ready to check out the dbs!
[08:21:32] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3331397 (10MarcoAurelio) Being handled in a minute.
[08:21:49] <wikibugs>	 (03PS5) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583)
[08:21:59] <TabbyCat>	 !log Starting big global rename as requested in T166028
[08:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:24] <TabbyCat>	 marostegui: can I start it now then?
[08:22:31] <marostegui>	 TabbyCat: go for it
[08:22:36] <TabbyCat>	 oki
[08:24:39] <TabbyCat>	 Jobs to rename Mlpearc to FlightTime have been queued on .
[08:29:20] <TabbyCat>	 marostegui: how is it going? :) globalrenameprogress is not failing for me :)
[08:29:42] <marostegui>	 TabbyCat: so far I see no issues
[08:29:49] <TabbyCat>	 it is doing enwiki right now
[08:29:57] <TabbyCat>	 the wiki with the most edits
[08:31:09] <TabbyCat>	 enwiki is done
[08:31:26] <marostegui>	 I am seeing some lag
[08:31:29] <marostegui>	 On enwiki
[08:31:34] <wikibugs>	 (03PS3) 10Ema: check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205)
[08:31:39] <marostegui>	 But not too worrying so far
[08:31:48] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema)
[08:32:34] <marostegui>	 Only one host lagging now
[08:32:47] <marostegui>	 one of the recentchanges servers (db1051)
[08:33:12] <marostegui>	 gone now
[08:33:31] <TabbyCat>	 I've got to fix some page moves that didn't happen on enwiki
[08:33:40] <TabbyCat>	 but after the rename finishes
[08:34:05] <marostegui>	 the rc hosts in codfw are lagging behind (but that doesn't impact anything as codfw isn't active)
[08:34:42] <wikibugs>	 (03PS1) 10Ema: base::kernel: create /etc/modules-load.d [puppet] - 10https://gerrit.wikimedia.org/r/357772
[08:34:51] <icinga-wm>	 PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:51] <icinga-wm>	 PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:51] <icinga-wm>	 PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:51] <icinga-wm>	 PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:51] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:51] <icinga-wm>	 PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:52] <icinga-wm>	 PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:52] <icinga-wm>	 PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:53] <icinga-wm>	 PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:53] <icinga-wm>	 PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:59] <TabbyCat>	 o_O
[08:35:00] <ema>	 moritzm, godog: we need https://gerrit.wikimedia.org/r/357772 to fix the puppet fails :(
[08:35:04] <icinga-wm>	 PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:11] <icinga-wm>	 PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:11] <icinga-wm>	 PROBLEM - puppet last run on elastic2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:11] <icinga-wm>	 PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:11] <icinga-wm>	 PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:11] <icinga-wm>	 PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:12] <icinga-wm>	 PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:12] <icinga-wm>	 PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:13] <icinga-wm>	 PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:13] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:14] <icinga-wm>	 PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:14] <icinga-wm>	 PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:15] <icinga-wm>	 PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:15] <icinga-wm>	 PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:21] <icinga-wm>	 PROBLEM - puppet last run on elastic2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:29] <godog>	 ema: I thought it was in base already?
[08:35:31] <icinga-wm>	 PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:31] <icinga-wm>	 PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:31] <icinga-wm>	 PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:32] <icinga-wm>	 PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:32] <icinga-wm>	 PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:35] <ema>	 godog: only for trusty
[08:35:41] <icinga-wm>	 PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:41] <icinga-wm>	 PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:42] <icinga-wm>	 PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:48] <godog>	 I'll shut ircecho
[08:35:51] <icinga-wm>	 PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:51] <icinga-wm>	 PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:51] <icinga-wm>	 PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:51] <icinga-wm>	 PROBLEM - puppet last run on ms-fe1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:51] <icinga-wm>	 PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:51] <icinga-wm>	 PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:52] <ema>	 thanks
[08:35:52] <icinga-wm>	 PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:52] <icinga-wm>	 PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:53] <icinga-wm>	 PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:53] <icinga-wm>	 PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:54] <icinga-wm>	 PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:54] <icinga-wm>	 PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:55] <icinga-wm>	 PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:55] <icinga-wm>	 PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:36:05] <jynus>	 I'm going to set db1051 as non-transactional writes
[08:36:22] <godog>	 !log temporarily stop ircecho on tegmen, puppet spam
[08:36:24] <marostegui>	 it is fine now (hanging around 0-10 seconds)
[08:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:33] <elukey>	 thanks godog, was about to ask
[08:37:14] <ema>	 godog: so basically https://gerrit.wikimedia.org/r/#/c/357617/3/modules/base/manifests/monitoring/host.pp has a require on File[/etc/modules-load.d], which is not defined on jessie because of the trusty conditional here https://gerrit.wikimedia.org/r/#/c/357591/2/modules/base/manifests/kernel.pp
[08:37:40] <godog>	 ah ok, makes sense
[08:37:59] <moritzm>	 ah, right. it's shipped on the dpkg level via systemd, but not on the puppet level
[08:38:03] <ema>	 yup
[08:38:34] <marostegui>	 TabbyCat: did it finish enwiki already? Lag is totally gone now
[08:39:59] <godog>	 ema: can you adjust the comment stating that it can be removed once trusty is gone? other than that +1
[08:40:10] <ema>	 godog: sure thing
[08:40:10] <TabbyCat>	 marostegui: yep, some minutes ago
[08:40:23] <TabbyCat>	 marostegui: got to check the page moves later
[08:40:26] <godog>	 ema: actually meh it'll fail the same way when we do that
[08:40:35] <marostegui>	 cool, which wiki is it at now?
[08:40:55] <ema>	 godog: unless we remove the unnecessary (after trusty is gone) require in host.pp 
[08:41:13] <TabbyCat>	 marostegui: mrwiki
[08:41:21] <TabbyCat>	 https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress?username=FlightTime
[08:41:29] <godog>	 ema: yeah I'd say remove the require instead but leave the if there
[08:41:44] <marostegui>	 TabbyCat: that is useful, thanks!
[08:41:51] <godog>	 it'll be racy maybe on trusty for a while, but we don't care that much
[08:43:24] <ema>	 godog: ok
[08:45:19] <wikibugs>	 (03PS1) 10Ema: check_ipmi_temp: do not require /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/357774
[08:45:51] <ema>	 godog: ^
[08:46:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] check_ipmi_temp: do not require /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/357774 (owner: 10Ema)
[08:47:03] <wikibugs>	 (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: do not require /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/357774 (owner: 10Ema)
[08:48:05] <godog>	 ema: btw for recovery I think we can test run-puppet-agent --failed-only !
[08:48:16] <ema>	 godog: nice one, yeah
[08:48:24] <TabbyCat>	 5 wikis to go marostegui 
[08:48:32] <TabbyCat>	 4
[08:48:38] <ema>	 godog: confirmed that the puppetfail is fixed on cp1008
[08:48:39] <marostegui>	 hehe yeah, I see
[08:49:20] <TabbyCat>	 finished :D
[08:49:24] <marostegui>	 \o/
[08:49:29] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3286412 (10JAllemandou) Looks good to me (even if I don't understand in depth what it means). I particularly like the idea of havi...
[08:49:40] <TabbyCat>	 at least on that special page I mentioned, not sure from your part
[08:50:02] <ema>	 godog: I'm gonna run-puppet-agent --failed-only across the fleet then
[08:50:03] <marostegui>	 yes, everything looks fine 
[08:50:38] <TabbyCat>	 !log Rename user "Mlpearc" to "FlightTime" on Central Auth is now finished (T166028)
[08:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:46] <stashbot>	 T166028: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028
[08:51:48] <moritzm>	 ema, godog: but we still need a followup fix of some sort, otherwise it's not ensured on trusty hosts that /etc/modules-load.d/ is created before ipmi.conf gets created?
[08:52:03] <wikibugs>	 (03Abandoned) 10Ema: base::kernel: create /etc/modules-load.d [puppet] - 10https://gerrit.wikimedia.org/r/357772 (owner: 10Ema)
[08:53:37] <wikibugs>	 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3331434 (10MarcoAurelio) 05Open>03Resolved p:05Triage>03Normal a:03MarcoAurelio Global rename is now done. Thanks to @Marostegui for his help during the...
[08:53:47] <ema>	 moritzm: can we conditionally set the require in puppet?
[08:55:14] <godog>	 moritzm: yeah we could, hopefully though we don't make any new trusty reinstall :D
[08:55:27] <godog>	 also it'll converge on the second puppet run if it fails the first on trusty
[08:57:51] <moritzm>	 ema: there might be a hack, but not sure
[08:58:17] <moritzm>	 godog: yeah, it's probably not a big deal, we can also choose to ignore it
[08:58:23] <godog>	 !log swift eqiad-prod eqiad-prod: decom ms-be1005/6/7 - T166489
[08:58:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:31] <stashbot>	 T166489: Decommission ms-be1001 - ms-be1012  - https://phabricator.wikimedia.org/T166489
[08:58:54] <wikibugs>	 (03PS1) 10Elukey: Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775
[08:59:21] <ema>	 so cumin seems to be ignoring -p?
[08:59:32] <ema>	 cumin -p 0 -b 12 bla bla
[08:59:42] <ema>	 7.3% (93/1272) success ratio (< 100.0% threshold)
[08:59:47] <ema>	 volans: ^
[09:01:41] <godog>	 ema: LMK when it is ok to re-enable puppet/ircecho on tegmen btw
[09:02:12] <wikibugs>	 (03PS2) 10Elukey: Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775
[09:02:52] <ema>	 godog: yeah currently struggling with cumin, most likely PEBKAC
[09:03:20] <volans>	 ema: looking
[09:03:38] <wikibugs>	 (03CR) 10Elukey: "I think that this change might not be needed since the client role is only included in the server one afaics, but I'll merge anyway to kee" [puppet] - 10https://gerrit.wikimedia.org/r/357775 (owner: 10Elukey)
[09:03:51] <volans>	 what is the exact problem? the -p decides if it aborts or not, but still shows the results at the end
[09:04:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775 (owner: 10Elukey)
[09:04:33] <ema>	 volans: I'm trying to run-puppet-agent --failed-only across the fleet
[09:04:47] <ema>	 I don't care about the exit status of puppet-agent of course
[09:04:52] <wikibugs>	 (03CR) 10Elukey: [C: 032] Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775 (owner: 10Elukey)
[09:05:02] <ema>	 volans: sudo cumin -d --success-percentage 0 -b 12 '*' 'run-puppet-agent --failed-only || true'
[09:05:11] <volans>	 no need for the || true
[09:05:48] <ema>	 well at any rate this fails with
[09:05:49] <ema>	 11.6% (148/1272) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting
[09:05:59] <elukey>	 !log upgrade zookeeper packages to 3.4.5+dfsg-2+deb8u2 on conf100[123], conf200[23] and druid100[123]
[09:06:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:27] <ema>	 volans: is there a way to let it just continue regardless of the number of fails?
[09:07:01] <ema>	 (I thought that's what --success-percentage 0 was for)
[09:07:31] <volans>	 ema: use 1 for now I'm looking at a possible bug, and you don't need the || true
[09:07:48] <ema>	 volans: ack, trying -p 1
[09:08:28] <volans>	 also the real "failure" will probably be under a few percent so like -p 95 because only the unreachable hosts should fail, all the others should skip or run puppet (and then maybe fail, but hopefully not if it is fixed)
[09:08:51] <ema>	 yup
[09:09:24] <wikibugs>	 (03CR) 10Hashar: [C: 031] Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 (owner: 10Faidon Liambotis)
[09:10:31] <ema>	 volans: also I don't think I got any different output by using -d
[09:10:45] <volans>	 ema: no that's debug level in the logs
[09:10:50] <volans>	 you got a loooot more there ;)
[09:10:57] <volans>	  /var/log/cumin/cumin.log
[09:10:57] <ema>	 ah fair enough :)
[09:11:11] <volans>	 -d, --debug           Set log level to DEBUG.
[09:11:13] <volans>	 :-P
[09:12:48] <wikibugs>	 (03CR) 10Hashar: [C: 031] check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis)
[09:12:58] <ema>	 volans: there's definitely a bug there, -p 1 works while -p 0 doesn't
[09:13:13] <volans>	 yep! I think I've found it... debugging ;)
[09:13:17] <ema>	 thanks!
[09:13:42] <volans>	 thank you for reporting it ;) and sorry for the trouble
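A minimal sketch of one common way a "0 is ignored" bug like the one above can arise. This is an assumption for illustration only: the log does not show cumin's actual code, and the function names here are hypothetical. It does match the symptom reported above, where `-p 0` still printed a 100.0% threshold while `-p 1` worked.

```python
def ratio_ok_buggy(succeeded, total, threshold=None):
    """Hypothetical check: `or` treats 0.0 as "not set" and falls back to 100%."""
    effective = threshold or 1.0  # bug pattern: 0.0 is falsy, so it becomes 1.0
    return succeeded / total >= effective


def ratio_ok_fixed(succeeded, total, threshold=None):
    """Hypothetical fix: only fall back to 100% when no threshold was given."""
    effective = 1.0 if threshold is None else threshold
    return succeeded / total >= effective


if __name__ == "__main__":
    # Figures taken from the log: 148 of 1272 hosts succeeded.
    print(ratio_ok_buggy(148, 1272, 0.0))   # threshold 0 is silently replaced
    print(ratio_ok_buggy(148, 1272, 0.01))  # threshold 1% is honoured
    print(ratio_ok_fixed(148, 1272, 0.0))   # threshold 0 is honoured
```

Comparing against `None` explicitly keeps a legitimate zero threshold distinct from an omitted one.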
[09:15:50] <wikibugs>	 (03PS4) 10Elukey: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 (owner: 10Hashar)
[09:17:16] <wikibugs>	 (03CR) 10Elukey: [C: 032] beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 (owner: 10Hashar)
[09:17:32] <wikibugs>	 (03CR) 10Hashar: [C: 031] check_puppetrun: fix rubocop warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis)
[09:18:13] <wikibugs>	 (03CR) 10Hashar: [C: 031] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis)
[09:18:43] <elukey>	 hashar: thanks for the patch! Are you going to remove the cherry pick on the depl-prep puppet master or do you want me to do it?
[09:18:48] <ema>	 godog: done, feel free to re-enable ircecho
[09:18:52] <ema>	 and thanks :)
[09:19:08] <hashar>	 elukey: good morning. Which change/patch are you referring to ?
[09:19:17] <elukey>	 beta: profile::cassandra::allow_analytics: false :)
[09:19:22] <hashar>	 elukey: if it got merged,  there is a crontab entry that automagically rebases the puppet repo for us
[09:19:46] <hashar>	 note I have absolutely no clue what that setting is actually doing and whether it should be false or true on beta :-}
[09:19:47] <elukey>	 hashar: sure, but yesterday due to a cherry picked patch (scap3 + jobrunners) the sync was broken
[09:19:54] <hashar>	 ahhh
[09:20:06] <hashar>	 yeah and I guess the cron failing does not trigger any mail notification so that goes unnoticed
[09:20:07] <godog>	 ema: ack, {{done}}
[09:20:16] <elukey>	 hashar: yeah
[09:20:22] <hashar>	 though there might be a Shinken probe to report an error 
[09:20:55] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: puppetmaster: Set stringify_facts = false [puppet] - 10https://gerrit.wikimedia.org/r/357776
[09:25:36] <hashar>	 elukey: wmflabs might have a way to send email to all project admins. Maybe that could help
[09:26:12] <elukey>	 yeah
[09:37:10] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar)
[09:37:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar)
[09:40:05] <wikibugs>	 (03CR) 10Hashar: "I can't tell whether the self.xx are actually needed :(   But Dan would know for sure!" [puppet] - 10https://gerrit.wikimedia.org/r/357718 (owner: 10Faidon Liambotis)
[09:41:25] <moritzm>	 !log updating mysql-connector-java on hadoop cluster
[09:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:51] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 (owner: 10Faidon Liambotis)
[09:43:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 (owner: 10Faidon Liambotis)
[09:44:09] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis)
[09:44:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis)
[09:46:31] <elukey>	 moritzm: everything looks good, starting the upgrade of the zk eqiad cluster
[09:46:39] <moritzm>	 ok!
[09:48:51] <icinga-wm>	 RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:55] <wikibugs>	 (03CR) 10Hashar: [C: 031] "There is still a $DIR global variable, but that is not important for this standalone script." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis)
[09:50:53] <elukey>	 conf1001 done, all good
[09:51:01] <elukey>	 will do conf1002 shortly
[09:51:09] <wikibugs>	 (03CR) 10Hashar: [C: 031] scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis)
[09:51:11] <icinga-wm>	 RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational
[09:53:53] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167393
[09:54:00] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167393#3331591 (10ops-monitoring-bot)
[09:55:59] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "I cannot reproduce." [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis)
[09:59:23] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3331604 (10GoranSMilovanovic) @Dzahn Thanks a lot!
[10:00:57] <elukey>	 moritzm: all done, zk upgraded everywhere
[10:01:06] <elukey>	 (main-eqiad/main-codfw/druid)
[10:01:32] <wikibugs>	 (03CR) 10Hashar: "I would make that patch to bump rubocop version to 0.49.1 in the Gemfile.  Looks like that is ready to be the final patch in the series \O/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis)
[10:01:41] <moritzm>	 ok, great
[10:01:51] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[10:02:13] <elukey>	 whhhaaat
[10:02:17] <elukey>	 checking --^
[10:03:13] <elukey>	 ah this is a stupid race condition, it failed while restarting the last conf1* node
[10:03:31] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:04:31] <ema>	 looking ^
[10:04:35] <elukey>	 one big spike, seems recovered
[10:04:41] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:04:51] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:05:41] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:08:41] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:08:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Update to 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357783
[10:11:31] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:13:41] <icinga-wm>	 PROBLEM - DPKG on d-i-test is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:14:41] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:16:54] <wikibugs>	 (03CR) 10Hashar: [C: 031] "> 5 seconds on every job is quite a bit -- and on the CI instances with empty pagecaches and in VMs in a shared infrastructure without SSD" [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans)
[10:19:49] <wikibugs>	 (03PS1) 10Volans: Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392)
[10:19:51] <wikibugs>	 (03PS1) 10Volans: Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394)
[10:23:41] <icinga-wm>	 RECOVERY - DPKG on d-i-test is OK: All packages OK
[10:24:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Update symbols for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357786
[10:24:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update to 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357783 (owner: 10Muehlenhoff)
[10:25:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update symbols for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357786 (owner: 10Muehlenhoff)
[10:30:52] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3331701 (10elukey) Better view:  ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'megacli -...
[10:31:13] <wikibugs>	 (03PS1) 10Ema: VCL: update wikiScrape regex [puppet] - 10https://gerrit.wikimedia.org/r/357787
[10:31:41] <volans>	 elukey: random policy per disk? :D
[10:32:23] <elukey>	 volans: yes! trying to set only one for all of them, if you have suggestions please let me know
[10:32:26] <elukey>	 fun times
[10:32:43] <elukey>	 atm I am removing Write cache OK if bad BBU
[10:32:49] <elukey>	 that seems really wrong
[10:33:43] <volans>	 it depends IMHO
[10:33:51] <volans>	 but a bit busy now, I can explain later
[10:35:10] <elukey>	 sure
[10:39:36] <wikibugs>	 (03PS1) 10Hashar: DO NOT SUBMIT: dump git info for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/357788
[10:41:50] <wikibugs>	 (03Abandoned) 10Hashar: DO NOT SUBMIT: dump git info for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/357788 (owner: 10Hashar)
[10:42:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Update man-sections patch for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357790
[10:45:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update man-sections patch for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357790 (owner: 10Muehlenhoff)
[10:49:59] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3310890 (10hashar) Thank you @catrope @demon for the description of how patches are prepared to be tested and the gate system. I endorse your desc...
[10:53:53] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167398
[10:53:57] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167398#3331752 (10ops-monitoring-bot)
[10:55:01] <volans>	 godog: double task today too? what happened?
[10:59:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/355871 (owner: 10Dzahn)
[11:18:34] <wikibugs>	 (03PS1) 10Ayounsi: Add mock rancid ssh key [labs/private] - 10https://gerrit.wikimedia.org/r/357791
[11:18:42] <wikibugs>	 (03CR) 10Ema: [C: 031] "Thanks for fixing this!" [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) (owner: 10Volans)
[11:19:58] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3331797 (10hashar) >>! In T166888#3316057, @greg wrote: > Looking at the data we have it seems that the tests themselves take about [[ https://int...
[11:20:23] <wikibugs>	 (03CR) 10Ema: [C: 031] Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) (owner: 10Volans)
[11:20:42] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/357791 (owner: 10Ayounsi)
[11:21:07] <volans>	 thanks ema!
[11:21:22] <wikibugs>	 (03PS2) 10Volans: Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392)
[11:23:34] <wikibugs>	 (03CR) 10Ayounsi: [V: 032 C: 032] Add mock rancid ssh key [labs/private] - 10https://gerrit.wikimedia.org/r/357791 (owner: 10Ayounsi)
[11:24:47] <wikibugs>	 (03CR) 10Volans: [C: 032] Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) (owner: 10Volans)
[11:25:30] <wikibugs>	 (03Merged) 10jenkins-bot: Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) (owner: 10Volans)
[11:28:26] <wikibugs>	 (03PS2) 10Volans: Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394)
[11:38:21] <icinga-wm>	 PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:13] <moritzm>	 !log powercycling mw1294, mgmt is unresponsive
[11:41:19] <marostegui>	 !log Drop table updates on s7 - T139342
[11:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:32] <stashbot>	 T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342
[11:43:34] <wikibugs>	 (03CR) 10Volans: [C: 032] Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) (owner: 10Volans)
[11:43:45] <wikibugs>	 (03Merged) 10jenkins-bot: Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) (owner: 10Volans)
[11:44:41] <icinga-wm>	 RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms
[11:46:41] <icinga-wm>	 PROBLEM - nutcracker process on mw1294 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker
[11:47:41] <icinga-wm>	 RECOVERY - nutcracker process on mw1294 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[11:53:37] <wikibugs>	 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3331924 (10hashar) And my final reply, following up on T166888#3322963 that does a breakdown of the job build steps.  > 1. It clones the repositor...
[11:58:45] <wikibugs>	 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3331935 (10MoritzMuehlenhoff)
[11:59:06] <wikibugs>	 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3331949 (10MoritzMuehlenhoff) p:05Triage>03Normal
[12:02:17] <wikibugs>	 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3331951 (10MoritzMuehlenhoff)
[12:02:22] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 136.75, 104.80, 73.54
[12:13:21] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 57.10, 77.45, 78.59
[12:13:41] <wikibugs>	 (03PS3) 10Faidon Liambotis: hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719
[12:13:43] <wikibugs>	 (03PS4) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720
[12:13:45] <wikibugs>	 (03PS4) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721
[12:13:47] <wikibugs>	 (03PS4) 10Faidon Liambotis: rubocop: update rubocop to rubocop 0.49.1 [puppet] - 10https://gerrit.wikimedia.org/r/357722
[12:15:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] rubocop: update rubocop to rubocop 0.49.1 [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis)
[12:15:55] <paravoid>	 huh, still didn't like that
[12:16:18] <paravoid>	 maybe it's 0.48 -> 0.49 :)
[12:17:31] <moritzm>	 !log updated hhvm 3.18.2-dfsg-1+wmf5 to apt.wikimedia.org
[12:17:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:46] <moritzm>	 !log uploaded hhvm 3.18.2-dfsg-1+wmf5 to apt.wikimedia.org
[12:17:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:21] <wikibugs>	 (03PS1) 10Faidon Liambotis: Fix another couple instances of RuboCop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357797
[12:19:57] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "Faidon pointed out that rubocop/target_finder.rb lists .fcgi as a ruby extension.  That got introduced in rubocop 0.48.0." [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis)
[12:19:59] <wikibugs>	 (03CR) 10Faidon Liambotis: [V: 032 C: 032] Fix another couple instances of RuboCop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357797 (owner: 10Faidon Liambotis)
[12:23:10] <wikibugs>	 (03CR) 10Hashar: [C: 031] "That ignore is fine. The fault is that rubocop relies solely on the file extension and blindly considers .fcgi files to be ... ruby." [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis)
[12:23:56] <wikibugs>	 (03PS1) 10Faidon Liambotis: nginx: tiny whitespace fix to make RuboCop happy [puppet/nginx] - 10https://gerrit.wikimedia.org/r/357798
[12:24:19] <paravoid>	 hashar: do we do this magic where submodules get updated automatically?
[12:24:57] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] nginx: tiny whitespace fix to make RuboCop happy [puppet/nginx] - 10https://gerrit.wikimedia.org/r/357798 (owner: 10Faidon Liambotis)
[12:25:36] <paravoid>	 god I hate submodules so much
[12:26:14] <wikibugs>	 (03PS1) 10Faidon Liambotis: Update nginx submodule to include lint/rubocop fixes [puppet] - 10https://gerrit.wikimedia.org/r/357799
[12:30:25] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] Update nginx submodule to include lint/rubocop fixes [puppet] - 10https://gerrit.wikimedia.org/r/357799 (owner: 10Faidon Liambotis)
[12:31:38] <wikibugs>	 (03PS3) 10Faidon Liambotis: wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717
[12:31:40] <wikibugs>	 (03PS3) 10Faidon Liambotis: puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718
[12:31:42] <wikibugs>	 (03PS4) 10Faidon Liambotis: hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719
[12:31:44] <wikibugs>	 (03PS5) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720
[12:31:46] <wikibugs>	 (03PS5) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721
[12:31:48] <wikibugs>	 (03PS5) 10Faidon Liambotis: rubocop: update rubocop to rubocop 0.49.1 [puppet] - 10https://gerrit.wikimedia.org/r/357722
[12:38:38] <hashar>	 paravoid: the submodules are not automagically updated by Gerrit in operations/puppet
[12:38:49] <paravoid>	 yeah saw that
[12:38:56] <hashar>	 I guess this way folks can do their hack/dev in the submodule, and bumping the submodule requires an explicit commit
[12:39:41] <hashar>	 the Jenkins job clones all submodules though.  So maybe they will each need to be bumped to rubocop 0.49.1
[12:39:52] <moritzm>	 !log updating mwdebug* to HHVM 3.18.2+wmf5
[12:39:52] <hashar>	 then in operations/puppet the change that bumps rubocop will also bump the submodules
[12:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:06] <paravoid>	 hey, the rubocop 0.49.1 got V+2 :)
[12:40:12] <hashar>	 !!!!
[12:40:37] <hashar>	 AH .rubocop.yml  excludes the git submodules \O/
[12:51:40] <wikibugs>	 (03PS1) 10Faidon Liambotis: rubocop: remove stale comments from _todo.yml [puppet] - 10https://gerrit.wikimedia.org/r/357801
[13:00:04] <jouncebot>	 addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1300). Please do the needful.
[13:01:40] * aude has stuff for swat
[13:01:56] <aude>	 looks like there's nothing else today
[13:06:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "Minor nit but LGTM; the output is satisfactory and the code is overall very clear." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh)
[13:11:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "So, I am conflicted, as this removes a safety net given it's not hard to mess up the format/syntax of redirects.dat." [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis)
[13:12:21] <icinga-wm>	 PROBLEM - HHVM rendering on mw2245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:13:00] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "see comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis)
[13:13:12] <icinga-wm>	 RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 75503 bytes in 0.263 second response time
[13:14:56] <wikibugs>	 (03PS1) 10Hashar: Rake: optimize typos task for CI [puppet] - 10https://gerrit.wikimedia.org/r/357804 (https://phabricator.wikimedia.org/T166888)
[13:15:14] <paravoid>	 haha
[13:15:17] <paravoid>	 I was just fixing that :)
[13:15:28] <hashar>	 ;D
[13:16:48] <hashar>	 bah the run took 1min25s ...
[13:17:25] <paravoid>	 I think this Rakefile needs to be written from scratch with performance in mind honestly
[13:17:41] <hashar>	 what a bold statement ;}
[13:17:56] <paravoid>	 git_changed_in_head() is called a few times now, every time spawning git
[13:18:08] <paravoid>	 most commands don't use it etc.
[13:18:18] <paravoid>	 I think the rakefile should start from there
[13:18:21] <paravoid>	 find the git_changed_in_head
[13:18:24] <hashar>	 it can probably cache the command output 
[13:18:32] <paravoid>	 then filter that, and depending on the files found
[13:18:40] <paravoid>	 'require' (ruby) modules conditionally
[13:18:53] <paravoid>	 and run against them
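The approach paravoid outlines above (ask git for the changed files once, filter that list, then load and run only the checks that apply) can be sketched roughly as follows. This is a Python illustration, not the actual Rakefile, which is Ruby. The suffix-to-check mapping and the check names are hypothetical, and the git invocation is one plausible way to list the files touched by HEAD.

```python
import subprocess
from functools import lru_cache


@lru_cache(maxsize=1)
def git_changed_in_head():
    """Return the files changed in HEAD, spawning git only once per process."""
    out = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return tuple(line for line in out.splitlines() if line)


# Hypothetical mapping from file suffix to the check that handles it.
CHECKS_BY_SUFFIX = {
    ".pp": "puppet_lint",
    ".rb": "rubocop",
    ".erb": "erb_syntax",
}


def select_checks(changed_files):
    """Pick only the checks whose file types appear in the change."""
    selected = set()
    for path in changed_files:
        for suffix, check in CHECKS_BY_SUFFIX.items():
            if path.endswith(suffix):
                selected.add(check)
    return sorted(selected)
```

A runner built on this would call select_checks(git_changed_in_head()) first and import each check's module only if it was selected, so a change touching no Ruby files never pays the cost of loading the Ruby linter.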
[13:19:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: cache: add monitoring of services at the SSL termination level [puppet] - 10https://gerrit.wikimedia.org/r/357805 (https://phabricator.wikimedia.org/T167048)
[13:20:11] <_joe_>	 mobrovac: ^^
[13:20:38] <wikibugs>	 (03PS2) 10Aude: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324)
[13:21:03] <l_bratch>	 Hello - IPv6 for en.wikipedia.org seems to be down, is this known?
[13:21:20] <paravoid>	 no
[13:21:24] <paravoid>	 and I can confirm
[13:21:26] <paravoid>	 XioNoX: ^^^
[13:21:42] <paravoid>	 wtf
[13:22:03] <XioNoX>	 looking
[13:22:09] <XioNoX>	 blackhole?
[13:22:15] <paravoid>	 did you do anything?
[13:22:30] <XioNoX>	 I added v4 blackhole IPs
[13:22:45] <paravoid>	 yeah it's a bug in the blackhole ACL probably
[13:22:48] <paravoid>	 revert
[13:22:56] <XioNoX>	 rolling back
[13:23:25] <paravoid>	 2c8b414bf11ddbe997e731638395dc0352e1dfa2 is new and hasn't been tested before I think :(
[13:23:33] <paravoid>	 probably needs to be split in blackhole4/6
[13:24:14] <XioNoX>	 push in progress with JNT
[13:24:25] <paravoid>	 l_bratch: thanks, I was wondering if my IPv6 was broken, so that definitely helped :)
[13:24:33] <paravoid>	 also: sad that we didn't catch this by alerting :(
[13:24:33] <l_bratch>	 no problem :)
[13:25:02] <mobrovac>	 _joe_: is the service-checker pkg already installed on the caches?
[13:25:21] <_joe_>	 it's included in service::monitoring
[13:25:28] <mobrovac>	 duh ofc
[13:25:28] <mobrovac>	 kk
[13:25:41] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 518.51 seconds
[13:25:49] <marostegui>	 I will get that
[13:25:51] <wikibugs>	 (03CR) 10Mobrovac: [C: 031] cache: add monitoring of services at the SSL termination level [puppet] - 10https://gerrit.wikimedia.org/r/357805 (https://phabricator.wikimedia.org/T167048) (owner: 10Giuseppe Lavagetto)
[13:26:13] <moritzm>	 aude: can you please ping the channel when you're done with swat? I'd like to update some app servers after that
[13:26:29] <wikibugs>	 10Operations, 10Monitoring, 10Patch-For-Review, 10Services (next), and 2 others: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3332136 (10mobrovac) a:05mobrovac>03Joe
[13:26:37] <paravoid>	 _joe_: I still disagree fwiw
[13:26:53] <_joe_>	 paravoid: I'm working on the other option :)
[13:27:08] <paravoid>	 by that logic we should be making all of our checks against all caches :)
[13:27:14] <paravoid>	 but I don't think it's the right thing to do
[13:27:33] <paravoid>	 we should have (and have!) cache checks that check that the cluster is healthy and in-sync etc.
[13:27:35] <_joe_>	 but I got distracted by how hard it is to define that correctly in the current form of role::lvs::balancer, which would be where it would make sense
[13:27:52] <paravoid>	 and then higher-level checks that check their own thing against the service IP
[13:28:01] <wikibugs>	 (03CR) 10Ema: cache: add monitoring of services at the SSL termination level (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357805 (https://phabricator.wikimedia.org/T167048) (owner: 10Giuseppe Lavagetto)
[13:28:35] <paravoid>	 basically: each check should be doing its own thing, not trying to find unrelated (e.g. cache config coherency) issues
[13:28:39] <l_bratch>	 it's back!
[13:28:54] <_joe_>	 paravoid: fair enough
[13:29:31] <_joe_>	 mobrovac: I'll prepare a patch to check just at the LVS level in all DCs
[13:29:55] <XioNoX>	 paravoid, l_bratch, config rollback done
[13:30:17] <paravoid>	 thanks XioNoX
[13:30:23] <paravoid>	 probably worth doing !log for this kind of thing
[13:30:47] <paravoid>	 hashar: so yeah, it's not about caching git_changed_in_head
[13:30:48] <mobrovac>	 _joe_: why at the lvs level now? we already have that?
[13:30:59] <l_bratch>	 looks good XioNoX, thanks :)
[13:31:04] <paravoid>	 hashar: the whole thing should be made lazy
[13:31:04] <_joe_>	 mobrovac: no, I mean the lvs level for the caches
[13:31:12] <mobrovac>	 ah ok
[13:31:13] <XioNoX>	 !log blackhole v4 IPs removed from all cr* routers
[13:31:14] <_joe_>	 not the application lvs
[13:31:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:31] <logmsgbot>	 !log aude@tin Synchronized php-1.30.0-wmf.4/extensions/Wikidata: Fix warning in date formatting T167360 (duration: 02m 16s)
[13:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:41] <stashbot>	 T167360: "Warning: in_array() expects parameter 2 to be an array or collection" from Wikibase MwTimeIsoFormatter - https://phabricator.wikimedia.org/T167360
[13:34:13] <wikibugs>	 (03PS3) 10Bmansurov: Enable ElectronPdf on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954)
[13:35:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix typo in symbols file [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357809
[13:35:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Fix typo in symbols file [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357809 (owner: 10Muehlenhoff)
[13:36:51] <wikibugs>	 (03PS1) 10Faidon Liambotis: Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810
[13:37:15] <XioNoX>	 paravoid: I split v4/v6 in jnt, do you have some time to verify that it's working properly after I push it to a site? (or anyone else with v6 connectivity)
[13:37:29] <paravoid>	 XioNoX: let me review the config first :)
[13:38:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810 (owner: 10Faidon Liambotis)
[13:38:33] <XioNoX>	 paravoid: pushed to git
[13:39:06] <paravoid>	 XioNoX: you should have mentioned that it's not just splitting but also adding those v4 IPs (or made them two separate commits)
[13:39:17] <paravoid>	 but other than that it looks OK
[13:39:32] <paravoid>	 it's a little weird that this happened in the first place though, I'm wondering why
[13:40:30] <XioNoX>	 yeah, I've already seen a case where an empty list made the router block all traffic (it interpreted it as 0/0)
[13:41:10] <XioNoX>	 but yeah, unexpected behavior that it does it only for one protocol
[13:41:25] <logmsgbot>	 !log aude@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[13:41:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:11] <aude>	 ugh
[13:44:24] <XioNoX>	 paravoid: pushing it to ulsfo as it's probably the pop serving the least users for now
[13:44:28] <paravoid>	 no
[13:44:29] <paravoid>	 wait :)
[13:44:47] <XioNoX>	 ok
[13:45:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6694/ says noop so merging" [puppet] - 10https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis)
[13:45:50] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Remove to_i/Integer from now unstringified facts [puppet] - 10https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis)
[13:45:52] <paravoid>	 first I want to understand why this happened
[13:45:55] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove to_i/Integer from now unstringified facts [puppet] - 10https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis)
[13:46:14] <aude>	 but think unrelated...
[13:46:21] <paravoid>	 I see no obvious reason why
[13:46:57] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: puppetmaster: Set stringify_facts = false [puppet] - 10https://gerrit.wikimedia.org/r/357776
[13:47:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetmaster: Set stringify_facts = false [puppet] - 10https://gerrit.wikimedia.org/r/357776 (owner: 10Alexandros Kosiaris)
[13:47:25] <logmsgbot>	 !log aude@tin Synchronized php-1.30.0-wmf.4/extensions/RevisionSlider: Fix fatal error: T167359 (duration: 00m 44s)
[13:47:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:34] <stashbot>	 T167359: Catchable fatal error: Argument 2 passed to RevisionSliderHooks::onDiffViewHeader() must be an instance of Revision, null given  - https://phabricator.wikimedia.org/T167359
[13:47:49] <wikibugs>	 10Operations, 10Labs: virbr0 interface present in some virt hosts - https://phabricator.wikimedia.org/T83732#3332217 (10chasemp) 05Open>03Resolved a:03chasemp It seems not, I'm going to close this but anyone who knows differently please reopen  ```for i in `cat labvirt`; do echo $i; ssh $i.eqiad.wmnet 'i...
[13:47:55] <paravoid>	 XioNoX: any ideas?
[13:48:15] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3332220 (10elukey) Current status is:  ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'meg...
[13:48:21] <XioNoX>	 thinking
[13:49:32] <wikibugs>	 (03CR) 10Aude: [C: 032] Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude)
[13:50:41] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:53:36] <wikibugs>	 (03PS12) 10Faidon Liambotis: mediawiki: puppet compiler for Tim's redirects DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh)
[13:54:14] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] mediawiki: puppet compiler for Tim's redirects DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh)
[13:54:36] <elukey>	 single spike again
[13:55:03] <wikibugs>	 10Operations, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3332228 (10chasemp) 05Resolved>03Open ```elukey@deployment-aqs03:~$ dig -x 10.68.17.125 +short elukey ci-jessie-wi...
[13:56:56] <wikibugs>	 (03PS3) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733
[13:57:57] <wikibugs>	 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332240 (10chasemp)
[13:58:41] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:59:03] <wikibugs>	 (03CR) 10Aude: [C: 032] Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude)
[13:59:15] <_joe_>	 paravoid: don't forget to fix the source/content error :)
[13:59:26] <paravoid>	 _joe_: oh I didn't see that
[14:00:11] <wikibugs>	 (03Merged) 10jenkins-bot: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude)
[14:00:21] <wikibugs>	 (03CR) 10jenkins-bot: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude)
[14:01:29] <wikibugs>	 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332329 (10chasemp) Related?  ```Commit:  d3dc61097073773b308f2cc1bb9352c4aea61be8 Author:  Alexandros Kosiaris <akosiaris@wikimedia.org> Date:    (5 hours ago) 2017-06-08...
[14:02:38] <logmsgbot>	 !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: Do not enable Wikibase data access yet on beta wiktionary (duration: 00m 43s)
[14:02:41] <wikibugs>	 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332347 (10chasemp) This is probably from an operation against this fact:  > sudo facter -p | grep swapsize_mb    swapsize_mb => 24443.99  Where that fact is now a string...
[14:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:20] <aude>	 done
[14:04:21] <icinga-wm>	 PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:21] <icinga-wm>	 PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:04:30] <XioNoX>	 paravoid: "If no conditions match, the router rejects the address. An empty prefix list results in an automatic permit of the tested address."
[14:04:54] <wikibugs>	 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332386 (10chasemp) p:05Triage>03Normal
[14:05:01] <icinga-wm>	 PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:05:21] <icinga-wm>	 RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[14:05:21] <icinga-wm>	 RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:05:51] <icinga-wm>	 RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[14:06:10] <XioNoX>	 paravoid: that's the closest I can find to an explanation, even though it still doesn't make sense
[14:06:30] <XioNoX>	 I can open a ticket with juniper to know more, but not sure that would be helpful anyway
[14:13:07] <wikibugs>	 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332403 (10chasemp) turns out 37b83e8b2c04a58f555ee5627a415561ab792d26 unintentionally resulted in this  ```diff --git a/modules/toollabs/templates/gridengine/host-vmem.er...
[14:14:07] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 04-1] "We do want this, but we have not yet decided when. We should probably wait at least until we have good data in testwikidatawiki.wb_term.t" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup)
[14:14:14] <wikibugs>	 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332404 (10chasemp) Quoting @faidon from irc:  ```yeah my suggestion wrt this would be a) swap = 3*ram is just silly obsolete advice, half a gig of swap should be plenty/e...
[14:14:37] <wikibugs>	 10Operations, 10Labs: host-vmem.erb is doing operations that make no sense - https://phabricator.wikimedia.org/T167412#3332409 (10chasemp)
[14:14:51] <wikibugs>	 10Operations, 10Labs, 10Patch-For-Review, 10cloud-services-team (Kanban): rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955#3332411 (10chasemp)
[14:14:53] <wikibugs>	 10Operations, 10Labs: host-vmem.erb is doing operations that make no sense - https://phabricator.wikimedia.org/T167412#3332240 (10chasemp)
[14:15:03] <wikibugs>	 (03PS1) 10Faidon Liambotis: Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412)
[14:15:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: toollabs: Fix memorysize_mb integer casts [puppet] - 10https://gerrit.wikimedia.org/r/357817 (https://phabricator.wikimedia.org/T167412)
[14:16:10] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) (owner: 10Faidon Liambotis)
[14:16:17] <wikibugs>	 (03CR) 10Rush: [C: 031] toollabs: Fix memorysize_mb integer casts [puppet] - 10https://gerrit.wikimedia.org/r/357817 (https://phabricator.wikimedia.org/T167412) (owner: 10Alexandros Kosiaris)
[14:16:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) (owner: 10Faidon Liambotis)
[14:17:06] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: toollabs: Fix memorysize_mb integer casts [puppet] - 10https://gerrit.wikimedia.org/r/357817 (https://phabricator.wikimedia.org/T167412) (owner: 10Alexandros Kosiaris)
[14:17:58] <wikibugs>	 (03PS2) 10Faidon Liambotis: Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412)
[14:19:09] <wikibugs>	 (03CR) 10Faidon Liambotis: [V: 032 C: 032] Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) (owner: 10Faidon Liambotis)
[14:20:25] <wikibugs>	 (03PS4) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733
[14:20:44] <wikibugs>	 10Operations, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3332427 (10hashar) From my digging:  | May 9th | `652785` | June 8th | `692016`  So I guess 505374 is a few months old.
[14:21:29] <wikibugs>	 (03PS1) 10Faidon Liambotis: raid: remove unused aac, twe, zfs [puppet] - 10https://gerrit.wikimedia.org/r/357819
[14:22:15] <paravoid>	 hashar: any clue what the error is @ https://gerrit.wikimedia.org/r/#/c/357810/ ?
[14:22:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] raid: remove unused aac, twe, zfs [puppet] - 10https://gerrit.wikimedia.org/r/357819 (owner: 10Faidon Liambotis)
[14:23:11] <wikibugs>	 (03PS11) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[14:23:12] <hashar>	 paravoid: the rake task provided by rubocop does not seem to be compatible with rake 12
[14:24:07] <paravoid>	 ah, will be fixed with the new rubocop then?
[14:24:15] <paravoid>	 I'll rebase :)
[14:24:46] <wikibugs>	 (03PS2) 10Faidon Liambotis: raid: remove unused aac, twe, zfs [puppet] - 10https://gerrit.wikimedia.org/r/357819
[14:25:46] <hashar>	 paravoid: that is probably fixed in rubocop 0.38.0   there is a commit that claims to update the task for rake 11 ( https://github.com/bbatsov/rubocop/commit/88a200e59e10868450ceb4316ffc600d9a09b95c )
[14:26:57] <wikibugs>	 (03PS2) 10Faidon Liambotis: Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810
[14:27:41] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[14:29:50] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3332457 (10fgiunchedi)
[14:29:52] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor: Backport python-schedule and add it to jessie-wikimedia - https://phabricator.wikimedia.org/T167287#3332455 (10fgiunchedi) 05Open>03Resolved I've uploaded `schedule` `0.3.2-1~bpo8+1` to Debian `jessie-backports` with its maintainer approval.
[14:30:26] <wikibugs>	 (03PS6) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721
[14:31:15] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 032] rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis)
[14:31:28] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3332462 (10fgiunchedi)
[14:31:30] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3323329 (10fgiunchedi) 05Open>03Resolved I've checked the diff and uploaded `thumbor` `6.3.2+git20170607-1` internally to `jessie-wikimedia`
[14:33:41] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:34:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis)
[14:35:21] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:35:30] <wikibugs>	 (03CR) 10Mforns: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[14:38:21] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:39:34] <wikibugs>	 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3332468 (10fgiunchedi)
[14:39:36] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167393#3332471 (10fgiunchedi)
[14:39:38] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167398#3332472 (10fgiunchedi)
[14:41:21] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:41:41] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[14:43:56] <wikibugs>	 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3325007 (10fgiunchedi) FWIW if we also want to store mail logs off-host a simple solution would be to syslog exim logs too, syslog hosts already have 90d retention in place.
[14:44:41] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:45:21] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:48:01] <wikibugs>	 10Operations, 10Monitoring: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422#3332481 (10Volans)
[14:48:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] "prometheus-node-exporter has been moved to main" [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff)
[14:54:06] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824
[14:54:23] <wikibugs>	 (03CR) 10Volans: "LGTM, small comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357819 (owner: 10Faidon Liambotis)
[14:54:35] <XioNoX>	 !log 2 blackhole IPs pushed to cr* routers
[14:54:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:51] <moritzm>	 !log updating mw1261 to HHVM 3.18.2+wmf5
[14:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:04] <wikibugs>	 (03PS1) 10Ayounsi: Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288)
[15:09:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3332563 (10fgiunchedi)
[15:10:48] <wikibugs>	 (03CR) 10Volans: "The puppet part looks good, a minor comment inline. I'm not familiar with Rancid to review that part." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi)
[15:11:13] <XioNoX>	 !log Upgrading rancid to 3 - T167288
[15:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:23] <stashbot>	 T167288: Rancid improvements - https://phabricator.wikimedia.org/T167288
[15:11:35] <wikibugs>	 (03CR) 10Volans: "It will be nice to have the results of a puppet compiler too to verify" [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi)
[15:14:29] <wikibugs>	 10Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3332571 (10ema) I've collected 60s of ICMP traffic from GCE on the load balancers and sent a report through https://support.google.com/code/contact/cloud_platform_report?hl=en. I've also added a c...
[15:15:22] <volans>	 thanks ema!
[15:15:32] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: lvs: Remove all bgp keywords from configuration [puppet] - 10https://gerrit.wikimedia.org/r/356790
[15:19:41] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 49.41 seconds
[15:20:12] <urandom>	 elukey: thanks for the deployment-prep aqs re-init
[15:20:21] <icinga-wm>	 PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:22] <icinga-wm>	 PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:20:57] <urandom>	 elukey: i can't believe it, but that seems to have fixed the restbase cluster
[15:21:01] <wikibugs>	 (03PS2) 10Ayounsi: Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288)
[15:21:11] <icinga-wm>	 RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[15:21:12] <icinga-wm>	 RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:21:53] <urandom>	 elukey: i was honestly expecting to have to re-init both
[15:22:42] <moritzm>	 !log updating mw1262-mw1265 to HHVM 3.18.2+wmf5
[15:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:53] <wikibugs>	 (03CR) 10Ayounsi: Rancid improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi)
[15:24:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] scap: fix rubocop warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis)
[15:24:47] <elukey>	 urandom: Marko did the same work in the restbase cluster :)
[15:24:54] <elukey>	 no magic sadly :)
[15:25:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis)
[15:26:11] <urandom>	 oooooh
[15:26:26] <urandom>	 elukey: actually, that makes me feel somewhat better
[15:26:50] <urandom>	 because the more i thought about it, the more i was having a hard time believing it
[15:27:00] <urandom>	 mobrovac: and, thank you :)
[15:27:03] <_joe_>	 akosiaris: before you merge https://gerrit.wikimedia.org/r/356790
[15:27:13] <_joe_>	 akosiaris: https://gerrit.wikimedia.org/r/357824
[15:27:15] <_joe_>	 :)
[15:27:15] <wikibugs>	 10Operations, 10ops-eqiad, 10Dumps-Generation: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3332602 (10Cmjohnson)
[15:27:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810 (owner: 10Faidon Liambotis)
[15:27:35] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3332608 (10Cmjohnson)
[15:27:53] <_joe_>	 akosiaris: still working on that unholy mess btw
[15:28:07] <akosiaris>	 _joe_: hehe, should I rebase on top of yours ?
[15:28:15] <wikibugs>	 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3332611 (10Cmjohnson)
[15:28:27] <_joe_>	 akosiaris: I'd say wait until I tell you so
[15:28:30] <akosiaris>	 ok
[15:28:53] <_joe_>	 it shouldn't take too long, but in the end I hope to be able to reduce a little bit of the 4-level entanglement that those classes are
[15:29:01] <wikibugs>	 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3332619 (10Cmjohnson)
[15:29:32] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3332623 (10Cmjohnson)
[15:29:47] <wikibugs>	 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup  wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3332624 (10Cmjohnson)
[15:30:15] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824
[15:30:24] <_joe_>	 tbh, the lvs hierarchy is my last white whale of our puppet repo
[15:31:00] <_joe_>	 well, excluding analytics, ci and labs, where I read the sign "here be dragons" and I didn't look back :P
[15:31:31] <_joe_>	 but really, those classes defied most of my previous attempts to untangle them
[15:31:43] <paravoid>	 akosiaris: i think it's better honestly
[15:31:48] <mark>	 _joe_: don't you like my work?
[15:32:02] <_joe_>	 mark: for puppet 0.25? yes
[15:32:04] <akosiaris>	 paravoid: ?
[15:32:05] <_joe_>	 :)
[15:32:06] <mark>	 ;-)
[15:32:12] <paravoid>	 akosiaris: the 0o
[15:32:14] <akosiaris>	 paravoid: please tell me you don't refer to 0o
[15:32:18] <wikibugs>	 (03CR) 10Ayounsi: "Compiler is happy: https://puppet-compiler.wmflabs.org/6699/" [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi)
[15:32:22] <akosiaris>	 oh come on
[15:32:29] <akosiaris>	 where have you ever seen this syntax ?
[15:32:45] <paravoid>	 "prefixing with 0" automatically means octal is very evil too
[15:33:05] <akosiaris>	 yeah but that has practically been the case since the epoch
[15:33:36] <akosiaris>	 introducing a new weird syntax to solve the evilness of prefixing with 0 is not exactly solving the problem
[15:34:18] <paravoid>	 python 3 did the same btw
[15:34:22] <paravoid>	 and removed the old syntax entirely
[15:34:57] <paravoid>	 In [1]: 02755
[15:34:57] <paravoid>	   File "<ipython-input-1-6bbdb9f24ced>", line 1
[15:34:57] <paravoid>	     02755
[15:34:57] <paravoid>	         ^
[15:34:57] <paravoid>	 SyntaxError: invalid token
[15:34:59] <volans>	 python3 forces you to do it ;)
[15:35:00] <paravoid>	 In [2]: 0o2755
[15:35:02] <paravoid>	 Out[2]: 1517
[15:35:05] <paravoid>	 fyi :)
[15:35:07] <paravoid>	 yeah
[15:35:13] <paravoid>	 ruby was nicer and actually made it a style guide thing
[15:35:38] <paravoid>	 the rationale is that this:
[15:35:39] <paravoid>	 In [1]: 02755
[15:35:39] <paravoid>	 Out[1]: 1517
[15:35:42] <paravoid>	 is really super confusing
[15:35:52] <paravoid>	 leading zeros are totally legit in decimals
[15:36:07] <akosiaris>	 omg .. python3 does that ?
[15:36:18] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824
[15:36:18] <paravoid>	 yup :)
[15:36:20] <akosiaris>	 one more reason I won't care much about it for the next 10 years
[15:36:27] <paravoid>	 wanna bet?
[15:36:30] <volans>	 rotfl
[15:36:37] <akosiaris>	 well... data is with me
[15:36:43] <akosiaris>	 python3 has been around for 10 years already
[15:36:58] <akosiaris>	 yet python 2 is still around
[15:36:58] <paravoid>	 you know python 2 will go EOL in 2020 right :)
[15:37:20] <wikibugs>	 (03PS12) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[15:37:21] <akosiaris>	 yeah I am wondering already how many times that will be extended
[15:37:21] <paravoid>	 buster (stretch+1) probably won't ship with python 2
[15:37:33] <paravoid>	 doubt it will
[15:37:37] <_joe_>	 paravoid: so we have to convert pybal to something samer
[15:37:43] <_joe_>	 *saner
[15:37:45] <bd808>	 writing new code in python2 is a bad idea
[15:38:03] <akosiaris>	 s/in python2//
[15:38:08] <akosiaris>	 there.. fixed that for ya :P
[15:38:13] <volans>	 akosiaris: you can use int('02755', 8) if you prefer :-P
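The octal exchange above can be checked with a short sketch. It is written for Python 3 and only illustrates the behaviour quoted in the channel (0o2755 evaluating to 1517, the bare 02755 literal being rejected, and int('02755', 8) as the string-based alternative). The names MODE, parse_octal and old_style_literal_is_rejected are made up for this example and appear nowhere in the discussion.

```python
import stat

# Python 3 spelling of the mode pasted in the channel.
# Writing a bare 02755 here would be a SyntaxError on Python 3.
MODE = 0o2755


def parse_octal(text):
    """Parse a chmod-style octal string such as '02755' or '0o2755'."""
    return int(text, 8)


def old_style_literal_is_rejected(source="02755"):
    """Return True if this interpreter refuses the leading-zero octal literal."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return True
    return False


if __name__ == "__main__":
    print(MODE)                             # decimal value of the mode
    print(parse_octal("02755"))             # same value, parsed from a string
    print(oct(MODE))                        # back to the 0o-prefixed form
    print(bool(MODE & stat.S_ISGID))        # the leading 2 is the setgid bit
    print(old_style_literal_is_rejected())  # Python 3 rejects the old syntax
```

Parsing with an explicit base keeps the intent visible at the call site, which is the point volans makes with int('02755', 8).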
[15:38:19] <bd808>	 the libraries have caught up which was the big knock on python3 for years
[15:38:24] <bd808>	 akosiaris: :)
[15:38:25] <jynus>	 bd808, they said that of my fortran code...
[15:38:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto)
[15:38:26] <paravoid>	 not all of them
[15:38:27] <_joe_>	 bd808: not really?
[15:38:29] <paravoid>	 but most of them!
[15:38:50] <akosiaris>	 volans: overall I prefer not to use octals
[15:38:50] <_joe_>	 I keep stumbling in python2-only things
[15:39:05] <paravoid>	 chmod 1517?
[15:39:15] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM for the puppet side, I left to the netops folks the rancid config ;)" [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi)
[15:39:16] <bd808>	 any library that doesn't support python3 is either dead tech or has a community that is out of touch with the rest of the world
[15:39:21] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 117.62, 100.06, 77.25
[15:39:33] <paravoid>	 come on :)
[15:39:37] <wikibugs>	 (03PS3) 10Ayounsi: Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288)
[15:39:40] <bd808>	 Not trying to throw shame here, just my experience
[15:39:44] <paravoid>	 In [1]: 02755
[15:39:44] <paravoid>	 Out[1]: 1517
[15:39:45] <paravoid>	 er
[15:39:48] <paravoid>	 https://github.com/cea-hpc/clustershell/commits/master :)
[15:40:17] <akosiaris>	 lol
[15:40:25] <paravoid>	 also, diamond is python2
[15:40:27] <paravoid>	 and I think ansible too?
[15:40:43] <bd808>	 and you are convincing me of which point I made? ;)
[15:40:45] <akosiaris>	 yes
[15:40:48] <paravoid>	 hey look http://py3readiness.org/ :)
[15:41:25] <akosiaris>	 I see a few tools that are very important to us in that list
[15:41:42] <volans>	 ansible, carbon, diamond lol
[15:41:43] <bd808>	 for instance python-ldap is on that list as py2 only, but ldap3 is way way nicer to use
[15:41:43] <paravoid>	 but yeah you can at least use python3-mostly-compatible code if your library doesn't support it
[15:42:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 032] Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi)
[15:42:31] <wikibugs>	 (03CR) 10Mforns: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[15:42:50] <akosiaris>	 bd808: true but the fact ldap3 is not in those 360 most popular packages says something
[15:43:21] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 48.37, 77.47, 73.74
[15:43:43] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167426
[15:43:47] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167426#3332659 (10ops-monitoring-bot)
[15:43:49] <paravoid>	 we're getting there I think
[15:43:52] <paravoid>	 343/360 is pretty good
[15:44:09] <akosiaris>	 yeah but my guess is a long tail distribution
[15:44:22] <akosiaris>	 which means those few left might very well end up being done around 2020
[15:44:25] <volans>	 godog: so ms-be1019 is making the BBU check flap, which in turn triggers icinga, which triggers the raid_handler
[15:44:43] <akosiaris>	 or never for that matter
[15:44:44] <volans>	 what do you want to do? downtime, disable notification, fix the issue :D
[15:44:58] <bd808>	 akosiaris: it says that people don't write new code that talks to ldap very often I think, but YMMV
[15:45:13] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3332664 (10Gilles) It seems like this type...
[15:45:15] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824
[15:45:20] <godog>	 volans: heh I'll downtime for now, we're waiting for the replacement battery from hp
[15:45:44] <volans>	 eheheh ok :)
[15:45:46] <volans>	 thanks
[15:46:10] <godog>	 np
[15:46:16] <akosiaris>	 bd808: yeah but it also says that there are enough projects out there that are still using python-ldap. My guess is they will still do until they can't do otherwise
[15:46:19] <bd808>	 striker and other things I've written use ldap3 because besides being nicer to work with it is a pure python lib and not a wrapper around libldap so it's nicer for virtualenvs
[15:47:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto)
[15:48:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/6703/lvs1001.wikimedia.org/ it's a noop, I'll fix the validation in the next ps but this can basically" [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto)
[15:49:40] <wikibugs>	 (03PS1) 10Cmjohnson: Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836
[15:51:21] <wikibugs>	 (03PS2) 10Cmjohnson: Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836
[15:51:52] <wikibugs>	 (03CR) 10Cmjohnson: [C: 032] Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836 (owner: 10Cmjohnson)
[15:52:25] <wikibugs>	 (03PS1) 10Papaul: DNS: Add mgmt and production DNS for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/357841
[15:53:01] <wikibugs>	 (03PS3) 10Cmjohnson: Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836
[15:53:14] <wikibugs>	 (03CR) 10Ema: "> https://puppet-compiler.wmflabs.org/6703/lvs1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto)
[15:55:53] <wikibugs>	 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3332697 (10Gilles) Nevermind, it's because...
[15:56:46] <wikibugs>	 (03CR) 10RobH: [C: 032] DNS: Add mgmt and production DNS for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/357841 (owner: 10Papaul)
[15:56:50] <wikibugs>	 (03PS2) 10RobH: DNS: Add mgmt and production DNS for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/357841 (owner: 10Papaul)
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1600). Please do the needful.
[16:00:04] <jouncebot>	 bmansurov and AaronSchulz: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[16:00:12] <bmansurov>	 here
[16:01:48] <godog>	 bmansurov: I'll take a look
[16:01:54] <bmansurov>	 thanks
[16:03:29] <godog>	 bmansurov: this is scheduled in puppet swat but it is for mediawiki-config ?
[16:03:43] <bmansurov>	 oops, my bad
[16:04:08] <bmansurov>	 i'll move it to the morning swat
[16:04:50] <godog>	 bmansurov: I can deploy it though if it is urgent?
[16:05:09] <bmansurov>	 godog, it's not urgent, but I'd appreciate it if you can deploy it
[16:06:09] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824
[16:06:31] <godog>	 bmansurov: ok! if it isn't an emergency I'd rather go through morning swat
[16:06:45] <bmansurov>	 ok, that sounds good too
[16:08:31] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2161528
[16:09:17] <godog>	 AaronSchulz: merging https://phabricator.wikimedia.org/T165651
[16:09:19] <wikibugs>	 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3332797 (10RobH)
[16:09:53] <wikibugs>	 (03PS5) 10Filippo Giunchedi: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz)
[16:12:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz)
[16:14:39] <wikibugs>	 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3332839 (10Papaul) @chasemp do I have to put labtestnet2002 both eth0 and eth1  under labs-hosts1-b-codfw network or just plug eth1 and not put it under that network? Same...
[16:21:39] <wikibugs>	 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3332874 (10RobH) IRC Update: We chatted about this, basically he wanted to know if we had to setup dns for both interfaces (eth0 and eth1) prior to installation.  Since on...
[16:23:10] <wikibugs>	 (03CR) 10BBlack: role::lvs::balancer: convert to role/profile (step 1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto)
[16:23:11] <wikibugs>	 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3332875 (10Nuria)
[16:23:42] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167434
[16:23:51] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167434#3332878 (10ops-monitoring-bot)
[16:24:05] <bblack>	 !log cp1074: varnish-backend-restart for mailbox lag
[16:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:52] <wikibugs>	 (PS7) BBlack: numa_networking: add facter data from sysfs [puppet] - https://gerrit.wikimedia.org/r/355809
[16:24:54] <wikibugs>	 (PS8) BBlack: numa_networking: support NUMA in interface::rps [puppet] - https://gerrit.wikimedia.org/r/355810
[16:24:56] <wikibugs>	 (PS8) BBlack: numa_networking: support NUMA in tlsproxy nginx config [puppet] - https://gerrit.wikimedia.org/r/355811
[16:24:58] <wikibugs>	 (PS1) BBlack: numa_networking: test enable on cp4021 [puppet] - https://gerrit.wikimedia.org/r/357844
[16:25:12] <godog>	 volans: not sure what to do about the ms-be1019 alert, downtime didn't work apparently
[16:25:43] <volans>	 godog: looking
[16:26:45] <volans>	 godog: maybe the icinga restart in the middle has to do with it?
[16:26:46] <volans>	 https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=ms-be1019&service=HP+RAID
[16:26:51] <volans>	 I'd suggest disabling notifications
[16:27:21] <wikibugs>	 Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3332904 (elukey) @Cmjohnson sorry to ping :) Any idea if we have a spare BBU for db1046?
[16:28:30] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0
[16:28:46] <godog>	 volans: good idea, I'll do that
[16:28:55] <volans>	 godog: actually no
[16:29:00] <volans>	 there is a disable event_handler
[16:29:10] <volans>	 i completely forgot about it
[16:29:12] <volans>	 sorry
[16:29:30] <wikibugs>	 Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3332932 (Cmjohnson) @elukey. Yes, I have another decommissioned r510 to take it from.  Ping you in an hour or so to replace
[16:29:32] <volans>	 the handler handling is separated from the notification handling
[16:29:48] <godog>	 ack, done the handler disabling
[16:29:56] <volans>	 that's probably why, the downtime suppresses notifications and not handlers
[16:29:58] <godog>	 we'll need to find a way to remember to re-enable it
[16:30:00] <volans>	 I guess
[16:30:03] <volans>	 yeah
[16:30:09] <volans>	 that's the hard part
[16:30:46] <godog>	 meta-alert about disabled handlers to the rescue
[16:31:07] <volans>	 lol
[16:33:02] <godog>	 !log delete net.ifnames for ms-be2001 and ms-be2013 - T158429
[16:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:14] <stashbot>	 T158429: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429
[16:35:20] <paravoid>	 godog: nice!
[16:36:02] <godog>	 paravoid: yeah! quite straightforward really, delete the argument from /etc/default/grub ; update-grub and s/eth0/eno1/ in /etc/network/interfaces
[16:36:14] <godog>	 no nefarious effects observed yet
[16:37:33] <volans>	 godog: 70-persistent-net.rules ?
[16:37:45] <volans>	 is replaced by this?
[16:38:29] <volans>	 looks like, nice!
[16:40:17] <logmsgbot>	 !log nuria@tin Started deploy [analytics/refinery@2fbed63]: (no justification provided)
[16:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:02] <godog>	 aye
[16:41:43] <wikibugs>	 Operations, ops-eqiad, Analytics-Kanban, DBA, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3332986 (elukey) @Cmjohnson thanks! Would it be possible to do the swap next week? Since this is an important DB I'd need to coordinate my team and Jaime/Manuel first.
[16:44:26] <logmsgbot>	 !log nuria@tin Finished deploy [analytics/refinery@2fbed63]: (no justification provided) (duration: 04m 08s)
[16:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:19] <wikibugs>	 (PS1) Jdlrobson: Update logo and dimensions for SR wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896)
[16:49:37] <wikibugs>	 (PS3) Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074
[16:51:03] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3333018 (faidon) Thanks for all the detailed responses from all of you, it's really appreciated. It's also great to see Docker patches proposed...
[16:52:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074 (owner: Dzahn)
[16:53:07] <wikibugs>	 (PS5) Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - https://gerrit.wikimedia.org/r/354078
[16:54:00] <wikibugs>	 (PS1) BBlack: numa_networking: remove install-time bnx2x stuff [puppet] - https://gerrit.wikimedia.org/r/357850
[16:54:27] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: let Apache proxy only listen on service IP [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[16:54:31] <wikibugs>	 (PS4) Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - https://gerrit.wikimedia.org/r/354074
[16:55:51] <wikibugs>	 (PS6) Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - https://gerrit.wikimedia.org/r/354078
[17:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1700).
[17:01:41] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Add listening to localhost too, or bad things will happen(TM)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[17:02:37] <halfak>	 Nothing for ORES
[17:02:39] <halfak>	 SOON
[17:03:10] <wikibugs>	 (CR) Bmansurov: Update logo and dimensions for SR wordmark (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: Jdlrobson)
[17:07:50] <wikibugs>	 Operations, Discovery, Elasticsearch, Discovery-Search (Current work), Patch-For-Review: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3333065 (debt) p:Triage>Normal
[17:09:30] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 122.22, 100.48, 78.95
[17:11:52] <wikibugs>	 (CR) Paladox: gerrit: let Apache proxy only listen on service IP (1 comment) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[17:12:24] <wikibugs>	 (CR) Paladox: gerrit: switch to base::service_unit and systemd (2 comments) [puppet] - https://gerrit.wikimedia.org/r/356516 (owner: Dzahn)
[17:13:31] <wikibugs>	 Operations, Reading-Infrastructure-Team-Backlog, Services-next, Security-General, Services (next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#3333190 (Fjalapeno)
[17:13:39] <wikibugs>	 Operations, Performance-Team, Reading-Infrastructure-Team-Backlog, Reading-Web-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3333191 (Fjalapeno)
[17:18:30] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 59.84, 77.19, 77.85
[17:26:13] <wikibugs>	 (PS1) Andrew Bogott: designate.conf:  Replace identity_uri setting [puppet] - https://gerrit.wikimedia.org/r/357853
[17:29:31] <wikibugs>	 (PS2) Andrew Bogott: designate.conf:  Replace identity_uri setting [puppet] - https://gerrit.wikimedia.org/r/357853
[17:31:41] <wikibugs>	 (CR) Andrew Bogott: [C: 2] designate.conf:  Replace identity_uri setting [puppet] - https://gerrit.wikimedia.org/r/357853 (owner: Andrew Bogott)
[17:36:43] <logmsgbot>	 !log arlolra@tin Started deploy [parsoid/deploy@f82cb4f]: Updating Parsoid to 108eed81
[17:36:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:45] <wikibugs>	 Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333335 (Gilles)
[17:41:52] <wikibugs>	 Operations, netops, Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3333354 (ayounsi) > I think there's a lot of value in doing so. Agreed on the rest. Converted!  > Upgrade to 3.6.2 Done  > Switch from CVS to GIT Done  > Replace password auth with ssh key auth Done,...
[17:44:55] <wikibugs>	 (PS1) Jdlrobson: Undeploy Cards [mediawiki-config] - https://gerrit.wikimedia.org/r/357858 (https://phabricator.wikimedia.org/T167452)
[17:45:04] <wikibugs>	 (CR) Jdlrobson: [C: -1] "Need to wait until next Thursday deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/357858 (https://phabricator.wikimedia.org/T167452) (owner: Jdlrobson)
[17:46:09] <wikibugs>	 (PS1) Cmjohnson: Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076 [puppet] - https://gerrit.wikimedia.org/r/357860
[17:46:55] <logmsgbot>	 !log arlolra@tin Finished deploy [parsoid/deploy@f82cb4f]: Updating Parsoid to 108eed81 (duration: 10m 12s)
[17:47:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:23] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3333443 (Cmjohnson)
[17:49:33] <wikibugs>	 (CR) RobH: [C: 2] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002,  [puppet] - https://gerrit.wikimedia.org/r/357860 (owner: Cmjohnson)
[17:51:01] <wikibugs>	 (PS2) Cmjohnson: Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076 [puppet] - https://gerrit.wikimedia.org/r/357860
[17:51:05] <wikibugs>	 (CR) Cmjohnson: [V: 2] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002,  [puppet] - https://gerrit.wikimedia.org/r/357860 (owner: Cmjohnson)
[17:55:58] <arlolra>	 !log Updated Parsoid to 108eed81 (T136653, T167081)
[17:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:09] <stashbot>	 T167081: Broken rendering of inserted template (incorrect comment stripping for style attribute) - https://phabricator.wikimedia.org/T167081
[17:56:09] <stashbot>	 T136653: Parsoid doesn't recognize interwiki shortcuts in the href attribute - https://phabricator.wikimedia.org/T136653
[17:56:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[17:57:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:57:21] <icinga-wm>	 PROBLEM - HHVM rendering on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:57:52] <wikibugs>	 (PS2) Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - https://gerrit.wikimedia.org/r/357863
[17:58:10] <icinga-wm>	 RECOVERY - Apache HTTP on mw2131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.071 second response time
[17:58:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw2131 is OK: HTTP OK: HTTP/1.1 200 OK - 75442 bytes in 0.153 second response time
[18:00:04] <jouncebot>	 addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1800). Please do the needful.
[18:00:05] <jouncebot>	 James_F, bmansurov, MatmaRex, and MaxSem: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:14] <MaxSem>	 I can deploy
[18:00:22] <James_F>	 Heya.
[18:00:28] <MatmaRex>	 hi
[18:00:32] <bmansurov>	 here
[18:00:47] <MatmaRex>	 MaxSem: it looks like wmf.4 is not actually live anywhere right now, correct?
[18:01:06] <RainbowSprinkles>	 Not true ^
[18:01:11] <MaxSem>	 group0
[18:01:18] <MatmaRex>	 https://www.mediawiki.org/wiki/Special:Version says wmf.2
[18:01:22] <MaxSem>	 http://tools.wmflabs.org/versions/
[18:01:37] <greg-g>	 MaxSem: MatmaRex test wikis but not mw.org
[18:01:51] <greg-g>	 https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wikiversions.json;8d49ecd178ac84beb247ffca51c967f8a0e6c1dc$757
[18:01:53] <MatmaRex>	 ah, i see. https://test.wikipedia.org/wiki/Special:Version is wmf.4
[18:01:56] <RainbowSprinkles>	 I thought it moved forward yesterday on my day off, guess I misread the bugmail I had
[18:02:25] <greg-g>	 we're clear to move forward today though, aude fixed up the last major logspam blocker
[18:02:43] <wikibugs>	 (PS6) MaxSem: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - https://gerrit.wikimedia.org/r/354731 (owner: Jforrester)
[18:02:48] <wikibugs>	 (CR) MaxSem: [C: 2] Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - https://gerrit.wikimedia.org/r/354731 (owner: Jforrester)
[18:03:39] <MaxSem>	 bleh Catchable fatal error: Argument 1 passed to EditPage::displayViewSourcePage()
[18:03:47] <wikibugs>	 (Merged) jenkins-bot: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - https://gerrit.wikimedia.org/r/354731 (owner: Jforrester)
[18:03:49] <MaxSem>	 must implement interface Content, null given
[18:04:05] <James_F>	 MaxSem: Where's that from?
[18:04:14] <MaxSem>	 from fatalmonitor
[18:04:34] <James_F>	 I mean, from prod or from Beta Cluster?
[18:04:35] <Reedy>	 helpful answer MaxSem
[18:04:35] <Reedy>	 :P
[18:04:40] <icinga-wm>	 PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:04:50] <James_F>	 Is it suddenly arising or has it been happening for a while?
[18:05:12] <RainbowSprinkles>	 #til people still use `fatalmonitor`
[18:05:17] <MaxSem>	 don't remember it yesterday
[18:05:27] <MatmaRex>	 MaxSem: https://phabricator.wikimedia.org/T161199 ?
[18:05:47] <MaxSem>	 RainbowSprinkles, it gives you better reaction time than kibana
[18:05:54] <wikibugs>	 (CR) jenkins-bot: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - https://gerrit.wikimedia.org/r/354731 (owner: Jforrester)
[18:06:10] <MaxSem>	 yep, MatmaRex 
[18:06:42] <MatmaRex>	 well, it "Needs Triage", but it's been happening for two months and change
[18:07:08] <MaxSem>	 James_F, pulled on mwdebug1002
[18:08:20] <RainbowSprinkles>	 MaxSem: I haven't logged into fluorine (or whatever we're using now) in about a year or two ;-)
[18:08:29] <James_F>	 MaxSem: Looks good.
[18:08:37] <MaxSem>	 it's mwlog1001 now
[18:09:29] <RainbowSprinkles>	 Either way, haven't needed it :p
[18:09:52] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/354731/6 (duration: 00m 44s)
[18:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:01] <jdlrobson>	 there's a deploy now? I completely failed there by lining stuff up for later :/
[18:10:28] <James_F>	 jdlrobson: Yup, 11:00 SF Monday/Wednesday/Thursdays.
[18:10:30] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 136.10, 106.99, 79.57
[18:10:36] <MaxSem>	 bmansurov, yt?
[18:10:40] <jdlrobson>	 yeh i see that now :)
[18:10:47] <bmansurov>	 MaxSem, yes
[18:10:58] <James_F>	 jdlrobson: The fact that it's /not/ there on Tuesdays (because of the train) trips people up. :-)
[18:11:02] <jdlrobson>	 MaxSem: if you have space, i've got two for this evening which can go out now instead and save whoever's doing them later the hassle if you like
[18:11:25] <MaxSem>	 if we have time
[18:12:13] <wikibugs>	 (PS4) MaxSem: Enable ElectronPdf on all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: Bmansurov)
[18:12:17] <wikibugs>	 (CR) MaxSem: [C: 2] Enable ElectronPdf on all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: Bmansurov)
[18:13:37] <jdlrobson>	 MaxSem: ill update wik... thanks :)
[18:13:42] <wikibugs>	 (Merged) jenkins-bot: Enable ElectronPdf on all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: Bmansurov)
[18:14:30] <MaxSem>	 bmansurov, pulled on mwdebug1002
[18:14:38] <bmansurov>	 ok checking
[18:15:45] <bmansurov>	 MaxSem, works
[18:15:52] <MaxSem>	 milimetric, I still see that dashiki error in logs
[18:15:54] <wikibugs>	 (CR) jenkins-bot: Enable ElectronPdf on all projects [mediawiki-config] - https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: Bmansurov)
[18:17:30] <logmsgbot>	 !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/356881/4 (duration: 00m 44s)
[18:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:38] <MaxSem>	 bmansurov, ^
[18:17:53] <bmansurov>	 MaxSem, thanks!
[18:18:55] <milimetric>	 thx MaxSem, just looking at it (back from medical leave today)
[18:20:55] <wikibugs>	 (PS3) MaxSem: Remove semanticness from another place [mediawiki-config] - https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642)
[18:21:20] <wikibugs>	 (CR) MaxSem: [C: 2] Remove semanticness from another place [mediawiki-config] - https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: MaxSem)
[18:22:50] <icinga-wm>	 PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:24:20] <wikibugs>	 (Merged) jenkins-bot: Remove semanticness from another place [mediawiki-config] - https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: MaxSem)
[18:25:49] <wikibugs>	 (CR) jenkins-bot: Remove semanticness from another place [mediawiki-config] - https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: MaxSem)
[18:25:53] <logmsgbot>	 !log maxsem@tin Synchronized multiversion/submodules.json: https://gerrit.wikimedia.org/r/#/c/352985/3 (duration: 00m 43s)
[18:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:41] <MaxSem>	 dear Zuul, did you not get enough human sacrifice already?
[18:29:30] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 55.26, 66.21, 79.54
[18:31:01] <wikibugs>	 Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333560 (Gilles) Note to self: SVG needs...
[18:34:03] <MaxSem>	 MatmaRex, pulled on mwdebug1002
[18:34:40] <icinga-wm>	 PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[isc-dhcp-server]
[18:34:52] <MatmaRex>	 MaxSem: yupp. works fine
[18:36:31] <logmsgbot>	 !log maxsem@tin Synchronized php-1.30.0-wmf.4/includes/EditPage.php: https://gerrit.wikimedia.org/r/#/c/357855/ (duration: 00m 45s)
[18:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:56] <MaxSem>	 MatmaRex, ^
[18:37:00] <MatmaRex>	 thanks MaxSem
[18:37:53] <urandom>	 is there an app/command that provides page faults as a rate?
[18:38:07] <urandom>	 or do i need to grok this out of ps or something?
[18:38:25] <wikibugs>	 (PS1) Papaul: DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - https://gerrit.wikimedia.org/r/357869
[18:38:26] <Zppix>	 urandom:  i'd assume theres an api maybe the amount of views of 404 pages or special:badtitle?
[18:38:54] <urandom>	 Zppix: sorry, i should have said 'linux page faults'
[18:39:11] <urandom>	 where page is in the context of disk/storage
[18:39:13] <Zppix>	 urandom:  oh in that case i have no clue xD
[18:40:08] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3333601 (RobH)
[18:40:33] <wikibugs>	 (PS1) Cmjohnson: Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 T165173 T166264 T165531 T165520 T162216 T166076 [dns] - https://gerrit.wikimedia.org/r/357870
[18:40:38] <logmsgbot>	 !log maxsem@tin Synchronized php-1.30.0-wmf.4/extensions/LoginNotify/: https://gerrit.wikimedia.org/r/#/c/357743/ (duration: 00m 44s)
[18:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:52] <mutante>	 !log built gerrit_2.13.8+git1-wmf.5 on copper (T158946)
[18:42:59] <jdlrobson>	 w00t
[18:43:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:02] <stashbot>	 T158946: Update gerrit to 2.13.8 - https://phabricator.wikimedia.org/T158946
[18:43:32] <urandom>	 got it: sar -B 1 10
[18:43:42] <wikibugs>	 (PS2) Papaul: DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - https://gerrit.wikimedia.org/r/357869
[18:45:26] <wikibugs>	 (PS2) Cmjohnson: Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005/6 T165366 T165368  T165173 T166264 T165531 T165520 T162216 T166076 [dns] - https://gerrit.wikimedia.org/r/357870
[18:47:20] <wikibugs>	 (CR) Cmjohnson: [C: 2] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005 [dns] - https://gerrit.wikimedia.org/r/357870 (owner: Cmjohnson)
[18:49:33] <wikibugs>	 Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3333632 (Cmjohnson)
[18:49:43] <urandom>	 !log Restarting Cassandra, restbase-dev1001-a to test alternative disk access mode
[18:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:17] <wikibugs>	 Operations, ops-eqiad, Dumps-Generation, Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (Cmjohnson) a:Cmjohnson>RobH Mac address has been added to dhcpd file. Assigning to @robh for install
[18:50:38] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3333636 (Cmjohnson)
[18:52:00] <icinga-wm>	 PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[isc-dhcp-server]
[18:52:02] <MaxSem>	 jdlrobson, pulled on mwdebug1002
[18:53:13] <jdlrobson>	 MaxSem: testing
[18:53:51] <jdlrobson>	 MaxSem: checked! LGTM
[18:55:29] <logmsgbot>	 !log maxsem@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[18:55:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:08] <RainbowSprinkles>	 MaxSem: Was that mw12$something?
[18:56:21] <RainbowSprinkles>	 (1/11 - just retry, probably transient)
[18:56:27] <MaxSem>	 RainbowSprinkles, mw1279
[18:56:38] <RainbowSprinkles>	 That one blew up on me Tuesday too
[18:56:42] <RainbowSprinkles>	 Retrying made it shut up
[18:58:55] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3333657 (Cmjohnson) a:Cmjohnson>RobH @robh added mac address to dhcpd already, verified on switch that it's...
[18:59:28] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3333675 (Cmjohnson)
[18:59:40] <logmsgbot>	 !log maxsem@tin Synchronized php-1.30.0-wmf.4/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/357846/ (duration: 00m 49s)
[18:59:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:57] <MaxSem>	 jdlrobson, ^
[18:59:59] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3290332 (Cmjohnson) a:akosiaris>RobH @robh added mac address already
[19:00:04] <jouncebot>	 RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1900).
[19:00:11] <jdlrobson>	 w00p thankkk you
[19:00:19] <MaxSem>	 jdlrobson, no time for another patch
[19:00:29] <MaxSem>	 RainbowSprinkles, all yours
[19:00:34] <jdlrobson>	 no? booo :( 
[19:00:44] <RainbowSprinkles>	 What was jdlrobsons?
[19:01:17] <RainbowSprinkles>	 Oh, that simple wordmark thing?
[19:01:22] <wikibugs>	 (PS2) Chad: Update logo and dimensions for SR wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: Jdlrobson)
[19:01:32] <wikibugs>	 (CR) Chad: [C: 2] Update logo and dimensions for SR wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: Jdlrobson)
[19:01:41] <jdlrobson>	 :-O it's christmas!
[19:01:58] <RainbowSprinkles>	 I can jfdi faster than you can add it to the next swat window ;-)
[19:02:27] <wikibugs>	 (Merged) jenkins-bot: Update logo and dimensions for SR wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: Jdlrobson)
[19:02:41] <wikibugs>	 (CR) jenkins-bot: Update logo and dimensions for SR wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: Jdlrobson)
[19:03:32] <wikibugs>	 Operations, ops-codfw, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3333690 (RobH)
[19:03:34] <wikibugs>	 Operations, ops-codfw, Labs, Labs-Infrastructure, netops: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3333688 (RobH) Resolved>Open Nevermind, I had a bad config and it didn't commit.  I need to investigate and redo the change.
[19:03:52] <logmsgbot>	 !log demon@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-sr.svg: new wordmark (duration: 00m 46s)
[19:03:54] <wikibugs>	 Operations, ops-eqiad, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3333691 (Cmjohnson)
[19:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:05] <wikibugs>	 Operations, ops-codfw, Labs, Labs-Infrastructure, netops: codfw: labtestneutron2002 sswitch port configuration - https://phabricator.wikimedia.org/T167326#3333692 (RobH) a:Papaul>RobH
[19:05:16] <logmsgbot>	 !log demon@tin Synchronized wmf-config/InitialiseSettings.php: New wordmark for mk/srwiki (duration: 00m 57s)
[19:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:59] <RainbowSprinkles>	 jdlrobson: You're live everywhere :)
[19:06:08] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333700 (RobH)
[19:06:20] <wikibugs>	 (PS1) Chad: mw.org back to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357872
[19:06:30] <wikibugs>	 Operations, ops-eqiad, Labs, Labs-Infrastructure, Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3333702 (Cmjohnson) mac addresses were added to dhcpd file, not sure if h/w raid is needed..i believe these came with a controller. Also @mark was...
[19:07:35] <jdlrobson>	 RainbowSprinkles: did you do a static cache flush?
[19:07:46] <RainbowSprinkles>	 Nope?
[19:08:09] <jdlrobson>	 https://sr.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-sr.svg?r=4 vs https://sr.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-sr.svg
[19:08:12] <jdlrobson>	 seeing different things
[19:08:40] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3333707 (Cmjohnson)
[19:09:07] <RainbowSprinkles>	 jdlrobson: Uno momento
[19:09:15] <wikibugs>	 Operations, ops-eqiad, Analytics, Analytics-Cluster, Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (Cmjohnson) a:Cmjohnson>RobH
[19:09:37] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack and setup  wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3333709 (Cmjohnson)
[19:09:48] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack and setup  wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3268863 (Cmjohnson) a:Cmjohnson>RobH
[19:09:54] <RainbowSprinkles>	 Hmmm
[19:10:00] <jdlrobson>	 RainbowSprinkles: i dont recall how to do that though..
[19:10:06] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3333711 (Cmjohnson)
[19:10:06] <Reedy>	 purgeList?
[19:10:07] <RainbowSprinkles>	 I threw it into purgeList
[19:10:17] <jdlrobson>	 im seeing new asset now
[19:10:29] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3284602 (Cmjohnson) a:Cmjohnson>RobH
[19:11:03] <RainbowSprinkles>	 jdlrobson: Ok we good then? :)
[19:11:11] <wikibugs>	 (PS1) ArielGlenn: script for retrieving raw flow revision content [dumps] (ariel) - https://gerrit.wikimedia.org/r/357873
[19:11:22] <wikibugs>	 (CR) Chad: [C: 2] mw.org back to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357872 (owner: Chad)
[19:11:31] <jdlrobson>	 RainbowSprinkles: i think so. Worst case we'll have to wait for the cache to update. If you did something, i think it worked
[19:11:43] <jdlrobson>	 thanks MaxSem and RainbowSprinkles :)
[19:12:39] <wikibugs>	 (Merged) jenkins-bot: mw.org back to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357872 (owner: Chad)
[19:12:50] <wikibugs>	 (CR) jenkins-bot: mw.org back to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357872 (owner: Chad)
[19:13:24] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: mw.org -> wmf.4
[19:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:05] <wikibugs>	 (PS1) RobH: setting install params for kubestage100[12] [puppet] - https://gerrit.wikimedia.org/r/357874
[19:16:32] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333723 (RobH)
[19:16:49] <wikibugs>	 (CR) RobH: [C: 2] setting install params for kubestage100[12] [puppet] - https://gerrit.wikimedia.org/r/357874 (owner: RobH)
[19:17:08] <wikibugs>	 (PS1) Andrew Bogott: Designate api:  Increase max query limit [puppet] - https://gerrit.wikimedia.org/r/357875
[19:19:10] <wikibugs>	 (CR) Andrew Bogott: [C: 2] Designate api:  Increase max query limit [puppet] - https://gerrit.wikimedia.org/r/357875 (owner: Andrew Bogott)
[19:19:14] <wikibugs>	 (PS2) Andrew Bogott: Designate api:  Increase max query limit [puppet] - https://gerrit.wikimedia.org/r/357875
[19:21:16] <wikibugs>	 (PS1) Chad: group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357876
[19:24:19] <wikibugs>	 (CR) Chad: [C: 2] group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357876 (owner: Chad)
[19:25:34] <wikibugs>	 (Merged) jenkins-bot: group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357876 (owner: Chad)
[19:25:49] <wikibugs>	 (CR) jenkins-bot: group1 to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357876 (owner: Chad)
[19:26:46] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.4
[19:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:34] <wikibugs>	 (PS1) RobH: Revert "setting install params for kubestage100[12]" [puppet] - https://gerrit.wikimedia.org/r/357877
[19:27:40] <wikibugs>	 (CR) RobH: [C: 2] Revert "setting install params for kubestage100[12]" [puppet] - https://gerrit.wikimedia.org/r/357877 (owner: RobH)
[19:28:48] <wikibugs>	 (PS2) RobH: Revert "setting install params for kubestage100[12]" [puppet] - https://gerrit.wikimedia.org/r/357877
[19:28:52] <wikibugs>	 (CR) RobH: [V: 2 C: 2] Revert "setting install params for kubestage100[12]" [puppet] - https://gerrit.wikimedia.org/r/357877 (owner: RobH)
[19:30:44] <wikibugs>	 (PS1) Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027)
[19:31:18] <wikibugs>	 (PS1) RobH: Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076" [puppet] - https://gerrit.wikimedia.org/r/357879
[19:31:45] <wikibugs>	 (CR) Zhuyifei1999: "https://tools.wmflabs.org/fontcdn/ is not ready." [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: Zhuyifei1999)
[19:32:41] <wikibugs>	 (CR) RobH: [C: 2] Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata100 [puppet] - https://gerrit.wikimedia.org/r/357879 (owner: RobH)
[19:32:46] <wikibugs>	 (PS2) RobH: Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076" [puppet] - https://gerrit.wikimedia.org/r/357879
[19:33:13] <wikibugs>	 (CR) Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: Zhuyifei1999)
[19:36:53] <wikibugs>	 (Abandoned) RobH: Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076" [puppet] - https://gerrit.wikimedia.org/r/357879 (owner: RobH)
[19:37:40] <wikibugs>	 (PS1) Cmjohnson: Fixing a typo in dhcpd file [puppet] - https://gerrit.wikimedia.org/r/357880
[19:38:42] <wikibugs>	 (CR) RobH: [C: 2] Fixing a typo in dhcpd file [puppet] - https://gerrit.wikimedia.org/r/357880 (owner: Cmjohnson)
[19:38:50] <wikibugs>	 (PS2) RobH: Fixing a typo in dhcpd file [puppet] - https://gerrit.wikimedia.org/r/357880 (owner: Cmjohnson)
[19:41:10] <RainbowSprinkles>	 Weird. Something's requesting commons pages in the format of https://commons.wikimedia.org/wiki/File:Map_of_USA_OR.svg?uselang=⧼Lang⧽
[19:41:16] <RainbowSprinkles>	 (image seems to be varying)
[19:41:27] <RainbowSprinkles>	 this is causing exceptions :\
[19:44:15] <greg-g>	 RainbowSprinkles: new?
[19:44:24] <RainbowSprinkles>	 Hadn't seen it before yet
[19:44:33] <RainbowSprinkles>	 But not necessarily "new"
[19:44:43] <wikibugs>	 Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333819 (Gilles) OK, I've looked at the...
[19:45:11] * greg-g nods
[19:45:46] <RainbowSprinkles>	 T167359 also seems to be still appearing, fix is incomplete?
[19:45:46] <stashbot>	 T167359: Catchable fatal error: Argument 2 passed to RevisionSliderHooks::onDiffViewHeader() must be an instance of Revision, null given  - https://phabricator.wikimedia.org/T167359
[19:45:55] <RainbowSprinkles>	 Also spotted T167461
[19:45:56] <stashbot>	 T167461: SpecialMobileDiff: Call to member function getDiffBody() on non-object - https://phabricator.wikimedia.org/T167461
[19:46:53] <RainbowSprinkles>	 Also geosearch queries are throwing PartialShardExceptions :\
[19:47:00] <RainbowSprinkles>	 (but doubt this is a wmf.4 problem)
[19:47:08] <wikibugs>	 (PS1) Eevans: Use Cassandra version that corresponds to what is being tested [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570)
[19:47:33] <wikibugs>	 Operations, ops-codfw, Labs, Labs-Infrastructure, netops: codfw: labtestneutron2002 switch port configuration  - https://phabricator.wikimedia.org/T167326#3333850 (Papaul)
[19:47:37] <wikibugs>	 (PS1) RobH: reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884
[19:47:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH)
[19:49:25] <urandom>	 mutante: are you around? any chance i could convince you to merge https://gerrit.wikimedia.org/r/357882?  it won't impact anything outside of the dev (non-prod) environment, and it'll make the puppet warnings there go away.
[19:49:38] <urandom>	 (and i need to run puppet here :))
[19:50:48] <wikibugs>	 (PS2) RobH: reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884
[19:54:11] <wikibugs>	 (CR) RobH: [C: 1] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH)
[19:54:23] <wikibugs>	 (CR) Cmjohnson: [C: 1] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH)
[19:54:30] <wikibugs>	 (CR) RobH: [C: 2] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH)
[19:55:52] <icinga-wm>	 RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational
[19:56:12] <icinga-wm>	 RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[19:57:17] <urandom>	 mutante: basically, this is the minimum it'll take to make it happy about that package version currently installed
[19:59:18] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333891 (RobH) Ok, that large patchset had some issues that borked up dhcp.  Rather than try to find the issues in the large one, we reverted it and will make small...
[20:02:32] <wikibugs>	 (PS1) RobH: setting kubestage100[12] install params [puppet] - https://gerrit.wikimedia.org/r/357890
[20:02:42] <icinga-wm>	 RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational
[20:02:43] <icinga-wm>	 RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[20:02:59] <wikibugs>	 Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333901 (RobH)
[20:04:05] <wikibugs>	 (CR) RobH: [C: 2] setting kubestage100[12] install params [puppet] - https://gerrit.wikimedia.org/r/357890 (owner: RobH)
[20:04:32] <mutante>	 urandom: won't it add confusion if then the target_version and the package_version are different versions?
[20:04:56] <mutante>	 3.7 => 3.11
[20:06:20] <wikibugs>	 Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333909 (Gilles) On second thought, let'...
[20:07:03] <wikibugs>	 (CR) Dzahn: Use Cassandra version that corresponds to what is being tested (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans)
[20:12:35] <mutante>	 !log imported gerrit_2.13.8+git1-wmf.5_amd64 on apt.wikimedia.org (T158946)
[20:12:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:37] <stashbot>	 T158946: Update gerrit to 2.13.8 - https://phabricator.wikimedia.org/T158946
[20:23:24] <RainbowSprinkles>	 !log gerrit2001: upgraded to 2.13.8+git1-wmf.5 / 2.13.8-1-g7c438d37a2
[20:23:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:17] <urandom>	 mutante: well, maybe
[20:39:38] <urandom>	 mutante: i was trying to keep the changeset minimal considering that this version may yet change
[20:39:50] <urandom>	 and it's only used in the dev environment
[20:41:39] <urandom>	 mutante: so we could change the target version, but then there are also files like templates/some_config_file-3.7.erb
[20:42:03] <urandom>	 so if you extend that need for consistency, you'd probably want to rename them as well
[20:42:45] <urandom>	 mutante: which i'm willing to do, but i was aiming for least-invasive in the interest of shopping around for someone to merge :)
[20:44:28] <urandom>	 mutante: we could change target_version to 3.x, and rename all of the files accordingly
[21:00:56] <wikibugs>	 (PS2) Eevans: Use Cassandra version that corresponds to what is being tested [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570)
[21:10:50] <wikibugs>	 (CR) Eevans: "PC output here: http://puppet-compiler.wmflabs.org/6710, it shows 'no change' where it should, though fails to compile on one of the hosts" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans)
[21:18:39] <mutante>	 urandom: sorry, i was afk. reading now, and "minimal changeset" sounds good
[21:19:40] <urandom>	 heh
[21:19:52] <mutante>	 i am looking, hold on :) we'll do this
[21:20:17] <urandom>	 mutante: i updated it; it's slightly less minimal, but should be clearer
[21:20:44] <urandom>	 and more future-proof for this testing period
[21:21:11] <mutante>	 ok, yep, i see the compiler output
[21:21:29] <mutante>	 the one it fails on, it's that for some reason the compiler just doesn't know this host name at all
[21:21:36] <mutante>	 i can tell because the link is 404
[21:21:41] <mutante>	 and not a real failure 
[21:21:42] <urandom>	 right
[21:21:48] <mutante>	 rest looks good.. doing
[21:21:54] <urandom>	 mutante: awesome
[21:22:18] <wikibugs>	 (PS3) Dzahn: Use Cassandra version that corresponds to what is being tested [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans)
[21:23:08] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: Rate limiting code and config. T161710
[21:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:19] <stashbot>	 T161710: Automate RESTBase blacklisting - https://phabricator.wikimedia.org/T161710
[21:24:54] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@56f7511]: Rate limiting code and config. T161710 (duration: 01m 46s)
[21:25:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:22] <wikibugs>	 (CR) Dzahn: [C: 2] "affects only -dev, not -prod and -test, they use target_version 2.2" [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans)
[21:26:14] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:26:18] <mutante>	 urandom: merged on master
[21:26:26] <urandom>	 mutante: great; let me give it a go
[21:26:44] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:26:45] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:26:45] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:26:45] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:27:04] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:27:14] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:27:14] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2004 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:27:16] <mutante>	 :(
[21:27:24] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2006 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:27:44] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb2003 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200)
[21:27:44] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[21:27:56] <greg-g>	 Pchelolo: revert?
[21:28:26] <Pchelolo>	 what the hell is that
[21:28:29] <Pchelolo>	 I'll revert
[21:29:03] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: dc1948f6bc7b1
[21:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:19] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@56f7511]: dc1948f6bc7b1 (duration: 00m 16s)
[21:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:54] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[21:31:02] <urandom>	 mutante: looks good; thank you!
[21:31:05] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: dc1948f6bc7b1 Revert previous deploy
[21:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:28] <mutante>	 urandom:  < icinga-wm> RECOVERY - puppet last run on restbase-dev1002 is OK: :)    it just got drowned in the other unrelated alerts :o
[21:31:35] <urandom>	 yeah
[21:32:07] <mutante>	 ok, cool, that got me for the first couple seconds
[21:32:44] <mutante>	 Pchelolo: anything needed from root for that?
[21:33:05] <Pchelolo>	 mutante: no, no
[21:33:13] <mutante>	 alright
[21:33:54] <icinga-wm>	 RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[21:34:03] <logmsgbot>	 !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: Revert previous deploy
[21:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:11] <logmsgbot>	 !log ppchelko@tin Finished deploy [changeprop/deploy@56f7511]: Revert previous deploy (duration: 01m 07s)
[21:35:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:01] <wikibugs>	 (PS3) Dzahn: DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - https://gerrit.wikimedia.org/r/357869 (owner: Papaul)
[21:42:33] <urandom>	 !log T160570: Rolling Cassandra restart, restbase-dev
[21:42:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:45] <stashbot>	 T160570: Cassandra 3.x Tracking - https://phabricator.wikimedia.org/T160570
[21:48:27] <wikibugs>	 (CR) Dzahn: [C: 2] DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - https://gerrit.wikimedia.org/r/357869 (owner: Papaul)
[21:50:09] <logmsgbot>	 !log mobrovac@tin Started deploy [changeprop/deploy@56f7511]: (no justification provided)
[21:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:43] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@56f7511]: (no justification provided) (duration: 00m 34s)
[21:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:31] <logmsgbot>	 !log mobrovac@tin Started deploy [changeprop/deploy@56f7511]: (no justification provided)
[21:52:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:03] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@56f7511]: (no justification provided) (duration: 01m 32s)
[21:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:46] <logmsgbot>	 !log mobrovac@tin Started deploy [changeprop/deploy@dc1948f]: (no justification provided)
[21:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:55:20] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2001 is OK: All endpoints are healthy
[21:56:24] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@dc1948f]: (no justification provided) (duration: 01m 39s)
[21:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:49] <logmsgbot>	 !log mobrovac@tin Started deploy [changeprop/deploy@836b070]: Rate limiting, attempt #2
[22:13:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:13] <logmsgbot>	 !log mobrovac@tin Finished deploy [changeprop/deploy@836b070]: Rate limiting, attempt #2 (duration: 01m 23s)
[22:15:20] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2004 is OK: All endpoints are healthy
[22:15:20] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy
[22:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:30] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2006 is OK: All endpoints are healthy
[22:15:50] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2003 is OK: All endpoints are healthy
[22:15:51] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb2005 is OK: All endpoints are healthy
[22:16:00] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy
[22:16:00] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[22:16:00] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy
[22:16:10] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy
[22:17:04] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/MobileFrontend/includes/specials/SpecialMobileDiff.php: (no justification provided) (duration: 00m 44s)
[22:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:20] <wikibugs>	 (CR) Dzahn: [C: 2] "no-op http://puppet-compiler.wmflabs.org/6711/" [puppet] - https://gerrit.wikimedia.org/r/355871 (owner: Dzahn)
[22:17:26] <wikibugs>	 (PS2) Dzahn: phabricator: move hiera lookups to parameters [puppet] - https://gerrit.wikimedia.org/r/355871
[22:29:04] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/RevisionSlider/src/RevisionSliderHooks.php: Livehack/test (duration: 00m 44s)
[22:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:33] <wikibugs>	 (PS1) RobH: adding in dumpsdata00[12] install params [puppet] - https://gerrit.wikimedia.org/r/357949
[22:45:17] <wikibugs>	 (CR) RobH: [C: 2] adding in dumpsdata00[12] install params [puppet] - https://gerrit.wikimedia.org/r/357949 (owner: RobH)
[22:46:19] <SMalyshev>	 is wmf.4 going to be deployed on group2?
[22:48:10] <greg-g>	 RainbowSprinkles: ^
[22:48:35] <wikibugs>	 (PS1) Ladsgroup: Change Persian Wikis from uca-fa to xx-uca-fa [mediawiki-config] - https://gerrit.wikimedia.org/r/357951 (https://phabricator.wikimedia.org/T139110)
[22:48:52] <SMalyshev>	 I'm wondering because of T167473 - should I backport fix for .4 or .2?
[22:48:53] <stashbot>	 T167473: DeleteArchive: Object does not implement ArrayAccess - https://phabricator.wikimedia.org/T167473
[22:53:09] <ebernhardson>	 SMalyshev: both are in prod right now, so i suppose both?
[22:53:19] <ebernhardson>	 according to noc.wikimedia.org
[22:53:30] <SMalyshev>	 ebernhardson: that's why I ask, wmf.2 should be theoretically gone now
[22:53:34] <paravoid>	 madhuvishy: any news regarding those labstore changes?
[22:53:41] <SMalyshev>	 but looks like it's not
[22:54:03] <RainbowSprinkles>	 Yes, shortly :)
[22:54:10] <SMalyshev>	 ah, ok then :)
[22:54:22] <SMalyshev>	 I can wait :)
[22:55:25] <madhuvishy>	 paravoid: ah yes I saw your patch, i can merge now
[22:55:44] <wikibugs>	 (CR) Madhuvishy: [C: 2] labstore: remove TC=$(which tc) [puppet] - https://gerrit.wikimedia.org/r/356107 (owner: Faidon Liambotis)
[22:55:50] <wikibugs>	 (PS4) Madhuvishy: labstore: remove TC=$(which tc) [puppet] - https://gerrit.wikimedia.org/r/356107 (owner: Faidon Liambotis)
[22:56:14] <wikibugs>	 (CR) Madhuvishy: [V: 2 C: 2] labstore: remove TC=$(which tc) [puppet] - https://gerrit.wikimedia.org/r/356107 (owner: Faidon Liambotis)
[22:56:19] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/RevisionSlider/src/RevisionSliderHooks.php: Re-syncing with permanent committed fix (duration: 00m 44s)
[22:56:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:37] <RainbowSprinkles>	 SMalyshev: Go ahead and merge your wmf.4 patch, I'll sync it out while I'm on a roll
[22:56:55] <ebernhardson>	 RainbowSprinkles: i could throw in a GeoData fix while you're at it too ;)
[22:57:02] <RainbowSprinkles>	 Oooooh
[22:57:03] <RainbowSprinkles>	 <3
[22:57:11] <RainbowSprinkles>	 The partialshardexception?
[22:57:12] <RainbowSprinkles>	 :)
[22:57:38] <ebernhardson>	 yes, elasticsearch broke their promise about mixed clusters
[22:57:58] <wikibugs>	 (CR) Faidon Liambotis: [C: -2] "No, don't use a shell script, that's definitely the wrong way to do this. A systemd unit that manages the process is ideal. Running kill f" [puppet] - https://gerrit.wikimedia.org/r/356516 (owner: Dzahn)
[22:57:59] <ebernhardson>	 mixed clusters (5.1.2 and 5.3.2 in the same cluster) are supposed to work flawlessly, but they changed an enum
[22:58:20] <ebernhardson>	 so that enum, emitted by 5.1.2 and read by 5.3.2, is interpreted differently :S
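One way a mismatch like the one ebernhardson describes can arise is when an enum is serialized by its ordinal position and a later version changes the member order. The sketch below is a hypothetical illustration only: the enum names, members, and helper functions are invented, and it does not reproduce Elasticsearch's actual wire format or the specific enum involved.

```python
from enum import IntEnum


# Hypothetical enums, invented for illustration (not Elasticsearch's).
class UnitOld(IntEnum):
    """Enum as known to the older node."""
    METERS = 0
    KILOMETERS = 1


class UnitNew(IntEnum):
    """Same enum on the newer node, with a member inserted in the middle."""
    METERS = 0
    MILES = 1
    KILOMETERS = 2


def encode_by_ordinal(value):
    # Fragile: the wire value depends on declaration order.
    return int(value)


def decode_by_ordinal(raw, enum_cls):
    return enum_cls(raw)


def encode_by_name(value):
    # More robust: the wire value is the stable member name.
    return value.name


def decode_by_name(raw, enum_cls):
    return enum_cls[raw]


sent = UnitOld.KILOMETERS

# Old node writes an ordinal, new node reads it with a shifted layout.
misread = decode_by_ordinal(encode_by_ordinal(sent), UnitNew)

# Same round trip by name survives the reordering.
safe = decode_by_name(encode_by_name(sent), UnitNew)

print(sent.name, "->", misread.name, "(by ordinal)")
print(sent.name, "->", safe.name, "(by name)")
```

In a mixed-version cluster, both layouts are live at once, which is why pointing a client at a uniform cluster sidesteps the problem without touching the serialization.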
[22:59:42] <RainbowSprinkles>	 Ouch. At least it's fixable on the Cirrus side :)
[23:00:04] <paravoid>	 madhuvishy: thanks! :)
[23:00:04] <jouncebot>	 addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T2300). Please do the needful.
[23:00:05] <jouncebot>	 Jdlrobson and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:07] <paravoid>	 madhuvishy: the rest too?
[23:00:13] <ebernhardson>	 well, i'm just pointing geodata at a cluster that's all uniform :)
[23:00:17] <RainbowSprinkles>	 AH
[23:00:19] <RainbowSprinkles>	 That works too
[23:02:24] <Amir1>	 I added a patch too last minute
[23:02:26] <madhuvishy>	 paravoid: yes doing, rolling out the patch on tools first to make sure
[23:02:30] <paravoid>	 ok :)
[23:03:59] <RainbowSprinkles>	 Amir1: I'm saving that for last, it might not make it today. I'm trying to wrap up a few production fixes we've got for wmf.4 first, so we can finish the train for the week
[23:04:16] <RainbowSprinkles>	 Well, I think the collation swap needs a maintenance script run? If not, I could knock it out pretty quick
[23:04:24] <SMalyshev>	 RainbowSprinkles: so https://gerrit.wikimedia.org/r/#/c/357953/ is the cherry-pick, the master one is merged
[23:04:36] <Amir1>	 RainbowSprinkles: it needs the maintenance script 
[23:04:51] <RainbowSprinkles>	 Yeah, gonna have to wait until last thing. I can't babysit the script right now
[23:05:02] <Amir1>	 yeah sure
[23:06:48] <wikibugs>	 (PS1) Chad: Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955
[23:07:36] <wikibugs>	 (PS4) Madhuvishy: labstore: use the interface_primary fact, not eth0 [puppet] - https://gerrit.wikimedia.org/r/356108 (owner: Faidon Liambotis)
[23:10:27] <wikibugs>	 (PS1) RobH: updating recipe for 80% of lvm [puppet] - https://gerrit.wikimedia.org/r/357956
[23:11:41] <wikibugs>	 (PS2) RobH: updating recipe for 80% of lvm [puppet] - https://gerrit.wikimedia.org/r/357956
[23:13:03] <wikibugs>	 (CR) RobH: [C: 2] updating recipe for 80% of lvm [puppet] - https://gerrit.wikimedia.org/r/357956 (owner: RobH)
[23:14:12] <wikibugs>	 (PS7) Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - https://gerrit.wikimedia.org/r/354078
[23:14:55] <wikibugs>	 (CR) Dzahn: gerrit: let Apache proxy only listen on service IP (2 comments) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[23:15:01] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/GeoData/includes/Searcher.php: Temp hax to point GeoData at codfw DC (duration: 00m 43s)
[23:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:14] <RainbowSprinkles>	 ebernhardson: Your patch is live everywhere
[23:15:43] <wikibugs>	 (Abandoned) Dzahn: lists: lower TTL for service IP change [dns] - https://gerrit.wikimedia.org/r/354064 (owner: Dzahn)
[23:16:04] <ebernhardson>	 RainbowSprinkles: sweet, we'll know it's working because it doesn't start emitting 100 errors/minute when wmf.4 goes out (but this is also already on wmf.2, so should be fine) :)
[23:16:05] <wikibugs>	 (Abandoned) Dzahn: lists: switch v6 service IP [dns] - https://gerrit.wikimedia.org/r/354071 (owner: Dzahn)
[23:16:32] <logmsgbot>	 !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/CirrusSearch/includes/Job/DeleteArchive.php: Fix array access bug (duration: 00m 43s)
[23:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:47] <RainbowSprinkles>	 SMalyshev: You're live now too
[23:17:57] <SMalyshev>	 RainbowSprinkles: cool, thanks!
[23:18:04] <wikibugs>	 (PS5) Madhuvishy: labstore: avoid the hardcoding of eth0/eth1 [puppet] - https://gerrit.wikimedia.org/r/356109 (owner: Faidon Liambotis)
[23:19:05] <madhuvishy>	 paravoid: gerrit wants me to manually rebase https://gerrit.wikimedia.org/r/#/c/356108 on
[23:19:52] <madhuvishy>	 so doing that
[23:21:08] <wikibugs>	 (PS5) Faidon Liambotis: labstore: use the interface_primary fact, not eth0 [puppet] - https://gerrit.wikimedia.org/r/356108
[23:21:10] <wikibugs>	 (PS6) Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - https://gerrit.wikimedia.org/r/356109
[23:21:12] <wikibugs>	 (PS3) Faidon Liambotis: labstore: use /sbin/tc, not $PATH/tc [puppet] - https://gerrit.wikimedia.org/r/357597
[23:21:13] <paravoid>	 madhuvishy: ^
[23:21:40] <wikibugs>	 (Abandoned) Dzahn: fix lists/fermium: switch v6 service IP [puppet] - https://gerrit.wikimedia.org/r/354055 (owner: Dzahn)
[23:21:51] <madhuvishy>	 paravoid: /\ thank you
[23:21:57] <wikibugs>	 (CR) Madhuvishy: [V: 2 C: 2] labstore: use the interface_primary fact, not eth0 [puppet] - https://gerrit.wikimedia.org/r/356108 (owner: Faidon Liambotis)
[23:23:07] <paravoid>	 actually
[23:23:12] <wikibugs>	 (PS4) Faidon Liambotis: labstore: use /sbin/tc, not $PATH/tc [puppet] - https://gerrit.wikimedia.org/r/357597
[23:23:14] <wikibugs>	 (PS7) Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - https://gerrit.wikimedia.org/r/356109
[23:23:19] <paravoid>	 let me swap the order of those two, to get the tc fix earlier
[23:23:23] <paravoid>	 done :)
[23:28:40] <wikibugs>	 (CR) Faidon Liambotis: raid: remove unused aac, twe, zfs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357819 (owner: Faidon Liambotis)
[23:29:06] <madhuvishy>	 paravoid: do you know why https://gerrit.wikimedia.org/r/#/c/356109 says Submit Including parents?
[23:29:17] <paravoid>	 madhuvishy: because of ^^
[23:29:25] <madhuvishy>	 aah
[23:29:27] <paravoid>	 I swapped the order, the /sbin/tc thing is first now
[23:29:32] <madhuvishy>	 ah ah
[23:29:33] <madhuvishy>	 okay
[23:29:34] <paravoid>	 but I can swap it again if you want
[23:29:48] <paravoid>	 you can do that from gerrit too with a rebase -> change parent revision
[23:29:58] <wikibugs>	 (CR) Madhuvishy: [C: 2] labstore: use /sbin/tc, not $PATH/tc [puppet] - https://gerrit.wikimedia.org/r/357597 (owner: Faidon Liambotis)
[23:31:49] <madhuvishy>	 paravoid: nah it's good, i was confused because it wouldn't let me rebase the /sbin/tc patch, nor submit it - but i realized it was because that was missing CR +2
[23:34:24] <wikibugs>	 (CR) Chad: [C: 2] Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955 (owner: Chad)
[23:36:37] <wikibugs>	 (PS2) Dzahn: fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172
[23:36:51] <wikibugs>	 (Merged) jenkins-bot: Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955 (owner: Chad)
[23:36:59] <wikibugs>	 (CR) jenkins-bot: Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955 (owner: Chad)
[23:37:56] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: remaining wikis to wmf.4
[23:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:27] <madhuvishy>	 paravoid: hmmm
[23:40:30] <madhuvishy>	 https://www.irccloud.com/pastebin/HOhF79da/
[23:42:04] <paravoid>	 oh you didn't PCC it after all?
[23:42:10] <icinga-wm>	 PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:42:39] <paravoid>	 oh right, I see the issue
[23:42:41] <madhuvishy>	 ugh
[23:42:55] <madhuvishy>	 i did put it through the compiler
[23:43:05] <madhuvishy>	 interface::manual {'data'
[23:43:10] <paravoid>	 yeah
[23:43:33] <paravoid>	 will you fix or should I?
[23:43:38] <madhuvishy>	 i can fix
[23:43:41] <paravoid>	 :)
[23:46:08] <RainbowSprinkles>	 Can someone kick HHVM on mw1275? It keeps spewing "LightProcess exiting" crud about every minute or so
[23:48:08] <wikibugs>	 (PS1) Madhuvishy: labstore: Fix data interface require clause [puppet] - https://gerrit.wikimedia.org/r/357958
[23:48:24] <mutante>	 RainbowSprinkles: did it
[23:48:46] <RainbowSprinkles>	 thx
[23:48:46] <madhuvishy>	 paravoid: ^^
[23:48:51] <mutante>	 !log mw1275 - restarted hhvm (php: Lost parent, LightProcess exiting in syslog)
[23:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:55] <wikibugs>	 (CR) Faidon Liambotis: [C: 2] labstore: Fix data interface require clause [puppet] - https://gerrit.wikimedia.org/r/357958 (owner: Madhuvishy)
[23:53:10] <icinga-wm>	 RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[23:53:57] <mutante>	 RainbowSprinkles: it kept doing it. but now it recovered (also apache service)
[23:54:26] <madhuvishy>	 paravoid: all done :) thanks for all the patches!
[23:54:37] <paravoid>	 thanks for the merges :)
[23:54:46] <paravoid>	 and more importantly, code review!
[23:55:06] <madhuvishy>	 :) yw
[23:55:16] <paravoid>	 off to bed now, bye!
[23:55:20] <madhuvishy>	 gnite
[23:55:42] <RainbowSprinkles>	 mutante: Ok awesome, glad it's healthier now
[23:56:56] <mutante>	 RainbowSprinkles: i think this is https://phabricator.wikimedia.org/T124956
[23:58:05] <RainbowSprinkles>	 Sorta yeah
[23:58:12] <mutante>	 and separately service(s) were crashed
[23:58:25] <mutante>	 which was fixed, but that log line is still there. but it's like they were 2 things
[23:58:55] <RainbowSprinkles>	 I wonder if mw1275 is just being bleh