[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T0000). [00:11:07] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3310890 (10thcipriani) >>! In T166888#3322963, @faidon wrote: > These are all conjectures of mine, just by looking at the log, so correct me when... [00:17:59] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 60796.64 seconds [00:18:00] PROBLEM - Check systemd state on cp2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:01] PROBLEM - salt-minion processes on cp2003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:18:08] PROBLEM - Check systemd state on cp2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:08] PROBLEM - salt-minion processes on cp4012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:18:08] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:08] PROBLEM - Host cp1059 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:08] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:18] PROBLEM - salt-minion processes on cp2021 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:18:18] PROBLEM - Check systemd state on cp2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:18] PROBLEM - Check systemd state on cp3004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[00:18:18] PROBLEM - Check systemd state on cp1060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:19] PROBLEM - salt-minion processes on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:18:19] PROBLEM - salt-minion processes on cp3005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:18:19] PROBLEM - salt-minion processes on cp3004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:18:20] PROBLEM - Check systemd state on cp3005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:18:20] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [00:18:22] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [00:19:14] uhm.. what's happening with the salt minions [00:20:11] or did Icinga just forget downtimes again.. i guess that's it [00:20:13] looking [00:21:52] The Salt Master has rejected this minion's public key! 
[00:24:19] RECOVERY - salt-minion processes on cp3005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:25:05] !log salt-master: deleted salt-key for cp3005, stopped started minion cp3005 - key got accepted again (was: Salt Master has rejected this minion's public key) [00:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:18] RECOVERY - salt-minion processes on cp2021 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:31:21] !log cp2012 - fixed salt key issue as for cp3005 (delete key, stop/start minion, accept new key) [00:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:09] !log cp4020 - powercycling (host down, console sat at initramfs) [00:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:38] RECOVERY - Host cp4020 is UP: PING OK - Packet loss = 0%, RTA = 74.94 ms [00:36:38] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:39:18] PROBLEM - Check size of conntrack table on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:19] PROBLEM - Check the NTP synchronisation status of timesyncd on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:38] PROBLEM - Check systemd state on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:38] PROBLEM - puppet last run on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:38] PROBLEM - DPKG on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:38] PROBLEM - configured eth on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:38] PROBLEM - MD RAID on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:38] PROBLEM - dhclient process on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:39:39] PROBLEM - salt-minion processes on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:39] PROBLEM - Disk space on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:39:40] PROBLEM - Check whether ferm is active by checking the default input chain on cp4020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:40:38] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 74.00 ms [00:41:08] RECOVERY - Check size of conntrack table on cp4020 is OK: OK: nf_conntrack is 0 % full [00:41:28] RECOVERY - Disk space on cp4020 is OK: DISK OK [00:41:28] RECOVERY - Check whether ferm is active by checking the default input chain on cp4020 is OK: OK ferm input default policy is set [00:41:28] RECOVERY - DPKG on cp4020 is OK: All packages OK [00:41:28] RECOVERY - configured eth on cp4020 is OK: OK - interfaces up [00:41:28] RECOVERY - dhclient process on cp4020 is OK: PROCS OK: 0 processes with command name dhclient [00:41:29] RECOVERY - MD RAID on cp4020 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:41:58] !log cp4011 - like cp4010 - powercycling (host down, console sat at initramfs). it had the "did not detect disk by uid" issue but boots normally after powercycle [00:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:28] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [00:43:38] PROBLEM - dhclient process on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:38] PROBLEM - Check size of conntrack table on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:38] PROBLEM - Check whether ferm is active by checking the default input chain on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:43:38] PROBLEM - Disk space on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:38] PROBLEM - Check systemd state on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:39] PROBLEM - MD RAID on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:39] PROBLEM - configured eth on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:40] PROBLEM - DPKG on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:40] PROBLEM - salt-minion processes on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:41] PROBLEM - puppet last run on cp4011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:44:08] ok, these are failed reinstalls [00:44:17] from "ex-cache_maps, to be decommed" [00:44:31] SAL "reimaging ex-cache_maps hosts" [00:45:07] so not critical and i'll make it recover anyways [00:45:19] they are role spare now [00:45:28] RECOVERY - Disk space on cp4011 is OK: DISK OK [00:45:29] RECOVERY - Check whether ferm is active by checking the default input chain on cp4011 is OK: OK ferm input default policy is set [00:45:29] RECOVERY - Check size of conntrack table on cp4011 is OK: OK: nf_conntrack is 0 % full [00:45:29] RECOVERY - dhclient process on cp4011 is OK: PROCS OK: 0 processes with command name dhclient [00:45:29] RECOVERY - Check systemd state on cp4011 is OK: OK - running: The system is fully operational [00:45:29] RECOVERY - configured eth on cp4011 is OK: OK - interfaces up [00:45:29] RECOVERY - salt-minion processes on cp4011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:45:30] RECOVERY - DPKG on cp4011 is OK: All packages OK [00:45:30] RECOVERY - MD RAID on cp4011 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [00:46:28] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. 
Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [00:47:28] !log cp1059 - same thing - powercycle after failed boot after reimaging script [00:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:38] RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 0%, RTA = 39.37 ms [00:51:08] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [00:52:48] ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T150256#3323647 [00:53:08] PROBLEM - Check whether ferm is active by checking the default input chain on cp1059 is CRITICAL: Return code of 255 is out of bounds [00:53:29] PROBLEM - Check systemd state on cp1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:53:29] PROBLEM - salt-minion processes on cp1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [00:54:09] RECOVERY - Check whether ferm is active by checking the default input chain on cp1059 is OK: OK ferm input default policy is set [00:54:09] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:54:30] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 53 seconds ago with 2 failures. 
Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [00:54:39] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 73.83 ms [00:54:41] !log cp4019 - powercycled (same as others) | lvs1007 - sits at installer - waiting for IP to be configured (T150256) [00:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:54:51] T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006 - https://phabricator.wikimedia.org/T150256 [00:57:25] (03PS5) 10Jforrester: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 [00:57:39] PROBLEM - Check whether ferm is active by checking the default input chain on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:40] PROBLEM - Check size of conntrack table on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:40] PROBLEM - Check systemd state on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:40] PROBLEM - MD RAID on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:40] PROBLEM - DPKG on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:40] PROBLEM - Disk space on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:40] PROBLEM - salt-minion processes on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:41] PROBLEM - configured eth on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:41] PROBLEM - dhclient process on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:57:42] PROBLEM - puppet last run on cp4019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:59:30] RECOVERY - Check whether ferm is active by checking the default input chain on cp4019 is OK: OK ferm input default policy is set [00:59:30] RECOVERY - Check size of conntrack table on cp4019 is OK: OK: nf_conntrack is 0 % full [00:59:30] RECOVERY - configured eth on cp4019 is OK: OK - interfaces up [00:59:30] RECOVERY - Disk space on cp4019 is OK: DISK OK [00:59:30] RECOVERY - dhclient process on cp4019 is OK: PROCS OK: 0 processes with command name dhclient [00:59:30] RECOVERY - DPKG on cp4019 is OK: All packages OK [00:59:30] RECOVERY - Check systemd state on cp4019 is OK: OK - running: The system is fully operational [00:59:31] RECOVERY - MD RAID on cp4019 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [01:00:30] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [01:00:30] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 58 seconds ago with 2 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File_line[login.defs-SYS_UID_MAX] [01:01:54] (03CR) 10Jforrester: [C: 031] Remove semanticness from another place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: 10MaxSem) [01:02:19] RECOVERY - salt-minion processes on cp4012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:02:21] RECOVERY - salt-minion processes on cp1060 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:02:22] (03CR) 10Jforrester: [C: 031] phpunit: replace deprecated strict=true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356349 (owner: 10Hashar) [01:05:39] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [01:06:09] RECOVERY - salt-minion processes on cp2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion 
[01:09:09] RECOVERY - Check the NTP synchronisation status of timesyncd on cp4020 is OK: OK: synced at Thu 2017-06-08 01:09:05 UTC. [01:10:39] RECOVERY - salt-minion processes on cp4019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:11:19] !Log cp1060, cp2003, cp4012, cp4019, cp4020 - delete old salt-keys, accept new salt-keys, restart minion [01:11:29] RECOVERY - salt-minion processes on cp4020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:11:39] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:20:39] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [01:22:09] RECOVERY - Check systemd state on cp2015 is OK: OK - running: The system is fully operational [01:35:21] 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3331054 (10Dzahn) [01:36:22] 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3331043 (10Dzahn) >'wikiadmin'@'10.64.32.13' (using password: YES) (208.80.153.14) 10.64.32.13 = terbium 208.80.153.14 = **labtestweb2001.wikimedia.org** ^ lab... [01:37:18] !log maxsem@tin Synchronized php-1.30.0-wmf.2/extensions/GeoData/includes/Searcher.php: Livehack to stop exceptions (duration: 00m 46s) [01:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:25] !log manually running mediawiki maintenance job "echo_mail_batch" (on terbium as www-data, just like cron). 
did _NOT_ get denied by DB (T167373) [01:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:45:34] T167373: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373 [01:50:22] 10Operations: terbium maintenance cron "processEchoEmailBatch.php" is getting "access denied" from database - https://phabricator.wikimedia.org/T167373#3331062 (10Dzahn) ^ can't reproduce it when manually running it (as the same user from the same host), but ... the emails have been arriving (almost) every day s... [02:21:11] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [02:24:11] PROBLEM - Check systemd state on elastic2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:34:56] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 08m 41s) [02:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:47] !log deploying hotfix for T166958 [02:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:57] T166958: Unhandled exception on viewing T14974 - https://phabricator.wikimedia.org/T166958 [02:50:02] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.4) (duration: 05m 07s) [02:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:05] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3331087 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp3006.esams.wmnet', 'cp1046.eqiad.wmnet', 'cp1047.e... 
[02:55:21] PROBLEM - salt-minion processes on cp1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:55:21] PROBLEM - salt-minion processes on cp1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [02:56:27] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jun 8 02:56:27 UTC 2017 (duration 6m 26s) [02:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:57:01] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100% [03:01:01] PROBLEM - puppet last run on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:01:31] RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 120.11 ms [03:02:11] PROBLEM - puppet last run on cp1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:02:11] PROBLEM - puppet last run on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:02:41] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [03:03:41] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [03:03:47] ignore all of this, apparently wmf-auto-reimage doesn't downtime when other steps fail along the way :P [03:03:51] PROBLEM - Check whether ferm is active by checking the default input chain on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:51] PROBLEM - DPKG on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:51] PROBLEM - Disk space on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:51] PROBLEM - salt-minion processes on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:51] PROBLEM - dhclient process on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:52] PROBLEM - Check size of conntrack table on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:53] PROBLEM - puppet last run on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:53] PROBLEM - Check systemd state on 
cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:53] PROBLEM - configured eth on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:54] PROBLEM - MD RAID on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:03:54] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [03:05:11] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [03:06:11] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [03:06:41] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 36.40 ms [03:07:01] PROBLEM - IPMI Temperature on cp3006 is CRITICAL: Return code of 255 is out of bounds [03:07:21] PROBLEM - configured eth on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:07:21] PROBLEM - dhclient process on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:07:22] PROBLEM - DPKG on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:07:31] PROBLEM - Disk space on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:07:41] PROBLEM - Check systemd state on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:07:51] PROBLEM - Check whether ferm is active by checking the default input chain on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:08:01] PROBLEM - MD RAID on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:08:11] PROBLEM - Check size of conntrack table on cp1047 is CRITICAL: Return code of 255 is out of bounds [03:08:11] PROBLEM - dhclient process on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:08:21] PROBLEM - Check whether ferm is active by checking the default input chain on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:08:31] PROBLEM - Check size of conntrack table on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:08:31] PROBLEM - DPKG on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:08:31] PROBLEM - salt-minion processes on cp2009 is CRITICAL: Return code of 255 is 
out of bounds [03:08:41] PROBLEM - MD RAID on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:08:41] PROBLEM - configured eth on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:08:52] PROBLEM - Check systemd state on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:08:52] PROBLEM - dhclient process on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:01] PROBLEM - Disk space on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:09:01] PROBLEM - Check systemd state on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:09:02] PROBLEM - MD RAID on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:21] PROBLEM - Check whether ferm is active by checking the default input chain on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:21] PROBLEM - configured eth on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:21] PROBLEM - DPKG on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:31] PROBLEM - Check size of conntrack table on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:31] PROBLEM - Disk space on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:32] PROBLEM - Check the NTP synchronisation status of timesyncd on cp1046 is CRITICAL: Return code of 255 is out of bounds [03:09:41] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.075 second response time [03:09:41] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time [03:10:01] PROBLEM - HHVM rendering on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time [03:10:04] mw1261 isn't part of my spam [03:10:41] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.118 second response time [03:10:42] RECOVERY - Nginx local 
proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.185 second response time [03:11:01] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 74864 bytes in 0.234 second response time [03:12:41] PROBLEM - Check the NTP synchronisation status of timesyncd on cp2009 is CRITICAL: Return code of 255 is out of bounds [03:16:41] PROBLEM - Nginx local proxy to apache on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.153 second response time [03:16:51] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [03:17:11] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100% [03:17:31] RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 119.88 ms [03:17:41] RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.186 second response time [03:18:21] RECOVERY - Host cp1047 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [03:18:51] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [03:21:11] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [03:22:51] RECOVERY - Check whether ferm is active by checking the default input chain on cp3006 is OK: OK ferm input default policy is set [03:22:51] RECOVERY - dhclient process on cp3006 is OK: PROCS OK: 0 processes with command name dhclient [03:22:51] RECOVERY - Check size of conntrack table on cp3006 is OK: OK: nf_conntrack is 0 % full [03:22:51] RECOVERY - Disk space on cp3006 is OK: DISK OK [03:22:51] RECOVERY - salt-minion processes on cp3006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:22:52] RECOVERY - DPKG on cp3006 is OK: All packages OK [03:22:52] RECOVERY - configured eth on cp3006 is OK: OK - interfaces up [03:22:53] RECOVERY - MD RAID on cp3006 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [03:24:21] RECOVERY - Check whether ferm is active by checking the 
default input chain on cp1046 is OK: OK ferm input default policy is set [03:24:21] RECOVERY - DPKG on cp1046 is OK: All packages OK [03:24:21] RECOVERY - configured eth on cp1046 is OK: OK - interfaces up [03:24:22] RECOVERY - salt-minion processes on cp1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:24:32] RECOVERY - Check size of conntrack table on cp1046 is OK: OK: nf_conntrack is 0 % full [03:24:32] RECOVERY - Disk space on cp1046 is OK: DISK OK [03:25:01] RECOVERY - dhclient process on cp1046 is OK: PROCS OK: 0 processes with command name dhclient [03:25:01] RECOVERY - Check systemd state on cp1046 is OK: OK - running: The system is fully operational [03:25:02] RECOVERY - MD RAID on cp1046 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [03:25:02] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [03:25:21] RECOVERY - salt-minion processes on cp1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:25:22] RECOVERY - DPKG on cp1047 is OK: All packages OK [03:25:22] RECOVERY - dhclient process on cp1047 is OK: PROCS OK: 0 processes with command name dhclient [03:25:22] RECOVERY - configured eth on cp1047 is OK: OK - interfaces up [03:25:31] RECOVERY - Disk space on cp1047 is OK: DISK OK [03:25:41] RECOVERY - Check systemd state on cp1047 is OK: OK - running: The system is fully operational [03:25:51] RECOVERY - Check whether ferm is active by checking the default input chain on cp1047 is OK: OK ferm input default policy is set [03:26:01] RECOVERY - Disk space on cp2009 is OK: DISK OK [03:26:01] RECOVERY - MD RAID on cp1047 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [03:26:02] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [03:26:02] RECOVERY - Check systemd state on cp2009 is OK: OK - running: The system is fully operational [03:26:11] 
RECOVERY - Check size of conntrack table on cp1047 is OK: OK: nf_conntrack is 0 % full [03:26:11] RECOVERY - dhclient process on cp2009 is OK: PROCS OK: 0 processes with command name dhclient [03:26:11] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [03:26:21] RECOVERY - Check whether ferm is active by checking the default input chain on cp2009 is OK: OK ferm input default policy is set [03:26:31] RECOVERY - Check size of conntrack table on cp2009 is OK: OK: nf_conntrack is 0 % full [03:26:31] RECOVERY - DPKG on cp2009 is OK: All packages OK [03:26:41] RECOVERY - MD RAID on cp2009 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [03:26:41] RECOVERY - configured eth on cp2009 is OK: OK - interfaces up [03:30:31] RECOVERY - salt-minion processes on cp2009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:33:01] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [03:37:01] RECOVERY - IPMI Temperature on cp3006 is OK: Sensor Type(s) Temperature Status: OK [03:37:41] RECOVERY - salt-minion processes on cp1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:38:51] PROBLEM - HHVM rendering on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [03:38:52] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.074 second response time [03:39:01] RECOVERY - salt-minion processes on cp3004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [03:39:31] RECOVERY - Check the NTP synchronisation status of timesyncd on cp1046 is OK: OK: synced at Thu 2017-06-08 03:39:30 UTC. 
[03:39:51] RECOVERY - HHVM rendering on mw1200 is OK: HTTP OK: HTTP/1.1 200 OK - 74866 bytes in 1.121 second response time [03:39:52] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.102 second response time [03:42:01] RECOVERY - Check systemd state on cp3004 is OK: OK - running: The system is fully operational [03:42:41] RECOVERY - Check the NTP synchronisation status of timesyncd on cp2009 is OK: OK: synced at Thu 2017-06-08 03:42:33 UTC. [03:43:01] RECOVERY - Check systemd state on cp3006 is OK: OK - running: The system is fully operational [03:44:01] RECOVERY - Check systemd state on cp3005 is OK: OK - running: The system is fully operational [03:44:41] RECOVERY - Check systemd state on cp1059 is OK: OK - running: The system is fully operational [03:44:51] RECOVERY - Check systemd state on cp4020 is OK: OK - running: The system is fully operational [03:45:31] RECOVERY - Check systemd state on cp1060 is OK: OK - running: The system is fully operational [03:46:21] RECOVERY - Check systemd state on cp2003 is OK: OK - running: The system is fully operational [03:47:31] RECOVERY - Check systemd state on cp2021 is OK: OK - running: The system is fully operational [03:51:46] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_maps into cache_upload functionally - https://phabricator.wikimedia.org/T164608#3331125 (10BBlack) 05Open>03Resolved a:03BBlack [03:55:36] 10Operations, 10ops-esams, 10hardware-requests: Decommission cp300[3456] - https://phabricator.wikimedia.org/T167376#3331127 (10BBlack) [03:56:55] 10Operations, 10ops-ulsfo, 10hardware-requests: Decommission cp4011, cp4012, cp4019, cp4020 - https://phabricator.wikimedia.org/T167377#3331138 (10BBlack) [03:58:26] 10Operations, 10ops-esams, 10Traffic: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3331150 (10BBlack) 05Open>03declined cp3003 is decomming for good in T167376 [04:57:06] 10Operations, 10Labs: virbr0 interface present in 
some virt hosts - https://phabricator.wikimedia.org/T83732#917870 (10bd808) Is this still an issue that needs to be fixed? [04:59:51] 10Operations, 10Labs, 10wikitech.wikimedia.org: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#3331225 (10bd808) @EBernhardson is this bug report still valid or just ancient cruft? [05:03:37] 10Operations, 10Labs, 10wikitech.wikimedia.org: Turn on Cirrus replicas for labswiki (wikitech) - https://phabricator.wikimedia.org/T83760#3331228 (10EBernhardson) 05Open>03Resolved a:03EBernhardson It looks like everything in deployment-prep is using either one or two replicas. Should be fine. [05:21:11] PROBLEM - Host elastic1035 is DOWN: PING CRITICAL - Packet loss = 100% [05:21:41] RECOVERY - Host elastic1035 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [05:29:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3331237 (10Marostegui) Sorry @MarcoAurelio that was almost 10:30pm our time :-( [05:40:22] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 [05:40:28] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 [05:41:43] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 (owner: 10Marostegui) [05:42:39] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 (owner: 10Marostegui) [05:42:52] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357756 (owner: 10Marostegui) [05:43:39] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1076 - T166205 (duration: 00m 45s) [05:43:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:49] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [05:49:45] (03PS1) 10Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) [05:50:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui) [05:52:01] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui) [05:52:14] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357757 (https://phabricator.wikimedia.org/T166205) (owner: 10Marostegui) [05:52:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1074 - T166205 (duration: 00m 43s) [05:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:09] T166205: Convert unique keys into primary keys for some wiki tables on s2 - https://phabricator.wikimedia.org/T166205 [05:54:24] !log Deploy alter table s2 - db1074 - T166205 [05:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:34] <_joe_> !log uploading new service-checker version to reprepro, T167048 [05:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:43] T167048: Services need external monitoring - https://phabricator.wikimedia.org/T167048 [06:07:34] 10Operations, 10Monitoring, 10Services (next), 10User-Joe, 10User-mobrovac: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3331247 (10Joe) @faidon at first I was thinking of implementing the checks on the LVS host (in the end, the puppettization is mostly the same), but I t... 
[06:15:33] (03CR) 10Muehlenhoff: [C: 031] check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [07:00:05] !log Drop table updates on s6 - T139342 [07:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:15] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342 [07:38:39] (03PS1) 10Elukey: Remove webrequest_maps topic from Camus configuration [puppet] - 10https://gerrit.wikimedia.org/r/357768 [07:45:35] (03CR) 10Elukey: "Changes looks good: https://puppet-compiler.wmflabs.org/6692/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/357768 (owner: 10Elukey) [08:02:44] (03CR) 10Joal: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/357768 (owner: 10Elukey) [08:03:20] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3331391 (10elukey) Based on several guides like http://download.intel.com/support/motherboards/server/sb/configuring_raid_for_opti... [08:04:45] (03CR) 10Elukey: [C: 032] Remove webrequest_maps topic from Camus configuration [puppet] - 10https://gerrit.wikimedia.org/r/357768 (owner: 10Elukey) [08:17:10] jynus / marostegui / legoktm -- Available for supervising T166028 ? [08:17:10] T166028: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028 [08:18:43] TabbyCat: I am around yes [08:19:03] if you think it is safe to proceed marostegui I can do it [08:19:22] TabbyCat: Yes, give me a sec to open up some extra tabs [08:19:27] sure thing [08:19:33] logstash, fatalerror, etc [08:19:57] give me a shout when you're ready :) [08:20:05] haha will do :) [08:21:17] TabbyCat: Ready to check out the dbs!
[08:21:32] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3331397 (10MarcoAurelio) Being handled in a minute. [08:21:49] (03PS5) 10Muehlenhoff: Use new repository layout for stretch onwards [puppet] - 10https://gerrit.wikimedia.org/r/357559 (https://phabricator.wikimedia.org/T158583) [08:21:59] !log Starting big global rename as requested in T166028 [08:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:24] marostegui: can I start it now then? [08:22:31] TabbyCat: go for it [08:22:36] oki [08:24:39] Jobs to rename Mlpearc to FlightTime have been queued on . [08:29:20] marostegui: how is it going? :) globalrenameprogress is not failing for me :) [08:29:42] TabbyCat: so far I see no issues [08:29:49] it is doing enwiki right now [08:29:57] the wiki with more edits [08:31:09] enwiki is done [08:31:26] I am seeing some lag [08:31:29] On enwiki [08:31:34] (03PS3) 10Ema: check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) [08:31:39] But not too worrying so far [08:31:48] (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: load ipmi_devintf [puppet] - 10https://gerrit.wikimedia.org/r/357617 (https://phabricator.wikimedia.org/T125205) (owner: 10Ema) [08:32:34] Only one host lagging now [08:32:47] one of the recentchanges servers (db1051) [08:33:12] gone now [08:33:31] I've got to fix some page moves that didn't happened on enwiki [08:33:40] but after the rename finishes [08:34:05] the rc hosts in codfw are lagging behind (but that doesn't impact anything as codfw isn't active) [08:34:42] (03PS1) 10Ema: base::kernel: create /etc/modules-load.d [puppet] - 10https://gerrit.wikimedia.org/r/357772 [08:34:51] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:34:51] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:51] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:51] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:51] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:51] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:52] PROBLEM - puppet last run on graphite1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:52] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:53] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:53] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:59] o_O [08:35:00] moritzm, godog: we need https://gerrit.wikimedia.org/r/357772 to fix the puppet fails :( [08:35:04] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:11] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:11] PROBLEM - puppet last run on elastic2033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:35:11] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:11] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:11] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:12] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:12] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:13] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:13] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:14] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:14] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:15] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:15] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:21] PROBLEM - puppet last run on elastic2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:29] ema: I thought it was in base already? [08:35:31] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:35:31] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:31] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:32] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:32] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:35] godog: only for trusty [08:35:41] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:41] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:42] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:48] I'll shut ircecho [08:35:51] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:51] PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:51] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:51] PROBLEM - puppet last run on ms-fe1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:51] PROBLEM - puppet last run on elastic1042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:51] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:35:52] thanks [08:35:52] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:52] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:53] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:53] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:54] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:54] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:55] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:35:55] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:36:05] I'm going to set db1051 as non-transactional writes [08:36:22] !log temporarily stop ircecho on tegmen, puppet spam [08:36:24] it is fine now (hanging around 0-10 seconds) [08:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:33] thanks godog, was about to ask [08:37:14] godog: so basically https://gerrit.wikimedia.org/r/#/c/357617/3/modules/base/manifests/monitoring/host.pp has a require on File[/etc/modules-load.d], which is not defined on jessie because of the trusty conditional here https://gerrit.wikimedia.org/r/#/c/357591/2/modules/base/manifests/kernel.pp [08:37:40] ah ok, makes sense [08:37:59] ah, right. 
it's shipped on the dpkg level via systemd, but not on the puppet level [08:38:03] yup [08:38:34] TabbyCat: did it finish enwiki already? Lag is totally gone now [08:39:59] ema: can you adjust the comment stating that it can be removed once trusty is gone? other than that +1 [08:40:10] godog: sure thing [08:40:10] marostegui: yep, some minutes ago [08:40:23] marostegui: got to check the page moves later [08:40:26] ema: actually meh it'll fail the same way when we do that [08:40:35] cool, which wiki is it at now? [08:40:55] godog: unless we remove the unnecessary (after trusty is gone) require in host.pp [08:41:13] marostegui: mrwiki [08:41:21] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress?username=FlightTime [08:41:29] ema: yeah I'd say remove the require instead but leave the if there [08:41:44] TabbyCat: that is useful, thanks! [08:41:51] it'll be racy maybe on trusty for a while, but we don't care that much [08:43:24] godog: ok [08:45:19] (03PS1) 10Ema: check_ipmi_temp: do not require /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/357774 [08:45:51] godog: ^ [08:46:43] (03CR) 10Filippo Giunchedi: [C: 031] check_ipmi_temp: do not require /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/357774 (owner: 10Ema) [08:47:03] (03CR) 10Ema: [V: 032 C: 032] check_ipmi_temp: do not require /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/357774 (owner: 10Ema) [08:48:05] ema: btw for recovery I think we can test run-puppet-agent --failed-only ! 
[08:48:16] godog: nice one, yeah [08:48:24] 5 wikis to go marostegui [08:48:32] 4 [08:48:38] godog: confirmed that the puppetfail is fixed on cp1008 [08:48:39] hehe yeah, I see [08:49:20] finished :D [08:49:24] \o/ [08:49:29] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3286412 (10JAllemandou) Looks good to me (even if I don't understand in depth what it means). I particularly like the idea of havi... [08:49:40] at least on that special page I mentioned, not sure from your part [08:50:02] godog: I'm gonna run-puppet-agent --failed-only across the fleet then [08:50:03] yes, everything looks fine [08:50:38] !log Rename user "Mlpearc" to "FlightTime" on Central Auth is now finished (T166028) [08:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:46] T166028: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028 [08:51:48] ema, godog: but we still need a followup fix of some sort, otherwise it's not ensured on trusty hosts that /etc/modules-load.d/ is created before ipmi.conf gets created? [08:52:03] (03Abandoned) 10Ema: base::kernel: create /etc/modules-load.d [puppet] - 10https://gerrit.wikimedia.org/r/357772 (owner: 10Ema) [08:53:37] 10Operations, 10DBA, 10Wikimedia-Site-requests: Rename user "Mlpearc" to "FlightTime" on Central Auth - https://phabricator.wikimedia.org/T166028#3331434 (10MarcoAurelio) 05Open>03Resolved p:05Triage>03Normal a:03MarcoAurelio Global rename is now done. Thanks to @Marostegui for his help during the... [08:53:47] moritzm: can we conditionally set the require in puppet? 
[08:55:14] moritzm: yeah we could, hopefully though we don't make any new trusty reinstall :D [08:55:27] also it'll converge on the second puppet run if it fails the first on trusty [08:57:51] ema: there might be hack, but not sure [08:58:17] godog: yeah, it's probably not a big deal, we can also choose to ignore it [08:58:23] !log swift eqiad-prod eqiad-prod: decom ms-be1005/6/7 - T166489 [08:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:31] T166489: Decommission ms-be1001 - ms-be1012 - https://phabricator.wikimedia.org/T166489 [08:58:54] (03PS1) 10Elukey: Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775 [08:59:21] so cumin seems to be ignoring -p? [08:59:32] cumin -p 0 -b 12 bla bla [08:59:42] 7.3% (93/1272) success ratio (< 100.0% threshold) [08:59:47] volans: ^ [09:01:41] ema: LMK when it is ok to renable puppet/ircecho on tegmen btw [09:02:12] (03PS2) 10Elukey: Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775 [09:02:52] godog: yeah currently struggling with cumin, most likely PEBKAC [09:03:20] ema: looking [09:03:38] (03CR) 10Elukey: "I think that this change might not be needed since the client role is only included in the server one afaics, but I'll merge anyway to kee" [puppet] - 10https://gerrit.wikimedia.org/r/357775 (owner: 10Elukey) [09:03:51] what is the exact problem? 
the -p decides whether it aborts or not, but still shows the results at the end [09:04:11] (03CR) 10Muehlenhoff: [C: 031] Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775 (owner: 10Elukey) [09:04:33] volans: I'm trying to run-puppet-agent --failed-only across the fleet [09:04:47] I don't care about the exit status of puppet-agent of course [09:04:52] (03CR) 10Elukey: [C: 032] Bump Debian Jessie zookeeper version to 3.4.5+dfsg-2+deb8u2 [puppet] - 10https://gerrit.wikimedia.org/r/357775 (owner: 10Elukey) [09:05:02] volans: sudo cumin -d --success-percentage 0 -b 12 '*' 'run-puppet-agent --failed-only || true' [09:05:11] no need for the || true [09:05:48] well at any rate this fails with [09:05:49] 11.6% (148/1272) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting [09:05:59] !log upgrade zookeeper packages to 3.4.5+dfsg-2+deb8u2 on conf100[123], conf200[23] and druid100[123] [09:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:27] volans: is there a way to let it just continue regardless of the number of fails?
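(Editor's note: the figures in the abort message above can be checked by hand. The sketch below is not cumin code; `success_ratio` and `should_abort` are hypothetical helpers, and the rule "abort when the ratio is below the threshold" is an assumption read off the wording of the message. It only illustrates that 148 of 1272 hosts is 11.6%, and that under such a rule a threshold of 0 could never trigger an abort, whereas a threshold of 100% would.)

```python
def success_ratio(succeeded: int, total: int) -> float:
    """Fraction of hosts that succeeded; 0.0 when there are no hosts."""
    return succeeded / total if total else 0.0


def should_abort(succeeded: int, total: int, threshold: float) -> bool:
    """Assumed rule: abort only when the success ratio is below the threshold."""
    return success_ratio(succeeded, total) < threshold


if __name__ == "__main__":
    # Figures quoted in the log: 148 of 1272 hosts succeeded.
    print(f"{success_ratio(148, 1272):.1%}")
    # With a 100% threshold the run aborts; with a 0% threshold it cannot.
    print(should_abort(148, 1272, threshold=1.0))
    print(should_abort(148, 1272, threshold=0.0))
```

(The message reports a 100.0% threshold even though 0 was requested on the command line, which is what made the abort surprising.)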
[09:07:01] (I thought that's what --success-percentage 0 was for) [09:07:31] ema: use 1 for now I'm looking at a possible bug, and you don't need the || true [09:07:48] volans: ack, trying -p 1 [09:08:28] also the real "failure" will probably be under few percent so like -p 95 because only the unreachable hosts should fail, all the others should skip or run puppet (and then maybe fail, but hopefully not if is fixed) [09:08:51] yup [09:09:24] (03CR) 10Hashar: [C: 031] Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 (owner: 10Faidon Liambotis) [09:10:31] volans: also I don't think I got any different output by using -d [09:10:45] ema: no that's debug level in the logs [09:10:50] you got a loooot more there ;) [09:10:57] /var/log/cumin/cumin.log [09:10:57] ah fair enough :) [09:11:11] -d, --debug Set log level to DEBUG. [09:11:13] :-P [09:12:48] (03CR) 10Hashar: [C: 031] check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis) [09:12:58] volans: there's definitely a bug there, -p 1 works while -p 0 doesn't [09:13:13] yep! I think I've found it... debugging ;) [09:13:17] thanks! [09:13:42] thank you for reporting it ;) and sorry for the trouble [09:15:50] (03PS4) 10Elukey: beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 (owner: 10Hashar) [09:17:16] (03CR) 10Elukey: [C: 032] beta: profile::cassandra::allow_analytics: false [puppet] - 10https://gerrit.wikimedia.org/r/357344 (owner: 10Hashar) [09:17:32] (03CR) 10Hashar: [C: 031] check_puppetrun: fix rubocop warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis) [09:18:13] (03CR) 10Hashar: [C: 031] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [09:18:43] hashar: thanks for the patch! 
Are you going to remove the cherry pick on the depl-prep puppet master or do you want me to do it? [09:18:48] godog: done, feel free to re-enable ircecho [09:18:52] and thanks :) [09:19:08] elukey: good morning. Which change/patch are you referring to ? [09:19:17] beta: profile::cassandra::allow_analytics: false :) [09:19:22] elukey: if it got merged, there is a crontab entry that automagically rebase the puppet repo for us [09:19:46] note I have absolutely no clue what that settings is actually doing and whether it should be false or true on beta :-} [09:19:47] hashar: sure, but yesterday due to a cherry picked patch (scap3 + jobrunners) the sync was broken [09:19:54] ahhh [09:20:06] yeah and I guess the cron falling does not trigger any mail notification so that is left unnoticed [09:20:07] ema: ack, {{done}} [09:20:16] hashar: yeah [09:20:22] though there might be Shinken prob to report an error [09:20:55] (03PS1) 10Alexandros Kosiaris: puppetmaster: Set stringify_facts = false [puppet] - 10https://gerrit.wikimedia.org/r/357776 [09:25:36] elukey: wmflabs might have a way to send email to all project admins. Maybe that could help [09:26:12] yeah [09:37:10] (03PS5) 10Alexandros Kosiaris: nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [09:37:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] nagios_common: basic spec for contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/331490 (owner: 10Hashar) [09:40:05] (03CR) 10Hashar: "I can't tell whether the self.xx are actually needed :( But Dan would know for sure!" 
[puppet] - 10https://gerrit.wikimedia.org/r/357718 (owner: 10Faidon Liambotis) [09:41:25] !log updating mysql-connector-java on hadoop cluster [09:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:51] (03PS2) 10Alexandros Kosiaris: Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 (owner: 10Faidon Liambotis) [09:43:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix whitespace-related Rubocop warnings across the tree [puppet] - 10https://gerrit.wikimedia.org/r/357715 (owner: 10Faidon Liambotis) [09:44:09] (03PS2) 10Alexandros Kosiaris: check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis) [09:44:14] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] check_puppetrun: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357716 (owner: 10Faidon Liambotis) [09:46:31] moritzm: everything looks good, starting the upgrade of the zk eqiad cluster [09:46:39] ok! [09:48:51] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:48:55] (03CR) 10Hashar: [C: 031] "There is still a $DIR global variable, but that is not important for this standalone script." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis) [09:50:53] conf1001 done, all good [09:51:01] will do conf1002 shortly [09:51:09] (03CR) 10Hashar: [C: 031] scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis) [09:51:11] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [09:53:53] ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167393 [09:54:00] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167393#3331591 (10ops-monitoring-bot) [09:55:59] (03CR) 10Hashar: [C: 04-1] "I cannot reproduce." [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis) [09:59:23] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Request access to analytics-privatedata-users for GoranSMilovanovic - https://phabricator.wikimedia.org/T167116#3331604 (10GoranSMilovanovic) @Dzahn Thanks a lot! [10:00:57] moritzm: all done, zk upgraded everywhere [10:01:06] (main-eqiad/main-codfw/druid) [10:01:32] (03CR) 10Hashar: "I would make that patch to bump rubocop version to 0.49.1 in the Gemfile. Looks like that is ready to be the final patch in the serie \O/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis) [10:01:41] ok, great [10:01:51] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [10:02:13] whhhaaat [10:02:17] checking --^ [10:03:13] ah this is a stupid race condition, it failed while restarting the last conf1* node [10:03:31] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:04:31] looking ^ [10:04:35] one big spike, seems recovered [10:04:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:04:51] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [10:05:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:08:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [10:08:44] (03PS1) 10Muehlenhoff: Update to 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357783 [10:11:31] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:13:41] PROBLEM - DPKG on d-i-test is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:14:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:16:54] (03CR) 10Hashar: [C: 031] "> 5 seconds on every job is quite a bit -- and on the CI instances with empty pagecaches and in VMs in a shared infrastructure without SSD" [puppet] - 10https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: 10Volans) [10:19:49] (03PS1) 10Volans: Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) [10:19:51] (03PS1) 10Volans: Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 
(https://phabricator.wikimedia.org/T167394) [10:23:41] RECOVERY - DPKG on d-i-test is OK: All packages OK [10:24:39] (03PS1) 10Muehlenhoff: Update symbols for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357786 [10:24:53] (03CR) 10Muehlenhoff: [C: 032] Update to 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357783 (owner: 10Muehlenhoff) [10:25:13] (03CR) 10Muehlenhoff: [C: 032] Update symbols for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357786 (owner: 10Muehlenhoff) [10:30:52] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3331701 (10elukey) Better view: ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'megacli -... [10:31:13] (03PS1) 10Ema: VCL: update wikiScrape regex [puppet] - 10https://gerrit.wikimedia.org/r/357787 [10:31:41] elukey: random policy per disk? :D [10:32:23] volans: yes! 
trying to set only one for all of them, if you have suggestions please let me know [10:32:26] fun times [10:32:43] atm I am removing Write cache OK if bad BBU [10:32:49] that seems really wrong [10:33:43] it depends IMHO [10:33:51] but a bit busy now, I can explain later [10:35:10] sure [10:39:36] (03PS1) 10Hashar: DO NOT SUBMIT: dump git info for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/357788 [10:41:50] (03Abandoned) 10Hashar: DO NOT SUBMIT: dump git info for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/357788 (owner: 10Hashar) [10:42:16] (03PS1) 10Muehlenhoff: Update man-sections patch for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357790 [10:45:43] (03CR) 10Muehlenhoff: [C: 032] Update man-sections patch for 1.1.0f [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357790 (owner: 10Muehlenhoff) [10:49:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3310890 (10hashar) Thank you @catrope @demon for the description of how patches are prepared to be tested and the gate system. I endorse your desc... [10:53:53] ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167398 [10:53:57] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167398#3331752 (10ops-monitoring-bot) [10:55:01] godog: double task today too? what happened? 
[10:59:48] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/355871 (owner: 10Dzahn) [11:18:34] (03PS1) 10Ayounsi: Add mock rancid ssh key [labs/private] - 10https://gerrit.wikimedia.org/r/357791 [11:18:42] (03CR) 10Ema: [C: 031] "Thanks for fixing this!" [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) (owner: 10Volans) [11:19:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3331797 (10hashar) >>! In T166888#3316057, @greg wrote: > Looking at the data we have it seems that the tests themselves take about [[ https://int... [11:20:23] (03CR) 10Ema: [C: 031] Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) (owner: 10Volans) [11:20:42] (03CR) 10Volans: [C: 031] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/357791 (owner: 10Ayounsi) [11:21:07] thanks ema! 
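(Editor's note: the two patch titles above, "fix success_threshold getter when set to 0" and "fix ok_codes getter for empty list", describe a common Python pitfall: a getter that falls back to a default with `or` treats `0` and `[]` as "not set". The sketch below is a minimal illustration of that pitfall, not the actual cumin source; the class names, attribute layout, and default values are assumptions chosen to match the behaviour reported in the log, where `-p 1` worked and `-p 0` did not.)

```python
class BuggyTarget:
    """Hypothetical getters that confuse 'falsy' with 'unset'."""

    def __init__(self, success_threshold=None, ok_codes=None):
        self._success_threshold = success_threshold
        self._ok_codes = ok_codes

    @property
    def success_threshold(self):
        # Bug: 0 is falsy, so an explicit 0 silently becomes the default 1.0.
        return self._success_threshold or 1.0

    @property
    def ok_codes(self):
        # Same bug: an explicit empty list silently becomes the default [0].
        return self._ok_codes or [0]


class FixedTarget(BuggyTarget):
    """Same getters, but only None counts as 'unset'."""

    @property
    def success_threshold(self):
        return 1.0 if self._success_threshold is None else self._success_threshold

    @property
    def ok_codes(self):
        return [0] if self._ok_codes is None else self._ok_codes


if __name__ == "__main__":
    print(BuggyTarget(success_threshold=0).success_threshold)  # default wins
    print(FixedTarget(success_threshold=0).success_threshold)  # explicit 0 kept
```

(A small non-zero value is truthy, so it passes through the `or` unchanged; that is consistent with the workaround of using `-p 1` until the fix was merged.)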
[11:21:22] (03PS2) 10Volans: Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) [11:23:34] (03CR) 10Ayounsi: [V: 032 C: 032] Add mock rancid ssh key [labs/private] - 10https://gerrit.wikimedia.org/r/357791 (owner: 10Ayounsi) [11:24:47] (03CR) 10Volans: [C: 032] Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) (owner: 10Volans) [11:25:30] (03Merged) 10jenkins-bot: Transports: fix success_threshold getter when set to 0 [software/cumin] - 10https://gerrit.wikimedia.org/r/357784 (https://phabricator.wikimedia.org/T167392) (owner: 10Volans) [11:28:26] (03PS2) 10Volans: Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) [11:38:21] PROBLEM - Host mw1294 is DOWN: PING CRITICAL - Packet loss = 100% [11:41:13] !log powercycling mw1294, mgmt is unresponsive [11:41:19] !log Drop table updates on s7 - T139342 [11:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:32] T139342: DROP OAI-related tables - https://phabricator.wikimedia.org/T139342 [11:43:34] (03CR) 10Volans: [C: 032] Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) (owner: 10Volans) [11:43:45] (03Merged) 10jenkins-bot: Transports: fix ok_codes getter for empty list [software/cumin] - 10https://gerrit.wikimedia.org/r/357785 (https://phabricator.wikimedia.org/T167394) (owner: 10Volans) [11:44:41] RECOVERY - Host mw1294 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [11:46:41] PROBLEM - nutcracker process on mw1294 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), 
command name nutcracker [11:47:41] RECOVERY - nutcracker process on mw1294 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [11:53:37] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3331924 (10hashar) And my final reply, following up on T166888#3322963 that does a breakdown of the job build steps. > 1. It clones the repositor... [11:58:45] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3331935 (10MoritzMuehlenhoff) [11:59:06] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3331949 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:02:17] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3331951 (10MoritzMuehlenhoff) [12:02:22] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 136.75, 104.80, 73.54 [12:13:21] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 57.10, 77.45, 78.59 [12:13:41] (03PS3) 10Faidon Liambotis: hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 [12:13:43] (03PS4) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 [12:13:45] (03PS4) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 [12:13:47] (03PS4) 10Faidon Liambotis: rubocop: update rubocop to rubocop 0.49.1 [puppet] - 10https://gerrit.wikimedia.org/r/357722 [12:15:43] (03CR) 10jerkins-bot: [V: 04-1] rubocop: update rubocop to rubocop 0.49.1 [puppet] - 10https://gerrit.wikimedia.org/r/357722 (owner: 10Faidon Liambotis) [12:15:55] huh, still didn't like that [12:16:18] maybe it's 0.48 -> 0.49 :) [12:17:31] !log updated hhvm 3.18.2-dfsg-1+wmf5 to apt.wikimedia.org [12:17:38] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:46] !log uploaded hhvm 3.18.2-dfsg-1+wmf5 to apt.wikimedia.org [12:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:21] (03PS1) 10Faidon Liambotis: Fix another couple instances of RuboCop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357797 [12:19:57] (03CR) 10Hashar: [C: 04-1] "Faidon pointed rubocop/target_finder.rb list .fcgi has a ruby extension. That got introduced in rubocop 0.48.0." [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis) [12:19:59] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Fix another couple instances of RuboCop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357797 (owner: 10Faidon Liambotis) [12:23:10] (03CR) 10Hashar: [C: 031] "That ignore is fine. The fault is rubocop solely rely on the file extension and blindly consider .fcgi files to be ... ruby." [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis) [12:23:56] (03PS1) 10Faidon Liambotis: nginx: tiny whitespace fix to make RuboCop happy [puppet/nginx] - 10https://gerrit.wikimedia.org/r/357798 [12:24:19] hashar: do we do this magic where submodules get updated automatically? 
[12:24:57] (03CR) 10Faidon Liambotis: [C: 032] nginx: tiny whitespace fix to make RuboCop happy [puppet/nginx] - 10https://gerrit.wikimedia.org/r/357798 (owner: 10Faidon Liambotis) [12:25:36] god I hate submodules so much [12:26:14] (03PS1) 10Faidon Liambotis: Update nginx submodule to include lint/rubocop fixes [puppet] - 10https://gerrit.wikimedia.org/r/357799 [12:30:25] (03CR) 10Faidon Liambotis: [C: 032] Update nginx submodule to include lint/rubocop fixes [puppet] - 10https://gerrit.wikimedia.org/r/357799 (owner: 10Faidon Liambotis) [12:31:38] (03PS3) 10Faidon Liambotis: wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 [12:31:40] (03PS3) 10Faidon Liambotis: puppetmaster: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357718 [12:31:42] (03PS4) 10Faidon Liambotis: hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 [12:31:44] (03PS5) 10Faidon Liambotis: scap: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357720 [12:31:46] (03PS5) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 [12:31:48] (03PS5) 10Faidon Liambotis: rubocop: update rubocop to rubocop 0.49.1 [puppet] - 10https://gerrit.wikimedia.org/r/357722 [12:38:38] paravoid: the submodules are not automagically updated by Gerrit in operations/puppet [12:38:49] yeah saw that [12:38:56] I guess this way folks can do their hack/dev in the submodule, and bumping the submodule require an explicit commit bump [12:39:41] the Jenkins job clone all submodules though. So maybe they will each need to be bumped to rubocop 0.49.1 [12:39:52] !log updating mwdebug* to HHVM 3.18.2+wmf5 [12:39:52] then in operations/puppet the change that bumps rubocop will also bump the submodules [12:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:06] hey, the rubocop 0.49.1 got V+2 :) [12:40:12] !!!! 
[12:40:37] AH .rubocop.yml excludes the git submodules \O/ [12:51:40] (03PS1) 10Faidon Liambotis: rubocop: remove stale comments from _todo.yml [puppet] - 10https://gerrit.wikimedia.org/r/357801 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1300). Please do the needful. [13:01:40] * aude has stuff for swat [13:01:56] looks like there's nothing else today [13:06:45] (03CR) 10Giuseppe Lavagetto: [C: 031] "Minor nit but LGTM; the output is satisfactory and the code is overall very clear." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [13:11:28] (03CR) 10Giuseppe Lavagetto: "So, I am conflicted, as this removes a safety net given it's not hard to mess up the format/syntax of redirects.dat." [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [13:12:21] PROBLEM - HHVM rendering on mw2245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:13:00] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "see comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357733 (owner: 10Faidon Liambotis) [13:13:12] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 75503 bytes in 0.263 second response time [13:14:56] (03PS1) 10Hashar: Rake: optimize typos task for CI [puppet] - 10https://gerrit.wikimedia.org/r/357804 (https://phabricator.wikimedia.org/T166888) [13:15:14] haha [13:15:17] I was just fixing that :) [13:15:28] ;D [13:16:48] bah the run took 1min25s ... [13:17:25] I think this Rakefile needs to be written from scratch with performance in mind honestly [13:17:41] what a bold statement ;} [13:17:56] git_changed_in_head() is called a few times now, every time spawning git [13:18:08] most commands don't use it etc. 
[13:18:18] I think the rakefile should start from there [13:18:21] find the git_changed_in_head [13:18:24] it can probably cache the command output [13:18:32] then filter that, and depending on the files found [13:18:40] 'require' (ruby) modules conditionally [13:18:53] and run against them [13:19:59] (03PS1) 10Giuseppe Lavagetto: cache: add monitoring of services at the SSL termination level [puppet] - 10https://gerrit.wikimedia.org/r/357805 (https://phabricator.wikimedia.org/T167048) [13:20:11] <_joe_> mobrovac: ^^ [13:20:38] (03PS2) 10Aude: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) [13:21:03] Hello - IPv6 for en.wikipedia.org seems to be down, is this known? [13:21:20] no [13:21:24] and I can confirm [13:21:26] XioNoX: ^^^ [13:21:42] wtf [13:22:03] looking [13:22:09] blackhole? [13:22:15] did you do anything? [13:22:30] I added v4 blackhole IPs [13:22:45] yeah it's a bug in the blackhole ACL probably [13:22:48] revert [13:22:56] rolling back [13:23:25] 2c8b414bf11ddbe997e731638395dc0352e1dfa2 is new and hasn't been tested before I think :( [13:23:33] probably needs to be split in blackhole4/6 [13:24:14] push in progress with JNT [13:24:25] l_bratch: thanks, I was wondering if my IPv6 was broken, so that definitely helped :) [13:24:33] also: sad that we didn't catch this by alerting :( [13:24:33] no problem :) [13:25:02] _joe_: is the service-checker pkg already installed on the caches? 
[13:25:21] <_joe_> it's included in service::monitoring [13:25:28] duh ofc [13:25:28] kk [13:25:41] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 518.51 seconds [13:25:49] I will get that [13:25:51] (03CR) 10Mobrovac: [C: 031] cache: add monitoring of services at the SSL termination level [puppet] - 10https://gerrit.wikimedia.org/r/357805 (https://phabricator.wikimedia.org/T167048) (owner: 10Giuseppe Lavagetto) [13:26:13] aude: can you please ping the channel when you're done with swat? I'd like to update some app servers after that [13:26:29] 10Operations, 10Monitoring, 10Patch-For-Review, 10Services (next), and 2 others: Services need external monitoring - https://phabricator.wikimedia.org/T167048#3332136 (10mobrovac) a:05mobrovac>03Joe [13:26:37] _joe_: I still disagree fwiw [13:26:53] <_joe_> paravoid: I'm working on the other option :) [13:27:08] by that logic we should be making all of our checks against all caches :) [13:27:14] but I don't think it's the right thing to do [13:27:33] we should have (and have!) cache checks that check that the cluster is healthy and in-sync etc. [13:27:35] <_joe_> but I got distracted by how hard it is to define that correctly in the current form of role::lvs::balancer, which would be where it would make sense [13:27:52] and then higher-level checks that check their own thing against the service IP [13:28:01] (03CR) 10Ema: cache: add monitoring of services at the SSL termination level (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357805 (https://phabricator.wikimedia.org/T167048) (owner: 10Giuseppe Lavagetto) [13:28:35] basically: each check should be doing its own thing, not trying to find unrelated (e.g. cache config coherency) issues [13:28:39] it's back! 
[13:28:54] <_joe_> paravoid: fair enough [13:29:31] <_joe_> mobrovac: I'll prepare a patch to check just at the LVS level in all DCs [13:29:55] paravoid, l_bratch, config rollback done [13:30:17] thanks XioNoX [13:30:23] probably worth doing !log for this kind of thing [13:30:47] hashar: so yeah, it's not about caching git_changed_in_head [13:30:48] _joe_: why at the lvs level now? we already have that? [13:30:59] looks good XioNoX, thanks :) [13:31:04] hashar: the whole thing should be made lazy [13:31:04] <_joe_> mobrovac: no, I mean the lvs level for the caches [13:31:12] ah ok [13:31:13] !log blackhole v4 IPs removed from all cr* routers [13:31:14] <_joe_> not the application lvs [13:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:31] !log aude@tin Synchronized php-1.30.0-wmf.4/extensions/Wikidata: Fix warning in date formatting T167360 (duration: 02m 16s) [13:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:41] T167360: "Warning: in_array() expects parameter 2 to be an array or collection" from Wikibase MwTimeIsoFormatter - https://phabricator.wikimedia.org/T167360 [13:34:13] (03PS3) 10Bmansurov: Enable ElectronPdf on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) [13:35:19] (03PS1) 10Muehlenhoff: Fix typo in symbols file [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357809 [13:35:50] (03CR) 10Muehlenhoff: [C: 032] Fix typo in symbols file [debs/openssl11] - 10https://gerrit.wikimedia.org/r/357809 (owner: 10Muehlenhoff) [13:36:51] (03PS1) 10Faidon Liambotis: Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810 [13:37:15] paravoid: I split v4/v6 in jnt, do you have some time to verify that it's working properly after I push it to a site? 
(or anyone else with v6 connectivity) [13:37:29] XioNoX: let me review the config first :) [13:38:05] (03CR) 10jerkins-bot: [V: 04-1] Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810 (owner: 10Faidon Liambotis) [13:38:33] paravoid: pushed to git [13:39:06] XioNoX: you should had mentioned that it's not just splitting but also adding those v4 IPs (or make them two separate commits) [13:39:17] but other than that it looks OK [13:39:32] it's a little weird why this happened in the first place though, I'm wondering why [13:40:30] yeah, I already had the case where an empty list would make the router block all traffic (interprete it as 0/0) [13:41:10] but yeah, unexpected behavior that it does it only for one protocol [13:41:25] !log aude@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [13:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:11] ugh [13:44:24] paravoid: pushing it to ulsfo as it's probably the pop serving the least users for now [13:44:28] no [13:44:29] wait :) [13:44:47] ok [13:45:44] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6694/ says noop so merging" [puppet] - 10https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis) [13:45:50] (03PS3) 10Alexandros Kosiaris: Remove to_i/Integer from now unstringified facts [puppet] - 10https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis) [13:45:52] first I want to understand why this happened [13:45:55] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove to_i/Integer from now unstringified facts [puppet] - 10https://gerrit.wikimedia.org/r/356032 (https://phabricator.wikimedia.org/T166372) (owner: 10Faidon Liambotis) [13:46:14] but think 
unrelated... [13:46:21] I see no obvious reason why [13:46:57] (03PS2) 10Alexandros Kosiaris: puppetmaster: Set stringify_facts = false [puppet] - 10https://gerrit.wikimedia.org/r/357776 [13:47:04] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] puppetmaster: Set stringify_facts = false [puppet] - 10https://gerrit.wikimedia.org/r/357776 (owner: 10Alexandros Kosiaris) [13:47:25] !log aude@tin Synchronized php-1.30.0-wmf.4/extensions/RevisionSlider: Fix fatal error: T167359 (duration: 00m 44s) [13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:34] T167359: Catchable fatal error: Argument 2 passed to RevisionSliderHooks::onDiffViewHeader() must be an instance of Revision, null given - https://phabricator.wikimedia.org/T167359 [13:47:49] 10Operations, 10Labs: virbr0 interface present in some virt hosts - https://phabricator.wikimedia.org/T83732#3332217 (10chasemp) 05Open>03Resolved a:03chasemp It seems not, I'm going to close this but anyone who knows differently please reopen ```for i in `cat labvirt`; do echo $i; ssh $i.eqiad.wmnet 'i... [13:47:55] XioNoX: any ideas? [13:48:15] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3332220 (10elukey) Current status is: ``` elukey@neodymium:~$ sudo cumin 'R:class = role::analytics_cluster::hadoop::worker' 'meg... 
[13:48:21] thinking [13:49:32] (03CR) 10Aude: [C: 032] Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude) [13:50:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:53:36] (03PS12) 10Faidon Liambotis: mediawiki: puppet compiler for Tim's redirects DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [13:54:14] (03CR) 10Faidon Liambotis: [C: 032] mediawiki: puppet compiler for Tim's redirects DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 (owner: 10Ori.livneh) [13:54:36] single spike again [13:55:03] 10Operations, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3332228 (10chasemp) 05Resolved>03Open ```elukey@deployment-aqs03:~$ dig -x 10.68.17.125 +short elukey ci-jessie-wi... 
[13:56:56] (03PS3) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 [13:57:57] 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332240 (10chasemp) [13:58:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:59:03] (03CR) 10Aude: [C: 032] Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude) [13:59:15] <_joe_> paravoid: don't forget to fix the source/content error :) [13:59:26] _joe_: oh I didn't see that [14:00:11] (03Merged) 10jenkins-bot: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude) [14:00:21] (03CR) 10jenkins-bot: Don't enable Wikibase data access yet for beta wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357601 (https://phabricator.wikimedia.org/T158324) (owner: 10Aude) [14:01:29] 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332329 (10chasemp) Related? ```Commit: d3dc61097073773b308f2cc1bb9352c4aea61be8 Author: Alexandros Kosiaris Date: (5 hours ago) 2017-06-08... [14:02:38] !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: Do not enable Wikibase data access yet on beta wiktionary (duration: 00m 43s) [14:02:41] 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332347 (10chasemp) This is probably from an operation against this fact: > sudo facter -p | grep swapsize_mb swapsize_mb => 24443.99 Where that fact is now a string... 
[14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:20] done [14:04:21] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:21] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:04:30] paravoid: "If no conditions match, the router rejects the address. An empty prefix list results in an automatic permit of the tested address." [14:04:54] 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332386 (10chasemp) p:05Triage>03Normal [14:05:01] PROBLEM - nutcracker process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:05:21] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [14:05:21] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:05:51] RECOVERY - nutcracker process on thumbor1001 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [14:06:10] paravoid: that's the closer I can find to an explanation, even though it still doesn't make sens [14:06:30] I can open a ticket with juniper to know more, but not sure that would be helpful anyway [14:13:07] 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332403 (10chasemp) turns out 37b83e8b2c04a58f555ee5627a415561ab792d26 unintentionally resulted in this ```diff --git a/modules/toollabs/templates/gridengine/host-vmem.er... [14:14:07] (03CR) 10Daniel Kinzler: [C: 04-1] "We do want this, but we have not yet decided when. 
We should proabably wait at least until we have good data in testwikidatawiki.wb_term.t" [software/labsdb-auditor] - 10https://gerrit.wikimedia.org/r/357369 (https://phabricator.wikimedia.org/T167114) (owner: 10Ladsgroup) [14:14:14] 10Operations, 10Labs: Tools puppet failing: Detail: undefined method `>>' for "24443.99":String - https://phabricator.wikimedia.org/T167412#3332404 (10chasemp) Quoting @faidon from irc: ```yeah my suggestion wrt this would be a) swap = 3*ram is just silly obsolete advice, half a gig of swap should be plenty/e... [14:14:37] 10Operations, 10Labs: host-vmem.erb is doing operations that make no sense - https://phabricator.wikimedia.org/T167412#3332409 (10chasemp) [14:14:51] 10Operations, 10Labs, 10Patch-For-Review, 10cloud-services-team (Kanban): rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955#3332411 (10chasemp) [14:14:53] 10Operations, 10Labs: host-vmem.erb is doing operations that make no sense - https://phabricator.wikimedia.org/T167412#3332240 (10chasemp) [14:15:03] (03PS1) 10Faidon Liambotis: Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) [14:15:29] (03PS1) 10Alexandros Kosiaris: toollabs: Fix memorysize_mb integer casts [puppet] - 10https://gerrit.wikimedia.org/r/357817 (https://phabricator.wikimedia.org/T167412) [14:16:10] (03CR) 10Faidon Liambotis: [C: 032] Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) (owner: 10Faidon Liambotis) [14:16:17] (03CR) 10Rush: [C: 031] toollabs: Fix memorysize_mb integer casts [puppet] - 10https://gerrit.wikimedia.org/r/357817 (https://phabricator.wikimedia.org/T167412) (owner: 10Alexandros Kosiaris) [14:16:51] (03CR) 10Alexandros Kosiaris: [C: 031] Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 
(https://phabricator.wikimedia.org/T167412) (owner: 10Faidon Liambotis) [14:17:06] (03Abandoned) 10Alexandros Kosiaris: toollabs: Fix memorysize_mb integer casts [puppet] - 10https://gerrit.wikimedia.org/r/357817 (https://phabricator.wikimedia.org/T167412) (owner: 10Alexandros Kosiaris) [14:17:58] (03PS2) 10Faidon Liambotis: Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) [14:19:09] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Revert swapsize_mb/memorysize_mb unstringification [puppet] - 10https://gerrit.wikimedia.org/r/357816 (https://phabricator.wikimedia.org/T167412) (owner: 10Faidon Liambotis) [14:20:25] (03PS4) 10Faidon Liambotis: mediawiki: use compile_redirects as a function [puppet] - 10https://gerrit.wikimedia.org/r/357733 [14:20:44] 10Operations, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3332427 (10hashar) From my digging: | May 9th | `652785` | June 8th | `692016` So I guess 505374 is a few months old. [14:21:29] (03PS1) 10Faidon Liambotis: raid: remove unused aac, twe, zfs [puppet] - 10https://gerrit.wikimedia.org/r/357819 [14:22:15] hashar: any clue what the error is @ https://gerrit.wikimedia.org/r/#/c/357810/ ? [14:22:33] (03CR) 10jerkins-bot: [V: 04-1] raid: remove unused aac, twe, zfs [puppet] - 10https://gerrit.wikimedia.org/r/357819 (owner: 10Faidon Liambotis) [14:23:11] (03PS11) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [14:23:12] paravoid: the rake task provided by rubocop does not seem to be compatible with rake 12 [14:24:07] ah, will be fixed with the new rubocop then? 
[14:24:15] I'll rebase :) [14:24:46] (03PS2) 10Faidon Liambotis: raid: remove unused aac, twe, zfs [puppet] - 10https://gerrit.wikimedia.org/r/357819 [14:25:46] paravoid: that is probably fixed in rubocop 0.38.0 there is a commit that claims to update the task for rake 11 ( https://github.com/bbatsov/rubocop/commit/88a200e59e10868450ceb4316ffc600d9a09b95c ) [14:26:57] (03PS2) 10Faidon Liambotis: Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810 [14:27:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:29:50] 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3332457 (10fgiunchedi) [14:29:52] 10Operations, 10Performance-Team, 10Thumbor: Backport python-schedule and add it to jessie-wikimedia - https://phabricator.wikimedia.org/T167287#3332455 (10fgiunchedi) 05Open>03Resolved I've uploaded `schedule` `0.3.2-1~bpo8+1` to Debian `jessie-backports` with its maintainer approval. 
[14:30:26] (03PS6) 10Faidon Liambotis: rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 [14:31:15] (03CR) 10Faidon Liambotis: [C: 032] rubocop: add smokeping.fcgi to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/357721 (owner: 10Faidon Liambotis) [14:31:28] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3332462 (10fgiunchedi) [14:31:30] 10Operations, 10Performance-Team, 10Thumbor: Package latest version of Thumbor and deploy it - https://phabricator.wikimedia.org/T167286#3323329 (10fgiunchedi) 05Open>03Resolved I've checked the diff and uploaded `thumbor` `6.3.2+git20170607-1` internally to `jessie-wikimedia` [14:33:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:34:40] (03CR) 10Filippo Giunchedi: [C: 031] wmflib: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357717 (owner: 10Faidon Liambotis) [14:35:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:35:30] (03CR) 10Mforns: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [14:38:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:39:34] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3332468 (10fgiunchedi) [14:39:36] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167393#3332471 (10fgiunchedi) [14:39:38] 10Operations, 10ops-eqiad: 
Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167398#3332472 (10fgiunchedi) [14:41:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:41:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:43:56] 10Operations, 10Mail: Increase email log retention period for the main email relays - https://phabricator.wikimedia.org/T167333#3325007 (10fgiunchedi) FWIW if we also want to store mail logs off-host a simple solution would be to syslog exim logs too, syslog hosts already have 90d retention in place. [14:44:41] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:45:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:48:01] 10Operations, 10Monitoring: Monitoring: add link to graph for Icinga timeseries alarms - https://phabricator.wikimedia.org/T167422#3332481 (10Volans) [14:48:32] (03CR) 10Filippo Giunchedi: [C: 031] "prometheus-node-exporter has been moved to main" [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [14:54:06] (03PS1) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 [14:54:23] (03CR) 10Volans: "LGTM, small comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357819 (owner: 10Faidon Liambotis) [14:54:35] !log 2 blackhole IPs pushed to cr* routers [14:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:51] !log updating mw1261 to HHVM 3.18.2+wmf5 [14:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:04] (03PS1) 10Ayounsi: Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) [15:09:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3332563 (10fgiunchedi) [15:10:48] (03CR) 10Volans: "The puppet part looks good, a minor comment inline. I'm not familiar with Rancid to review that part." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi) [15:11:13] !log Upgrading rancid to 3 - T167288 [15:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:23] T167288: Rancid improvements - https://phabricator.wikimedia.org/T167288 [15:11:35] (03CR) 10Volans: "It will be nice to have the results of a puppet compiler too to verify" [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi) [15:14:29] 10Operations, 10Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#3332571 (10ema) I've collected 60s of ICMP traffic from GCE on the load balancers and sent a report through https://support.google.com/code/contact/cloud_platform_report?hl=en. I've also added a c... [15:15:22] thanks ema! 
[15:15:32] (03PS2) 10Alexandros Kosiaris: lvs: Remove all bgp keywords from configuration [puppet] - 10https://gerrit.wikimedia.org/r/356790 [15:19:41] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 49.41 seconds [15:20:12] elukey: thanks for the deployment-prep aqs re-init [15:20:21] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:20:22] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:20:57] elukey: i can't believe it, but that seems to have fixed the restbase cluster [15:21:01] (03PS2) 10Ayounsi: Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) [15:21:11] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker [15:21:12] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:21:53] elukey: i was honestly expecting to have to re-init both [15:22:42] !log updating mw1262-mw1265 to HHVM 3.18.2+wmf5 [15:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:53] (03CR) 10Ayounsi: Rancid improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi) [15:24:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] scap: fix rubocop warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357720 (owner: 10Faidon Liambotis) [15:24:47] urandom: Marko did the same work in the restbase cluster :) [15:24:54] no magic sadly :) [15:25:29] (03CR) 10Alexandros Kosiaris: [C: 031] hiera_lookup: fix rubocop warnings [puppet] - 10https://gerrit.wikimedia.org/r/357719 (owner: 10Faidon Liambotis) [15:26:11] oooooh [15:26:26] elukey: actually, that makes me feel somewhat better [15:26:50] because the more i thought 
about it, the more i was having a hard time believing it [15:27:00] mobrovac: and, thank you :) [15:27:03] <_joe_> akosiaris: before you merge https://gerrit.wikimedia.org/r/356790 [15:27:13] <_joe_> akosiaris: https://gerrit.wikimedia.org/r/357824 [15:27:15] <_joe_> :) [15:27:15] 10Operations, 10ops-eqiad, 10Dumps-Generation: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3332602 (10Cmjohnson) [15:27:30] (03CR) 10Alexandros Kosiaris: [C: 031] Bump puppet & rake versions in the Gemfile [puppet] - 10https://gerrit.wikimedia.org/r/357810 (owner: 10Faidon Liambotis) [15:27:35] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3332608 (10Cmjohnson) [15:27:53] <_joe_> akosiaris: still working on that unholy mess btw [15:28:07] _joe_: hehe, should I rebase on top of yours ? [15:28:15] 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3332611 (10Cmjohnson) [15:28:27] <_joe_> akosiaris: I'd say wait until I tell you so [15:28:30] ok [15:28:53] <_joe_> it shouldn't take too long, but in the end I hope to be able to reduce a little bit of the 4-level entanglement that those classes are [15:29:01] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3332619 (10Cmjohnson) [15:29:32] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3332623 (10Cmjohnson) [15:29:47] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3332624 (10Cmjohnson) [15:30:15] (03PS2) 10Giuseppe Lavagetto: role::lvs::balancer: 
convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 [15:30:24] <_joe_> tbh, the lvs hierarchy is my last white whale of our puppet repo [15:31:00] <_joe_> well, excluding analytics, ci and labs, where I read the sign "here be dragons" and I didn't look back :P [15:31:31] <_joe_> but really, those classes defied most of my previous attempt to untangle them [15:31:43] akosiaris: i think it's better honestly [15:31:48] _joe_: don't you like my work? [15:32:02] <_joe_> mark: for puppet 0.25? yes [15:32:04] paravoid: ? [15:32:05] <_joe_> :) [15:32:06] ;-) [15:32:12] akosiaris: the 0o [15:32:14] paravoid: please tell you don't refer to 0o [15:32:18] (03CR) 10Ayounsi: "Compiler is happy: https://puppet-compiler.wmflabs.org/6699/" [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi) [15:32:22] oh come on [15:32:29] where have you ever seen this syntax ? [15:32:45] "prefixing with 0" automatically means octal is very evil too [15:33:05] yeah but that practically happens since the epoch [15:33:36] introducing a new weird syntax to solve the evilness of prefixing with 0 is not exactly solving the problem [15:34:18] python 3 did the same btw [15:34:22] and removed the old syntax entirely [15:34:57] In [1]: 02755 [15:34:57] File "", line 1 [15:34:57] 02755 [15:34:57] ^ [15:34:57] SyntaxError: invalid token [15:34:59] python3 force you to do iit ;) [15:35:00] In [2]: 0o2755 [15:35:02] Out[2]: 1517 [15:35:05] fyi :) [15:35:07] yeah [15:35:13] ruby was nicer and actually made it a style guide thing [15:35:38] the rationale is that this: [15:35:39] In [1]: 02755 [15:35:39] Out[1]: 1517 [15:35:42] is really super confusing [15:35:52] leading zeros are totally legit in decimals [15:36:07] omg .. python3 does that ? 
[15:36:18] (03PS3) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 [15:36:18] yup :) [15:36:20] one more reason I won't care much about it for the next 10 years [15:36:27] wanna bet? [15:36:30] rotfl [15:36:37] well... data is with me [15:36:43] python3 is around for 10 years already [15:36:58] yet python 2 is still around [15:36:58] you know python 2 will go EOL in 2020 right :) [15:37:20] (03PS12) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) [15:37:21] yeah I am wondering already how many times that will be extended [15:37:21] buster (stretch+1) probably won't ship with python 2 [15:37:33] doubt it will [15:37:37] <_joe_> paravoid: so we have to convert pybal to something samer [15:37:43] <_joe_> *saner [15:37:45] writing new code in python2 is a bad idea [15:38:03] s/in python2// [15:38:08] there.. fixed that for ya :P [15:38:13] akosiaris: you can use int('02755', 8) if you prefer :-P [15:38:19] the libraries have caught up which was the big knock on python3 for years [15:38:24] akosiaris: :) [15:38:25] bd808, they said that of my fortran code... [15:38:25] (03CR) 10jerkins-bot: [V: 04-1] role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto) [15:38:26] not all of them [15:38:27] <_joe_> bd808: not really? [15:38:29] but most of them! [15:38:50] volans: overall I prefer not to use octals [15:38:50] <_joe_> I keep stumbling in python2-only things [15:39:05] chmod 1517? 
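[editor's note] The octal exchange above can be checked with a few lines of Python 3. This is only a sketch: the helper name `parse_mode` is made up for illustration, while `int(text, 8)`, the `0o` prefix and the `stat` constants are standard Python.

```python
import stat

def parse_mode(text):
    """Parse a chmod-style octal string such as '02755' into an int."""
    return int(text, 8)

mode = parse_mode("02755")

# The same value written three ways: octal string, 0o literal, decimal.
assert mode == 0o2755 == 1517

# Python 3 prints octal with the explicit 0o prefix.
assert oct(mode) == "0o2755"

# The leading 2 in 2755 is the setgid bit.
assert mode & stat.S_ISGID

# Every bit set here is a permission bit (no file-type bits).
assert stat.S_IMODE(mode) == mode
```

In Python 3 a bare `02755` is a syntax error, so passing decimal 1517 to something like chmod by accident is harder to do silently; `int('02755', 8)` remains available when the mode arrives as text.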
[15:39:15] (03CR) 10Volans: [C: 031] "LGTM for the puppet side, I left to the netops folks the rancid config ;)" [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi) [15:39:16] any library that doesn't support python3 is either dead tech or has a community that is out of touch with the rest of the world [15:39:21] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 117.62, 100.06, 77.25 [15:39:33] come on :) [15:39:37] (03PS3) 10Ayounsi: Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) [15:39:40] Not trying to throw shame here, just my experience [15:39:44] In [1]: 02755 [15:39:44] Out[1]: 1517 [15:39:45] er [15:39:48] https://github.com/cea-hpc/clustershell/commits/master :) [15:40:17] lol [15:40:25] also, diamond is python2 [15:40:27] and I think ansible too? [15:40:43] and you are convincing me of which point I made? ;) [15:40:45] yes [15:40:48] hey look http://py3readiness.org/ :) [15:41:25] I see a few very importants tools to us in that list [15:41:42] ansible, carbon, diamond lol [15:41:43] for instance python-ldap is on that list as py2 only, but ldap3 is way way nicer to use [15:41:43] but yeah you can at least use python3-mostly-compatible code if your library doesn't support it [15:42:19] (03CR) 10Ayounsi: [C: 032] Rancid improvements [puppet] - 10https://gerrit.wikimedia.org/r/357825 (https://phabricator.wikimedia.org/T167288) (owner: 10Ayounsi) [15:42:31] (03CR) 10Mforns: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [15:42:50] bd808: true but the fact ldap3 is not in those 360 most popular packages says something [15:43:21] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 48.37, 77.47, 73.74 [15:43:43] ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167426 [15:43:47] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167426#3332659 (10ops-monitoring-bot) [15:43:49] we're getting there I think [15:43:52] 343/360 is pretty good [15:44:09] yeah but my guess is a long tail distribution [15:44:22] which means those few left might very well end up being done around 2020 [15:44:25] godog: so ms-be1019 is making the BBU check flapping that in turn trigger icinga that trigger the raid_handler [15:44:43] or never for that matter [15:44:44] what you want to do? downtime, disable notification, fix the issue :D [15:44:58] akosiaris: it says that people don't write new code that talks to ldap very often I think, by YMMV [15:45:13] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3332664 (10Gilles) It seems like this type... 
[15:45:15] (03PS4) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 [15:45:20] volans: heh I'll downtime for now, we're waiting the replacement battery from hp [15:45:44] eheheh ok :) [15:45:46] thanks [15:46:10] np [15:46:16] bd808: yeah but it also says that there are enough projects out there that are still using python-ldap. My guess is they will still do until they can't do otherwise [15:46:19] striker and other things I've written use ldap3 because besides being nicer to work with it is a pure python lib and not a wrapper around libldap so it's nicer for virtualenvs [15:47:06] (03CR) 10jerkins-bot: [V: 04-1] role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto) [15:48:56] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/6703/lvs1001.wikimedia.org/ it's a noop, I'll fix the validation in the next ps but this can basically" [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto) [15:49:40] (03PS1) 10Cmjohnson: Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836 [15:51:21] (03PS2) 10Cmjohnson: Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836 [15:51:52] (03CR) 10Cmjohnson: [C: 032] Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836 (owner: 10Cmjohnson) [15:52:25] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/357841 [15:53:01] (03PS3) 10Cmjohnson: Adding productin dns for analytics1069 T162216 [dns] - 10https://gerrit.wikimedia.org/r/357836 [15:53:14] (03CR) 10Ema: "> https://puppet-compiler.wmflabs.org/6703/lvs1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto) [15:55:53] 10Operations, 10Performance-Team, 
10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3332697 (10Gilles) Nevermind, it's because... [15:56:46] (03CR) 10RobH: [C: 032] DNS: Add mgmt and production DNS for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/357841 (owner: 10Papaul) [15:56:50] (03PS2) 10RobH: DNS: Add mgmt and production DNS for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/357841 (owner: 10Papaul) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1600). Please do the needful. [16:00:04] bmansurov and AaronSchulz: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] here [16:01:48] bmansurov: I'll take a look [16:01:54] thanks [16:03:29] bmansurov: this is scheduled in puppet swat but it is for mediawiki-config ? [16:03:43] oops, my bad [16:04:08] i'll move it to the morning swat [16:04:50] bmansurov: I can deploy it though if it is urgent? [16:05:09] godog, it's not urgent, but I'd appreciate it if you can deploy it [16:06:09] (03PS5) 10Giuseppe Lavagetto: role::lvs::balancer: convert to role/profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/357824 [16:06:31] bmansurov: ok! 
if it isn't an emergency I'd rather go through morning swat [16:06:45] ok, that sounds good too [16:08:31] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2161528 [16:09:17] AaronSchulz: merging https://phabricator.wikimedia.org/T165651 [16:09:19] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3332797 (10RobH) [16:09:53] (03PS5) 10Filippo Giunchedi: Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [16:12:28] (03CR) 10Filippo Giunchedi: [C: 032] Set cron script to dump MediaWiki DB lag times into statsd [puppet] - 10https://gerrit.wikimedia.org/r/354138 (https://phabricator.wikimedia.org/T149210) (owner: 10Aaron Schulz) [16:14:39] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3332839 (10Papaul) @chasemp do I have to put labtestnet2002 both eth0 and eth1 under labs-hosts1-b-codfw network or just plug eth1 and not put it under that network? Same... [16:21:39] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3332874 (10RobH) IRC Update: We chatted about this, basically he wanted to know if we had to setup dns for both interfaces (eth0 and eth1) prior to installation. Since on... 
[16:23:10] (03CR) 10BBlack: role::lvs::balancer: convert to role/profile (step 1) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357824 (owner: 10Giuseppe Lavagetto) [16:23:11] 10Operations, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3332875 (10Nuria) [16:23:42] ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T167434 [16:23:51] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T167434#3332878 (10ops-monitoring-bot) [16:24:05] !log cp1074: varnish-backend-restart for mailbox lag [16:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:52] (03PS7) 10BBlack: numa_networking: add facter data from sysfs [puppet] - 10https://gerrit.wikimedia.org/r/355809 [16:24:54] (03PS8) 10BBlack: numa_networking: support NUMA in interface::rps [puppet] - 10https://gerrit.wikimedia.org/r/355810 [16:24:56] (03PS8) 10BBlack: numa_networking: support NUMA in tlsproxy nginx config [puppet] - 10https://gerrit.wikimedia.org/r/355811 [16:24:58] (03PS1) 10BBlack: numa_networking: test enable on cp4021 [puppet] - 10https://gerrit.wikimedia.org/r/357844 [16:25:12] volans: not sure what to do about the ms-be1019 alert, downtime didn't work apparently [16:25:43] godog: looking [16:26:45] godog: maybe the icinga restart in the middle has to do with it? 
[16:26:46] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=ms-be1019&service=HP+RAID [16:26:51] I'd suggest disable notification [16:27:21] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3332904 (10elukey) @Cmjohnson sorry to ping :) Any idea if we have a spare BBU for db1046? [16:28:30] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [16:28:46] volans: good idea, I'll do that [16:28:55] godog: actually no [16:29:00] there is a disable event_handler [16:29:10] i completely forgot about it [16:29:12] sorry [16:29:30] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3332932 (10Cmjohnson) @elukey. Yes, I have another decommissioned r510 to take it from. Ping you in a hour or so to replace [16:29:32] the handler handling is separated from the notification handling [16:29:48] ack, done the handler disabiling [16:29:56] that's probably why, the downtime suppress notification and not handlers [16:29:58] we'll need to find a way to remember to re-enable it [16:30:00] I guess [16:30:03] yeah [16:30:09] that's the hard part [16:30:46] meta-alert about disabled handlers to the rescue [16:31:07] lol [16:33:02] !log delete net.ifnames for ms-be2001 and ms-be2013 - T158429 [16:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:14] T158429: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429 [16:35:20] godog: nice! [16:36:02] paravoid: yeah! quite straightforward really, delete the argument from /etc/default/grub ; update-grub and s/eth0/eno1/ in /etc/network/interfaces [16:36:14] no nefarious effects observed yet [16:37:33] godog: 70-persistent-net.rules ? [16:37:45] is replaced by this? [16:38:29] looks like, nice! 
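[editor's note] A rough sketch of the two text edits described above (drop the `net.ifnames` argument from the GRUB command line, then rename eth0 to eno1 in the interfaces file). It only transforms strings; the function names, the `net.ifnames=0` value and the double-quoted GRUB line are assumptions for illustration, and running update-grub is left out entirely.

```python
import re

def drop_kernel_arg(grub_line, arg="net.ifnames"):
    """Remove `arg` or `arg=value` from a KEY="a b c" style line.

    Assumes the value is wrapped in double quotes, as is common in
    /etc/default/grub; lines without '=' are returned unchanged.
    """
    key, sep, value = grub_line.partition("=")
    if not sep:
        return grub_line
    tokens = value.strip().strip('"').split()
    kept = [t for t in tokens if t != arg and not t.startswith(arg + "=")]
    return '{}="{}"'.format(key, " ".join(kept))

def rename_interface(interfaces_text, old="eth0", new="eno1"):
    """Replace whole-word occurrences of `old`, like s/eth0/eno1/."""
    return re.sub(r"\b{}\b".format(re.escape(old)), new, interfaces_text)

# Hypothetical sample inputs, not taken from the hosts in the log.
grub = 'GRUB_CMDLINE_LINUX="console=ttyS1 net.ifnames=0 quiet"'
ifaces = "auto eth0\niface eth0 inet dhcp\n"

print(drop_kernel_arg(grub))
print(rename_interface(ifaces), end="")
```

The word-boundary match keeps names such as `eth01` untouched while still renaming alias or VLAN forms like `eth0:1` and `eth0.100`.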
[16:40:17] !log nuria@tin Started deploy [analytics/refinery@2fbed63]: (no justification provided) [16:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] aye [16:41:43] 10Operations, 10ops-eqiad, 10Analytics-Kanban, 10DBA, 10User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3332986 (10elukey) @Cmjohnson thanks! Would it be possible to do the swap next week? Since this is an important DB I'd need to coordinate my team and Jaime/Manuel first. [16:44:26] !log nuria@tin Finished deploy [analytics/refinery@2fbed63]: (no justification provided) (duration: 04m 08s) [16:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:19] (03PS1) 10Jdlrobson: Update logo and dimensions for SR wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) [16:49:37] (03PS3) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [16:51:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3333018 (10faidon) Thanks for all the detailed responses from all of you, it's really appreciated. It's also great to see Docker patches proposed... 
[16:52:19] (03CR) 10jerkins-bot: [V: 04-1] gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 (owner: 10Dzahn) [16:53:07] (03PS5) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [16:54:00] (03PS1) 10BBlack: numa_networking: remove install-time bnx2x stuff [puppet] - 10https://gerrit.wikimedia.org/r/357850 [16:54:27] (03CR) 10jerkins-bot: [V: 04-1] gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [16:54:31] (03PS4) 10Dzahn: gerrit: dont let sshd listen on all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/354074 [16:55:51] (03PS6) 10Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - 10https://gerrit.wikimedia.org/r/354078 [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1700). 
[17:01:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Add listening to localhost too, or bad things will happen(TM)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [17:02:37] Nothing for ORES [17:02:39] SOON [17:03:10] (03CR) 10Bmansurov: Update logo and dimensions for SR wordmark (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [17:07:50] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch errors about BulkShardRequest - https://phabricator.wikimedia.org/T167091#3333065 (10debt) p:05Triage>03Normal [17:09:30] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 122.22, 100.48, 78.95 [17:11:52] (03CR) 10Paladox: gerrit: let Apache proxy only listen on service IP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354078 (owner: 10Dzahn) [17:12:24] (03CR) 10Paladox: gerrit: switch to base::service_unit and systemd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356516 (owner: 10Dzahn) [17:13:31] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Services-next, 10Security-General, 10Services (next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#3333190 (10Fjalapeno) [17:13:39] 10Operations, 10Performance-Team, 10Reading-Infrastructure-Team-Backlog, 10Reading-Web-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3333191 (10Fjalapeno) [17:18:30] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 59.84, 77.19, 77.85 [17:26:13] (03PS1) 10Andrew Bogott: designate.conf: Replace identity_uri setting [puppet] - 10https://gerrit.wikimedia.org/r/357853 [17:29:31] (03PS2) 10Andrew Bogott: designate.conf: Replace identity_uri setting [puppet] - 
10https://gerrit.wikimedia.org/r/357853 [17:31:41] (03CR) 10Andrew Bogott: [C: 032] designate.conf: Replace identity_uri setting [puppet] - 10https://gerrit.wikimedia.org/r/357853 (owner: 10Andrew Bogott) [17:36:43] !log arlolra@tin Started deploy [parsoid/deploy@f82cb4f]: Updating Parsoid to 108eed81 [17:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:45] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333335 (10Gilles) [17:41:52] 10Operations, 10netops, 10Patch-For-Review: Rancid improvements - https://phabricator.wikimedia.org/T167288#3333354 (10ayounsi) > I think there's a lot of value in doing so. Agreed on the rest. Converted! > Upgrade to 3.6.2 Done > Switch from CVS to GIT Done > Replace password auth with ssh key auth Done,... 
[17:44:55] (03PS1) 10Jdlrobson: Undeploy Cards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357858 (https://phabricator.wikimedia.org/T167452) [17:45:04] (03CR) 10Jdlrobson: [C: 04-1] "Need to wait until next Thursday deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357858 (https://phabricator.wikimedia.org/T167452) (owner: 10Jdlrobson) [17:46:09] (03PS1) 10Cmjohnson: Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076 [puppet] - 10https://gerrit.wikimedia.org/r/357860 [17:46:55] !log arlolra@tin Finished deploy [parsoid/deploy@f82cb4f]: Updating Parsoid to 108eed81 (duration: 10m 12s) [17:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:23] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3333443 (10Cmjohnson) [17:49:33] (03CR) 10RobH: [C: 032] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, [puppet] - 10https://gerrit.wikimedia.org/r/357860 (owner: 10Cmjohnson) [17:51:01] (03PS2) 10Cmjohnson: Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076 [puppet] - 10https://gerrit.wikimedia.org/r/357860 [17:51:05] (03CR) 10Cmjohnson: [V: 032] Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, [puppet] - 10https://gerrit.wikimedia.org/r/357860 (owner: 10Cmjohnson) [17:55:58] !log Updated Parsoid to 108eed81 (T136653, T167081) [17:56:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:09] T167081: Broken rendering of inserted template (incorrect comment stripping for style attribute) - https://phabricator.wikimedia.org/T167081 [17:56:09] T136653: Parsoid doesn't recognize interwiki shortcuts in the href attribute - https://phabricator.wikimedia.org/T136653 [17:56:22] (03PS1) 10Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/357863 [17:57:20] PROBLEM - Apache HTTP on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:21] PROBLEM - HHVM rendering on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:52] (03PS2) 10Giuseppe Lavagetto: [WiP] role::lvs::balancer: refactor to role/profile (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/357863 [17:58:10] RECOVERY - Apache HTTP on mw2131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.071 second response time [17:58:11] RECOVERY - HHVM rendering on mw2131 is OK: HTTP OK: HTTP/1.1 200 OK - 75442 bytes in 0.153 second response time [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1800). Please do the needful. [18:00:05] James_F, bmansurov, MatmaRex, and MaxSem: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:14] I can deploy [18:00:22] Heya. [18:00:28] hi [18:00:32] here [18:00:47] MaxSem: it looks like wmf.4 is not actually live anywhere right now, correct? 
[18:01:06] Not true ^ [18:01:11] group0 [18:01:18] https://www.mediawiki.org/wiki/Special:Version says wmf.2 [18:01:22] http://tools.wmflabs.org/versions/ [18:01:37] MaxSem: MatmaRex test wikis but not mw.org [18:01:51] https://phabricator.wikimedia.org/source/mediawiki-config/browse/master/wikiversions.json;8d49ecd178ac84beb247ffca51c967f8a0e6c1dc$757 [18:01:53] ah, i see. https://test.wikipedia.org/wiki/Special:Version is wmf.4 [18:01:56] I thought it moved forward yesterday on my day off, guess I misread the bugmail I had [18:02:25] we're clear to move forward today though, aude fixed up the last major logspam blocker [18:02:43] (03PS6) 10MaxSem: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 (owner: 10Jforrester) [18:02:48] (03CR) 10MaxSem: [C: 032] Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 (owner: 10Jforrester) [18:03:39] bleh Catchable fatal error: Argument 1 passed to EditPage::displayViewSourcePage() [18:03:47] (03Merged) 10jenkins-bot: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 (owner: 10Jforrester) [18:03:49] must implement interface Content, null given [18:04:05] MaxSem: Where's that from? [18:04:14] from fatalmonitor [18:04:34] I mean, from prod or from Beta Cluster? [18:04:35] helpful answer MaxSem [18:04:35] :P [18:04:40] PROBLEM - Check systemd state on install2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:04:50] Is it suddenly arising or has it been happening for a while? [18:05:12] #til people still use `fatalmonitor` [18:05:17] don't remember it yesterday [18:05:27] MaxSem: https://phabricator.wikimedia.org/T161199 ? 
[18:05:47] RainbowSprinkles, it gives you better reaction time than kibana [18:05:54] (03CR) 10jenkins-bot: Beta Features: Update last-big-change-plus-six-month dates in comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354731 (owner: 10Jforrester) [18:06:10] yep, MatmaRex [18:06:42] well, it "Needs Triage", but it's been happening for two months and a change [18:07:08] James_F, pulled on mwdebug1002 [18:08:20] MaxSem: I haven't logged into fluorine (or whatever we're using now) in about a year or two ;-) [18:08:29] MaxSem: Looks good. [18:08:37] it's mwlog1001 now [18:09:29] Either way, haven't needed it :p [18:09:52] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/354731/6 (duration: 00m 44s) [18:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:01] there's a deploy now? I completely failed there by lining stuff up for later :/ [18:10:28] jdlrobson: Yup, 11:00 SF Monday/Wednesday/Thursdays. [18:10:30] PROBLEM - very high load average likely xfs on ms-be1019 is CRITICAL: CRITICAL - load average: 136.10, 106.99, 79.57 [18:10:36] bmansurov, yt? [18:10:40] yeh i see that now :) [18:10:47] MaxSem, yes [18:10:58] jdlrobson: The fact that it's /not/ there on Tuesdays (because of the train) trips people up. :-) [18:11:02] MaxSem: if you have space, i've got two for this evening which can go out now instead and save whoevers doing them later the hassle if you like [18:11:25] if we have time [18:12:13] (03PS4) 10MaxSem: Enable ElectronPdf on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: 10Bmansurov) [18:12:17] (03CR) 10MaxSem: [C: 032] Enable ElectronPdf on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: 10Bmansurov) [18:13:37] MaxSem: ill update wik... 
thanks :) [18:13:42] (03Merged) 10jenkins-bot: Enable ElectronPdf on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: 10Bmansurov) [18:14:30] bmansurov, pulled on mwdebug1002 [18:14:38] ok checking [18:15:45] MaxSem, works [18:15:52] milimetric, I still see that dashiki error in logs [18:15:54] (03CR) 10jenkins-bot: Enable ElectronPdf on all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/356881 (https://phabricator.wikimedia.org/T165954) (owner: 10Bmansurov) [18:17:30] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/356881/4 (duration: 00m 44s) [18:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:38] bmansurov, ^ [18:17:53] MaxSem, thanks! [18:18:55] thx MaxSem, just looking at it (back from medical leave today) [18:20:55] (03PS3) 10MaxSem: Remove semanticness from another place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) [18:21:20] (03CR) 10MaxSem: [C: 032] Remove semanticness from another place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: 10MaxSem) [18:22:50] PROBLEM - Check systemd state on install1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:24:20] (03Merged) 10jenkins-bot: Remove semanticness from another place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: 10MaxSem) [18:25:49] (03CR) 10jenkins-bot: Remove semanticness from another place [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352985 (https://phabricator.wikimedia.org/T53642) (owner: 10MaxSem) [18:25:53] !log maxsem@tin Synchronized multiversion/submodules.json: https://gerrit.wikimedia.org/r/#/c/352985/3 (duration: 00m 43s) [18:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:41] dear Zuul, did you not get enough human sacrifice already? [18:29:30] RECOVERY - very high load average likely xfs on ms-be1019 is OK: OK - load average: 55.26, 66.21, 79.54 [18:31:01] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333560 (10Gilles) Note to self: SVG needs... [18:34:03] MatmaRex, pulled on mwdebug1002 [18:34:40] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[isc-dhcp-server] [18:34:52] MaxSem: yupp. works fine [18:36:31] !log maxsem@tin Synchronized php-1.30.0-wmf.4/includes/EditPage.php: https://gerrit.wikimedia.org/r/#/c/357855/ (duration: 00m 45s) [18:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:56] MatmaRex, ^ [18:37:00] thanks MaxSem [18:37:53] is there an app/command that provides page faults as a rate? [18:38:07] or do i need to grok this out of ps or something? 
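[editor's note] On the page-fault question above: a minimal sketch, assuming a Linux host where /proc/vmstat exposes the cumulative `pgfault` and `pgmajfault` counters. A rate is then just the difference between two snapshots divided by the interval. All function names here are made up for illustration.

```python
import time

def parse_vmstat(text):
    """Turn 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def fault_rates(before, after, seconds):
    """Per-second page-fault rates between two counter snapshots."""
    return {
        name: (after[name] - before[name]) / seconds
        for name in ("pgfault", "pgmajfault")
    }

def sample_rates(interval=1.0, path="/proc/vmstat"):
    """Linux only: read the counters twice, `interval` seconds apart."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return fault_rates(before, after, interval)
```

Calling `sample_rates(1.0)` on a Linux box returns a dict with the two rates; `pgmajfault` is the one that involves going to disk, which is the storage-related sense of "page" meant in the question.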
[18:38:25] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - 10https://gerrit.wikimedia.org/r/357869 [18:38:26] urandom: i'd assume theres an api maybe the amount of views of 404 pages or special:badtitle? [18:38:54] Zppix: sorry, i should have said 'linux page faults' [18:39:11] where page is in the context of disk/storage [18:39:13] urandom: oh in that case i have no clue xD [18:40:08] 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3333601 (10RobH) [18:40:33] (03PS1) 10Cmjohnson: Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 T165173 T166264 T165531 T165520 T162216 T166076 [dns] - 10https://gerrit.wikimedia.org/r/357870 [18:40:38] !log maxsem@tin Synchronized php-1.30.0-wmf.4/extensions/LoginNotify/: https://gerrit.wikimedia.org/r/#/c/357743/ (duration: 00m 44s) [18:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:52] !log built gerrit_2.13.8+git1-wmf.5 on copper (T158946) [18:42:59] w00t [18:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:02] T158946: Update gerrit to 2.13.8 - https://phabricator.wikimedia.org/T158946 [18:43:32] got it: sar -B 1 10 [18:43:42] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - 10https://gerrit.wikimedia.org/r/357869 [18:45:26] (03PS2) 10Cmjohnson: Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and stat1005/6 T165366 T165368 T165173 T166264 T165531 T165520 T162216 T166076 [dns] - 10https://gerrit.wikimedia.org/r/357870 [18:47:20] (03CR) 10Cmjohnson: [C: 032] Adding production dns for several new servers, wtp1025-48, ganeti1005-1008, kubestage1001/1002, dumpsdata1001/2, labvirt1015-18 and 
stat1005 [dns] - 10https://gerrit.wikimedia.org/r/357870 (owner: 10Cmjohnson) [18:49:33] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3333632 (10Cmjohnson) [18:49:43] !log Restarting Cassandra, restbase-dev1001-a to test alternative disk access mode [18:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:17] 10Operations, 10ops-eqiad, 10Dumps-Generation, 10Patch-For-Review: rack/setup/install dumpsdata100[12] - https://phabricator.wikimedia.org/T165173#3259047 (10Cmjohnson) a:05Cmjohnson>03RobH Mac address has been added to dhcpd file. Assigning to @robh for install [18:50:38] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3333636 (10Cmjohnson) [18:52:00] PROBLEM - puppet last run on install1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[isc-dhcp-server] [18:52:02] jdlrobson, pulled on mwdebug1002 [18:53:13] MaxSem: testing [18:53:51] MaxSem: checked! LGTM [18:55:29] !log maxsem@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:08] MaxSem: Was that mw12$something? 
[18:56:21] (1/11 - just retry, probably transient) [18:56:27] RainbowSprinkles, mw1279 [18:56:38] That one blew up on me Tuesday too [18:56:42] Retrying made it shut up [18:58:55] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement stat1006 (stat1003 replacement) - https://phabricator.wikimedia.org/T165366#3333657 (10Cmjohnson) a:05Cmjohnson>03RobH @robh added mac address to dhcpd already, verified on switch that it's... [18:59:28] 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3333675 (10Cmjohnson) [18:59:40] !log maxsem@tin Synchronized php-1.30.0-wmf.4/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/357846/ (duration: 00m 49s) [18:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:57] jdlrobson, ^ [18:59:59] 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3290332 (10Cmjohnson) a:05akosiaris>03RobH @robh added mac address already [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T1900). [19:00:11] w00p thankkk you [19:00:19] jdlrobson, no time for another patch [19:00:29] RainbowSprinkles, all yours [19:00:34] no? booo :( [19:00:44] What was jdlrobsons? [19:01:17] Oh, that simple wordmark thing? [19:01:22] (03PS2) 10Chad: Update logo and dimensions for SR wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:01:32] (03CR) 10Chad: [C: 032] Update logo and dimensions for SR wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:01:41] :-O it's christmas! 
[19:01:58] I can jfdi faster than you can add it to the next swat window ;-) [19:02:27] (03Merged) 10jenkins-bot: Update logo and dimensions for SR wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:02:41] (03CR) 10jenkins-bot: Update logo and dimensions for SR wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357848 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:03:32] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labtestnet2002 - https://phabricator.wikimedia.org/T167159#3333690 (10RobH) [19:03:34] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw:labtestnet2002 switch port configuration - https://phabricator.wikimedia.org/T167322#3333688 (10RobH) 05Resolved>03Open Nevermind, I had a bad config and it didn't commit. I need to investiage and redo the change. [19:03:52] !log demon@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-sr.svg: new wordmark (duration: 00m 46s) [19:03:54] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3333691 (10Cmjohnson) [19:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:05] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestneutron2002 sswitch port configuration - https://phabricator.wikimedia.org/T167326#3333692 (10RobH) a:05Papaul>03RobH [19:05:16] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: New wordmark for mk/srwiki (duration: 00m 57s) [19:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] jdlrobson: You're live everywhere :) [19:06:08] 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333700 
(10RobH) [19:06:20] (03PS1) 10Chad: mw.org back to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357872 [19:06:30] 10Operations, 10ops-eqiad, 10Labs, 10Labs-Infrastructure, 10Patch-For-Review: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3333702 (10Cmjohnson) mac addresses were added to dhcpd file, not sure if h/w raid is needed..i believe these came with a controller. Also @mark was... [19:07:35] RainbowSprinkles: did you do a static cache flush? [19:07:46] Nope? [19:08:09] https://sr.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-sr.svg?r=4 vs https://sr.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-sr.svg [19:08:12] seeing different things [19:08:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3333707 (10Cmjohnson) [19:09:07] jdlrobson: Uno momento [19:09:15] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10Cmjohnson) a:05Cmjohnson>03RobH [19:09:37] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3333709 (10Cmjohnson) [19:09:48] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3268863 (10Cmjohnson) a:05Cmjohnson>03RobH [19:09:54] Hmmm [19:10:00] RainbowSprinkles: i dont recall how to do that though.. [19:10:06] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3333711 (10Cmjohnson) [19:10:06] purgeList? 
[19:10:07] I threw it into purgeList [19:10:17] im seeing new asset now [19:10:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3284602 (10Cmjohnson) a:05Cmjohnson>03RobH [19:11:03] jdlrobson: Ok we good then? :) [19:11:11] (03PS1) 10ArielGlenn: script for retrieving raw flow revision content [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/357873 [19:11:22] (03CR) 10Chad: [C: 032] mw.org back to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357872 (owner: 10Chad) [19:11:31] RainbowSprinkles: i think so. Worse case will have to wait for cache to update. If you did something i think it worked [19:11:43] thanks MaxSem and RainbowSprinkles :) [19:12:39] (03Merged) 10jenkins-bot: mw.org back to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357872 (owner: 10Chad) [19:12:50] (03CR) 10jenkins-bot: mw.org back to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357872 (owner: 10Chad) [19:13:24] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: mw.org -> wmf.4 [19:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:05] (03PS1) 10RobH: setting install params for kubestage100[12] [puppet] - 10https://gerrit.wikimedia.org/r/357874 [19:16:32] 10Operations, 10ops-eqiad, 10Kubernetes, 10Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333723 (10RobH) [19:16:49] (03CR) 10RobH: [C: 032] setting install params for kubestage100[12] [puppet] - 10https://gerrit.wikimedia.org/r/357874 (owner: 10RobH) [19:17:08] (03PS1) 10Andrew Bogott: Designate api: Increase max query limit [puppet] - 10https://gerrit.wikimedia.org/r/357875 [19:19:10] (03CR) 10Andrew Bogott: [C: 032] Designate api: Increase max query limit [puppet] - 10https://gerrit.wikimedia.org/r/357875 (owner: 10Andrew Bogott) [19:19:14] (03PS2) 10Andrew Bogott: Designate api: Increase max 
query limit [puppet] - 10https://gerrit.wikimedia.org/r/357875 [19:21:16] (03PS1) 10Chad: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357876 [19:24:19] (03CR) 10Chad: [C: 032] group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357876 (owner: 10Chad) [19:25:34] (03Merged) 10jenkins-bot: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357876 (owner: 10Chad) [19:25:49] (03CR) 10jenkins-bot: group1 to wmf.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357876 (owner: 10Chad) [19:26:46] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.4 [19:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:34] (03PS1) 10RobH: Revert "setting install params for kubestage100[12]" [puppet] - 10https://gerrit.wikimedia.org/r/357877 [19:27:40] (03CR) 10RobH: [C: 032] Revert "setting install params for kubestage100[12]" [puppet] - 10https://gerrit.wikimedia.org/r/357877 (owner: 10RobH) [19:28:48] (03PS2) 10RobH: Revert "setting install params for kubestage100[12]" [puppet] - 10https://gerrit.wikimedia.org/r/357877 [19:28:52] (03CR) 10RobH: [V: 032 C: 032] Revert "setting install params for kubestage100[12]" [puppet] - 10https://gerrit.wikimedia.org/r/357877 (owner: 10RobH) [19:30:44] (03PS1) 10Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts [puppet] - 10https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) [19:31:18] (03PS1) 10RobH: Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076" [puppet] - 10https://gerrit.wikimedia.org/r/357879 [19:31:45] (03CR) 10Zhuyifei1999: "https://tools.wmflabs.org/fontcdn/ is not ready." 
[puppet] - 10https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: 10Zhuyifei1999) [19:32:41] (03CR) 10RobH: [C: 032] Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata100 [puppet] - 10https://gerrit.wikimedia.org/r/357879 (owner: 10RobH) [19:32:46] (03PS2) 10RobH: Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076" [puppet] - 10https://gerrit.wikimedia.org/r/357879 [19:33:13] (03CR) 10Zhuyifei1999: tools-static: add /fontcdn/ to reverse-proxy to Google Fonts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/357878 (https://phabricator.wikimedia.org/T110027) (owner: 10Zhuyifei1999) [19:36:53] (03Abandoned) 10RobH: Revert "Adding mac addresses to dhcpd file for several systems, wtp1025-1046, stat1005-1006, ganeti1005-1008, labvirt1015-1018, dumpsdata1001-1002, kubestage1001-1002, analytics1069 task #'s T165173 T165366 T166264 T165531 T165368 T165520 T162216 T166076" [puppet] - 10https://gerrit.wikimedia.org/r/357879 (owner: 10RobH) [19:37:40] (03PS1) 10Cmjohnson: Fixing a typo in dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/357880 [19:38:42] (03CR) 10RobH: [C: 032] Fixing a typo in dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/357880 (owner: 10Cmjohnson) [19:38:50] (03PS2) 10RobH: Fixing a typo in dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/357880 (owner: 10Cmjohnson) [19:41:10] Weird. Something's requesting commons pages in the format of https://commons.wikimedia.org/wiki/File:Map_of_USA_OR.svg?uselang=⧼Lang⧽ [19:41:16] (image seems to be varying) [19:41:27] this is causing exceptions :\ [19:44:15] RainbowSprinkles: new? 
[19:44:24] Hadn't seen it before yet [19:44:33] But not necessarily "new" [19:44:43] 10Operations, 10Performance-Team, 10Thumbor, 10MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), 10Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333819 (10Gilles) OK, I've looked at the... [19:45:11] * greg-g nods [19:45:46] T167359 also seems to be still appearing, fix is incomplete? [19:45:46] T167359: Catchable fatal error: Argument 2 passed to RevisionSliderHooks::onDiffViewHeader() must be an instance of Revision, null given - https://phabricator.wikimedia.org/T167359 [19:45:55] Also spotted T167461 [19:45:56] T167461: SpecialMobileDiff: Call to member function getDiffBody() on non-object - https://phabricator.wikimedia.org/T167461 [19:46:53] Also geosearch queries are throwing PartialShardExceptions :\ [19:47:00] (but doubt this is a wmf.4 problem) [19:47:08] (03PS1) 10Eevans: Use Cassandra version that corresponds to what is being tested [puppet] - 10https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) [19:47:33] 10Operations, 10ops-codfw, 10Labs, 10Labs-Infrastructure, 10netops: codfw: labtestneutron2002 switch port configuration - https://phabricator.wikimedia.org/T167326#3333850 (10Papaul) [19:47:37] (03PS1) 10RobH: reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - 10https://gerrit.wikimedia.org/r/357884 [19:47:49] (03CR) 10jerkins-bot: [V: 04-1] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - 10https://gerrit.wikimedia.org/r/357884 (owner: 10RobH) [19:49:25] mutante: are you around? any chance i could convince you to merge https://gerrit.wikimedia.org/r/357882? it won't impact anything outside of the dev (non-prod) environment, and it'll make the puppet warnings there go away. 
[19:49:38] (and i need to run puppet here :)) [19:50:48] (PS2) RobH: reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 [19:54:11] (CR) RobH: [C: 1] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH) [19:54:23] (CR) Cmjohnson: [C: 1] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH) [19:54:30] (CR) RobH: [C: 2] reverting this patchset, as its introduced some errors into our dhcp system and caused it to fail. [puppet] - https://gerrit.wikimedia.org/r/357884 (owner: RobH) [19:55:52] RECOVERY - Check systemd state on install1002 is OK: OK - running: The system is fully operational [19:56:12] RECOVERY - puppet last run on install1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:57:17] mutante: basically, this is the minimum it'll take to make it happy about that package version currently installed [19:59:18] Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333891 (RobH) Ok, that large patchset had some issues that borked up dhcp. Rather than try to find the issues in the large one, we reverted it and will make small... 
[20:02:32] (PS1) RobH: setting kubestage100[12] install params [puppet] - https://gerrit.wikimedia.org/r/357890 [20:02:42] RECOVERY - Check systemd state on install2002 is OK: OK - running: The system is fully operational [20:02:43] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:02:59] Operations, ops-eqiad, Kubernetes, Patch-For-Review: rack/setup/install kubestage100[12] - https://phabricator.wikimedia.org/T166264#3333901 (RobH) [20:04:05] (CR) RobH: [C: 2] setting kubestage100[12] install params [puppet] - https://gerrit.wikimedia.org/r/357890 (owner: RobH) [20:04:32] urandom: won't it add confusion if then the target_version and the package_version are different versions? [20:04:56] 3.7 => 3.11 [20:06:20] Operations, Performance-Team, Thumbor, MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3333909 (Gilles) On second thought, let'... 
[20:07:03] (CR) Dzahn: Use Cassandra version that corresponds to what is being tested (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans) [20:12:35] !log imported gerrit_2.13.8+git1-wmf.5_amd64 on apt.wikimedia.org (T158946) [20:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:37] T158946: Update gerrit to 2.13.8 - https://phabricator.wikimedia.org/T158946 [20:23:24] !log gerrit2001: upgraded to 2.13.8+git1-wmf.5 / 2.13.8-1-g7c438d37a2 [20:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:17] mutante: well, maybe [20:39:38] mutante: i was trying to keep the changeset minimal considering that this version may yet change [20:39:50] and it's only used in the dev environment [20:41:39] mutante: so we could change the target version, but then there are also files like templates/some_config_file-3.7.erb [20:42:03] so if you extend that need for consistency, you'd probably want to rename them as well [20:42:45] mutante: which i'm willing to do, but i was aiming for least-invasive in the interest of shopping around for someone to merge :) [20:44:28] mutante: we could change target_version to 3.x, and rename all of the files accordingly [21:00:56] (PS2) Eevans: Use Cassandra version that corresponds to what is being tested [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) [21:10:50] (CR) Eevans: "PC output here: http://puppet-compiler.wmflabs.org/6710, it shows 'no change' where it should, though fails to compile on one of the hosts" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans) [21:18:39] urandom: sorry, i was afk. 
reading now, and "minimal changeset" sounds good [21:19:40] heh [21:19:52] i am looking, hold on :) we'll do this [21:20:17] mutante: i updated it; it's slightly less minimal, but should be clearer [21:20:44] and more future-proof for this testing period [21:21:11] ok, yep, i see the compiler output [21:21:29] the one it fails on, it's that for some reason the compiler just doesn't know this host name at all [21:21:36] i can tell because the link is 404 [21:21:41] and not a real failure [21:21:42] right [21:21:48] rest looks good.. doing [21:21:54] mutante: awesome [21:22:18] (PS3) Dzahn: Use Cassandra version that corresponds to what is being tested [puppet] - https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: Eevans) [21:23:08] !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: Rate limiting code and config. T161710 [21:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:19] T161710: Automate RESTBase blacklisting - https://phabricator.wikimedia.org/T161710 [21:24:54] !log ppchelko@tin Finished deploy [changeprop/deploy@56f7511]: Rate limiting code and config. 
T161710 (duration: 01m 46s) [21:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:22] (03CR) 10Dzahn: [C: 032] "affects only -dev, not -prod and -test, they use target_version 2.2" [puppet] - 10https://gerrit.wikimedia.org/r/357882 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [21:26:14] PROBLEM - changeprop endpoints health on scb2001 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:26:18] urandom: merged on master [21:26:26] mutante: great; let me give it a go [21:26:44] PROBLEM - changeprop endpoints health on scb1003 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:26:45] PROBLEM - changeprop endpoints health on scb2005 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:26:45] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:26:45] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:27:04] PROBLEM - changeprop endpoints health on scb1004 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:27:14] PROBLEM - changeprop endpoints health on scb2002 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for 
/sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:27:14] PROBLEM - changeprop endpoints health on scb2004 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:27:16] :( [21:27:24] PROBLEM - changeprop endpoints health on scb2006 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:27:44] PROBLEM - changeprop endpoints health on scb2003 is CRITICAL: /sys/limit/{type}/{key} (test for /sys/limit/{type}/{key}) is CRITICAL: Test test for /sys/limit/{type}/{key} returned the unexpected status 403 (expecting: 200) [21:27:44] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:27:56] Pchelolo: revert? [21:28:26] what the hell is that [21:28:29] I'll revert [21:29:03] !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: dc1948f6bc7b1 [21:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:19] !log ppchelko@tin Finished deploy [changeprop/deploy@56f7511]: dc1948f6bc7b1 (duration: 00m 16s) [21:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:54] RECOVERY - puppet last run on restbase-dev1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [21:31:02] mutante: looks good; thank you! 
[21:31:05] !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: dc1948f6bc7b1 Revert previous deploy [21:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:28] urandom: < icinga-wm> RECOVERY - puppet last run on restbase-dev1002 is OK: :) it just got drowned in the other unrelated alerts :o [21:31:35] yeah [21:32:07] ok, cool, that got me for the first couple seconds [21:32:44] Pchelolo: anything needed from root for that? [21:33:05] mutante: no, no [21:33:13] alright [21:33:54] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:34:03] !log ppchelko@tin Started deploy [changeprop/deploy@56f7511]: Revert previous deploy [21:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:11] !log ppchelko@tin Finished deploy [changeprop/deploy@56f7511]: Revert previous deploy (duration: 01m 07s) [21:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:01] (03PS3) 10Dzahn: DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - 10https://gerrit.wikimedia.org/r/357869 (owner: 10Papaul) [21:42:33] !log T160570: Rolling Cassandra restart, restbase-dev [21:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:45] T160570: Cassandra 3.x Tracking - https://phabricator.wikimedia.org/T160570 [21:48:27] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS for labtestneutron2002 and labtestnet2002 [dns] - 10https://gerrit.wikimedia.org/r/357869 (owner: 10Papaul) [21:50:09] !log mobrovac@tin Started deploy [changeprop/deploy@56f7511]: (no justification provided) [21:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:43] !log mobrovac@tin Finished deploy [changeprop/deploy@56f7511]: (no justification provided) (duration: 00m 34s) [21:50:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:31] !log mobrovac@tin Started deploy [changeprop/deploy@56f7511]: (no justification provided) [21:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:03] !log mobrovac@tin Finished deploy [changeprop/deploy@56f7511]: (no justification provided) (duration: 01m 32s) [21:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:46] !log mobrovac@tin Started deploy [changeprop/deploy@dc1948f]: (no justification provided) [21:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:20] RECOVERY - changeprop endpoints health on scb2001 is OK: All endpoints are healthy [21:56:24] !log mobrovac@tin Finished deploy [changeprop/deploy@dc1948f]: (no justification provided) (duration: 01m 39s) [21:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:49] !log mobrovac@tin Started deploy [changeprop/deploy@836b070]: Rate limiting, attempt #2 [22:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:13] !log mobrovac@tin Finished deploy [changeprop/deploy@836b070]: Rate limiting, attempt #2 (duration: 01m 23s) [22:15:20] RECOVERY - changeprop endpoints health on scb2004 is OK: All endpoints are healthy [22:15:20] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy [22:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:30] RECOVERY - changeprop endpoints health on scb2006 is OK: All endpoints are healthy [22:15:50] RECOVERY - changeprop endpoints health on scb2003 is OK: All endpoints are healthy [22:15:51] RECOVERY - changeprop endpoints health on scb2005 is OK: All endpoints are healthy [22:16:00] RECOVERY - changeprop endpoints health on scb1003 is OK: All endpoints are healthy [22:16:00] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [22:16:00] 
RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [22:16:10] RECOVERY - changeprop endpoints health on scb1004 is OK: All endpoints are healthy [22:17:04] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/MobileFrontend/includes/specials/SpecialMobileDiff.php: (no justification provided) (duration: 00m 44s) [22:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:20] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/6711/" [puppet] - 10https://gerrit.wikimedia.org/r/355871 (owner: 10Dzahn) [22:17:26] (03PS2) 10Dzahn: phabricator: move hiera lookups to parameters [puppet] - 10https://gerrit.wikimedia.org/r/355871 [22:29:04] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/RevisionSlider/src/RevisionSliderHooks.php: Livehack/test (duration: 00m 44s) [22:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:33] (03PS1) 10RobH: adding in dumpsdata00[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/357949 [22:45:17] (03CR) 10RobH: [C: 032] adding in dumpsdata00[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/357949 (owner: 10RobH) [22:46:19] is wmf.4 going to be deployed on group2? [22:48:10] RainbowSprinkles: ^ [22:48:35] (03PS1) 10Ladsgroup: Change Persian Wikis from uca-fa to xx-uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/357951 (https://phabricator.wikimedia.org/T139110) [22:48:52] I'm wondering because of T167473 - should I backport fix for .4 or .2? [22:48:53] T167473: DeleteArchive: Object does not implement ArrayAccess - https://phabricator.wikimedia.org/T167473 [22:53:09] SMalyshev: both are in prod right now, so i suppose both? [22:53:19] according to noc.wikimedia.org [22:53:30] ebernhardson: that's why I ask, wmf.2 should be theoretically gone now [22:53:34] madhuvishy: any news regarding those labstore changes? 
[22:53:41] but looks like it's not
[22:54:03] Yes, shortly :)
[22:54:10] ah, ok then :)
[22:54:22] I can wait :)
[22:55:25] paravoid: ah yes I saw your patch, i can merge now
[22:55:44] (CR) Madhuvishy: [C: 2] labstore: remove TC=$(which tc) [puppet] - https://gerrit.wikimedia.org/r/356107 (owner: Faidon Liambotis)
[22:55:50] (PS4) Madhuvishy: labstore: remove TC=$(which tc) [puppet] - https://gerrit.wikimedia.org/r/356107 (owner: Faidon Liambotis)
[22:56:14] (CR) Madhuvishy: [V: 2 C: 2] labstore: remove TC=$(which tc) [puppet] - https://gerrit.wikimedia.org/r/356107 (owner: Faidon Liambotis)
[22:56:19] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/RevisionSlider/src/RevisionSliderHooks.php: Re-syncing with permanent committed fix (duration: 00m 44s)
[22:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:37] SMalyshev: Go ahead and merge your wmf.4 patch, I'll sync it out while I'm on a roll
[22:56:55] RainbowSprinkles: i could throw in a GeoData fix while you're at it too ;)
[22:57:02] Oooooh
[22:57:03] <3
[22:57:11] The partialshardexception?
[22:57:12] :)
[22:57:38] yes, elasticsearch broke their promise about mixed clusters
[22:57:58] (CR) Faidon Liambotis: [C: -2] "No, don't use a shell script, that's definitely the wrong way to do this. A systemd unit that manages the process is ideal. Running kill f" [puppet] - https://gerrit.wikimedia.org/r/356516 (owner: Dzahn)
[22:57:59] mixed clusters (5.1.2 and 5.3.2 in same cluster) is supposed to work flawlessly, but they changed an enum
[22:58:20] so that enum emit by 5.1.2 and read by 5.3.2 is interpreted differently :S
[22:59:42] Ouch. At least it's fixable on the Cirrus side :)
[23:00:04] madhuvishy: thanks! :)
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170608T2300). Please do the needful.
[23:00:05] Jdlrobson and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:07] madhuvishy: the rest too?
[23:00:13] well, i'm just pointing geodata at a cluster thats all uniform :)
[23:00:17] AH
[23:00:19] That works too
[23:02:24] I added a patch too last minute
[23:02:26] paravoid: yes doing, rolling out the patch on tools first to make sure
[23:02:30] ok :)
[23:03:59] Amir1: I'm saving that for last, it might not make it today. I'm trying to wrap up a few production fixes we've got for wmf.4 first, so we can finish the train for the week
[23:04:16] Well, does the collation swap need a maintenance script run I think? If not, could knock it out pretty quick
[23:04:24] RainbowSprinkles: so https://gerrit.wikimedia.org/r/#/c/357953/ is the cherry-pick, the master one is merged
[23:04:36] RainbowSprinkles: it needs the maintenance script
[23:04:51] Yeah, gonna have to wait until last thing. I can't babysit the script right now
[23:05:02] yeah sure
[23:06:48] (PS1) Chad: Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955
[23:07:36] (PS4) Madhuvishy: labstore: use the interface_primary fact, not eth0 [puppet] - https://gerrit.wikimedia.org/r/356108 (owner: Faidon Liambotis)
[23:10:27] (PS1) RobH: updating recipe for 80% of lvm [puppet] - https://gerrit.wikimedia.org/r/357956
[23:11:41] (PS2) RobH: updating recipe for 80% of lvm [puppet] - https://gerrit.wikimedia.org/r/357956
[23:13:03] (CR) RobH: [C: 2] updating recipe for 80% of lvm [puppet] - https://gerrit.wikimedia.org/r/357956 (owner: RobH)
[23:14:12] (PS7) Dzahn: gerrit: let Apache proxy only listen on service IP [puppet] - https://gerrit.wikimedia.org/r/354078
[23:14:55] (CR) Dzahn: gerrit: let Apache proxy only listen on service IP (2 comments) [puppet] - https://gerrit.wikimedia.org/r/354078 (owner: Dzahn)
[23:15:01] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/GeoData/includes/Searcher.php: Temp hax to point GeoData at codfw DC (duration: 00m 43s)
[23:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:14] ebernhardson: Your patch is live everywhere
[23:15:43] (Abandoned) Dzahn: lists: lower TTL for service IP change [dns] - https://gerrit.wikimedia.org/r/354064 (owner: Dzahn)
[23:16:04] RainbowSprinkles: sweet, we'll know its working because it doesn't start emitting 100 errors/minute when wmf.4 goes out (but this is also already on wmf.2, so should be fine) :)_
[23:16:05] (Abandoned) Dzahn: lists: switch v6 service IP [dns] - https://gerrit.wikimedia.org/r/354071 (owner: Dzahn)
[23:16:32] !log demon@tin Synchronized php-1.30.0-wmf.4/extensions/CirrusSearch/includes/Job/DeleteArchive.php: Fix array access bug (duration: 00m 43s)
[23:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:47] SMalyshev: You've live now too
[23:17:57] RainbowSprinkles: cool, thanks!
[23:18:04] (PS5) Madhuvishy: labstore: avoid the hardcoding of eth0/eth1 [puppet] - https://gerrit.wikimedia.org/r/356109 (owner: Faidon Liambotis)
[23:19:05] paravoid: gerrit wants me to manually rebase https://gerrit.wikimedia.org/r/#/c/356108 on
[23:19:52] so doing that
[23:21:08] (PS5) Faidon Liambotis: labstore: use the interface_primary fact, not eth0 [puppet] - https://gerrit.wikimedia.org/r/356108
[23:21:10] (PS6) Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - https://gerrit.wikimedia.org/r/356109
[23:21:12] (PS3) Faidon Liambotis: labstore: use /sbin/tc, not $PATH/tc [puppet] - https://gerrit.wikimedia.org/r/357597
[23:21:13] madhuvishy: ^
[23:21:40] (Abandoned) Dzahn: fix lists/fermium: switch v6 service IP [puppet] - https://gerrit.wikimedia.org/r/354055 (owner: Dzahn)
[23:21:51] paravoid: /\ thank you
[23:21:57] (CR) Madhuvishy: [V: 2 C: 2] labstore: use the interface_primary fact, not eth0 [puppet] - https://gerrit.wikimedia.org/r/356108 (owner: Faidon Liambotis)
[23:23:07] actually
[23:23:12] (PS4) Faidon Liambotis: labstore: use /sbin/tc, not $PATH/tc [puppet] - https://gerrit.wikimedia.org/r/357597
[23:23:14] (PS7) Faidon Liambotis: labstore: avoid the hardcoding of eth0/eth1 [puppet] - https://gerrit.wikimedia.org/r/356109
[23:23:19] let me swap the order of those two, to get the tc fix earlier
[23:23:23] done :)
[23:28:40] (CR) Faidon Liambotis: raid: remove unused aac, twe, zfs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/357819 (owner: Faidon Liambotis)
[23:29:06] paravoid: do you know why https://gerrit.wikimedia.org/r/#/c/356109 says Submit Including parents?
[23:29:17] madhuvishy: because of ^^
[23:29:25] aah
[23:29:27] I swapped the order, the /sbin/tc thing is first now
[23:29:32] ah ah
[23:29:33] okay
[23:29:34] but I can swap it again if you want
[23:29:48] you can do that from gerrit too with a rebase -> change parent revision
[23:29:58] (CR) Madhuvishy: [C: 2] labstore: use /sbin/tc, not $PATH/tc [puppet] - https://gerrit.wikimedia.org/r/357597 (owner: Faidon Liambotis)
[23:31:49] paravoid: nah it's good, i was confused because it wouldn't let me rebase the /sbin/tc patch, nor submit it - but i realized it was because that was missing CR +2
[23:34:24] (CR) Chad: [C: 2] Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955 (owner: Chad)
[23:36:37] (PS2) Dzahn: fix all the "role-role" in system::roles [puppet] - https://gerrit.wikimedia.org/r/354172
[23:36:51] (Merged) jenkins-bot: Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955 (owner: Chad)
[23:36:59] (CR) jenkins-bot: Swapping wikipedias to wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/357955 (owner: Chad)
[23:37:56] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: remaining wikis to wmf.4
[23:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:27] paravoid: hmmm
[23:40:30] https://www.irccloud.com/pastebin/HOhF79da/
[23:42:04] oh you didn't PCC it after all?
[23:42:10] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:42:39] oh right, I see the issue
[23:42:41] ugh
[23:42:55] i did put it through the compiler
[23:43:05] interface::manual {'data'
[23:43:10] yeah
[23:43:33] will you fix or should I?
[23:43:38] i can fix
[23:43:41] :)
[23:46:08] Can someone kick HHVM on mw1275? It keeps spewing "LightProcess exiting" crud about every minute or so
[23:48:08] (PS1) Madhuvishy: labstore: Fix data interface require clause [puppet] - https://gerrit.wikimedia.org/r/357958
[23:48:24] RainbowSprinkles: did ot
[23:48:46] thx
[23:48:46] paravoid: ^^
[23:48:51] !log mw1275 - restarted hhvm (php: Lost parent, LightProcess exiting in syslog)
[23:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:55] (CR) Faidon Liambotis: [C: 2] labstore: Fix data interface require clause [puppet] - https://gerrit.wikimedia.org/r/357958 (owner: Madhuvishy)
[23:53:10] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[23:53:57] RainbowSprinkles: it kept doing it. but now it recovered (also apache service)
[23:54:26] paravoid: all done :) thanks for all the patches!
[23:54:37] thanks for the merges :)
[23:54:46] and more importantly, code review!
[23:55:06] :) yw
[23:55:16] off to bed now, bye!
[23:55:20] gnite
[23:55:42] mutante: Ok awesome, glad it's healthier now
[23:56:56] RainbowSprinkles: i think this is https://phabricator.wikimedia.org/T124956
[23:58:05] Sorta yeah
[23:58:12] and separately service(s) were crashed
[23:58:25] which was fixed, but that log line is still there. but it's like they were 2 things
[23:58:55] I wonder if mw1275 is just being bleh