[00:00:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:05:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:05:40] is https://tools.wmflabs.org/blockcalc/ down
[00:07:09] looks to be
[00:10:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:15:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:20:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:25:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:30:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:35:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:40:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:45:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:50:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[00:55:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:00:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:05:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:10:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:15:09] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:20:10] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:25:08] PROBLEM - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure]
[01:27:11] (03CR) 10Alex Monk: [C: 031] "Seems like this should do what it says on the tin. Let's get a deployment window when they're next available." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244140 (https://phabricator.wikimedia.org/T114873) (owner: 10TTO)
[01:27:37] ACKNOWLEDGEMENT - check_raid on bellatrix is CRITICAL: CRITICAL: HPSA [P420i/slot0: OK, log_1: 16.4TB,RAID6 OK, phy_1I:1:12: Predictive Failure] ori.livneh Jeff Green is aware and will look into it ASAP
[01:44:15] !og disabled puppet on holmium and labservices1001 to control roll-out of https://gerrit.wikimedia.org/r/#/c/260037/
[01:44:23] !log disabled puppet on holmium and labservices1001 to control roll-out of https://gerrit.wikimedia.org/r/#/c/260037/
[01:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:46:08] (03CR) 10Andrew Bogott: [C: 032] Insert dns entries for labs bare-metal systems. [puppet] - 10https://gerrit.wikimedia.org/r/260037 (owner: 10Andrew Bogott)
[01:47:09] RECOVERY - Disk space on restbase1003 is OK: DISK OK
[01:50:29] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:07:08] (03PS9) 10Andrew Bogott: WIP: nova-network: have dnsmasq advertise the network host as a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788
[02:07:10] (03PS1) 10Andrew Bogott: Fix dns lookups for bare-metal instances: [puppet] - 10https://gerrit.wikimedia.org/r/260323
[02:09:55] (03CR) 10Andrew Bogott: [C: 032] Fix dns lookups for bare-metal instances: [puppet] - 10https://gerrit.wikimedia.org/r/260323 (owner: 10Andrew Bogott)
[02:15:49] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:17:17] (03PS10) 10Andrew Bogott: WIP: nova-network: have dnsmasq advertise a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788
[02:19:04] (03PS11) 10Andrew Bogott: nova-network: have dnsmasq advertise a tftp server [puppet] - 10https://gerrit.wikimedia.org/r/259788
[02:20:43] !log disabling puppet on labnet1002 to mess with dnsmasq
[02:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:04] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 45s)
[02:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:51] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Dec 21 02:29:51 UTC 2015 (duration 6m 47s)
[02:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:57] 6operations, 10ops-codfw, 10fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1894170 (10Jgreen)
[02:46:36] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1894176 (10Matthewrbowker)
[04:07:49] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 74437 MB (3% inode=99%)
[04:09:28] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 23.81% of data above the critical threshold [100000000.0]
[04:15:38] RECOVERY - Disk space on restbase1008 is OK: DISK OK
[04:38:48] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[06:30:38] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:31:19] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:09] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:38] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
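The check_raid alerts above come from a plugin that summarizes the HP Smart Array (HPSA) controller state; "Predictive Failure" on phy_1I:1:12 means that drive is reporting pre-failure warnings and should be swapped while the RAID6 logical drive still has its full redundancy. A rough sketch of inspecting the same state by hand, assuming HP's hpssacli utility is installed (older hosts shipped the same subcommands as hpacucli):

```bash
# Inspect the P420i controller state that check_raid summarizes.
# Assumes HP's hpssacli tool is installed; subcommand names are the
# same under the older hpacucli.
hpssacli controller slot=0 show status
hpssacli controller slot=0 logicaldrive all show status
# Per-disk view; a drive flagged "Predictive Failure" should be
# replaced before the array loses its redundancy margin.
hpssacli controller slot=0 physicaldrive all show status
```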
[06:32:40] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1894271 (10KartikMistry)
[06:56:30] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:20] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:50] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:54:18] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Puppet last ran 4 days ago
[07:56:10] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:19:03] (03Abandoned) 10Muehlenhoff: Add ferm rules for coredb classes [puppet] - 10https://gerrit.wikimedia.org/r/228806 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff)
[08:19:27] (03Abandoned) 10Muehlenhoff: Enable ferm on db1047 [puppet] - 10https://gerrit.wikimedia.org/r/235445 (owner: 10Muehlenhoff)
[08:53:48] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[08:55:48] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[08:58:58] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:29] PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:18] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[09:03:38] RECOVERY - Apache HTTP on mw1232 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.435 second response time
[09:05:08] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 69288 bytes in 1.236 second response time
[09:07:03] !log stop cassandra on restbase1004, decommissioned
[09:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:08:18] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[09:12:10] PROBLEM - cassandra service on restbase1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[09:13:07] ACKNOWLEDGEMENT - cassandra CQL 10.64.32.160:9042 on restbase1004 is CRITICAL: Connection refused Filippo Giunchedi decommissioning
[09:13:07] ACKNOWLEDGEMENT - cassandra service on restbase1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Filippo Giunchedi decommissioning
[09:13:07] ACKNOWLEDGEMENT - puppet last run on restbase1004 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi decommissioning
[09:24:19] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[09:34:38] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[09:38:34] (03CR) 10Faidon Liambotis: [C: 04-2] "That won't work -- it's bad puppet code (the tor class is never included) for starters. Even if you fixed that, it wouldn't work due to li" [puppet] - 10https://gerrit.wikimedia.org/r/260064 (owner: 10Dzahn)
[09:40:29] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[09:40:29] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[09:42:08] (03PS1) 10Faidon Liambotis: tor: fixup address for the second instance [puppet] - 10https://gerrit.wikimedia.org/r/260333
[09:42:20] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[09:42:22] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[09:42:50] (03CR) 10Faidon Liambotis: [C: 032 V: 032] tor: fixup address for the second instance [puppet] - 10https://gerrit.wikimedia.org/r/260333 (owner: 10Faidon Liambotis)
[09:48:18] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[09:48:18] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[09:50:18] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[09:50:18] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[09:54:02] !log reenabling semisync replication on s3
[09:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:56:18] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[10:00:18] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[10:02:18] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[10:04:18] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[10:08:18] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[10:11:35] (03CR) 10Faidon Liambotis: varnish: switch from libGeoIP to libmaxminddb (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) (owner: 10Faidon Liambotis)
[10:12:20] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[10:19:59] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[10:21:58] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[10:32:09] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[10:32:09] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[10:35:08] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 700
[10:40:09] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1001
[10:43:40] (03PS1) 10Hashar: beta: update submodules for mediawiki-config [puppet] - 10https://gerrit.wikimedia.org/r/260338 (https://phabricator.wikimedia.org/T122018)
[10:43:49] (03PS1) 10Jcrespo: Reconfiguring eventlogging mysql servers [puppet] - 10https://gerrit.wikimedia.org/r/260339
[10:44:57] (03PS2) 10Jcrespo: Reconfiguring eventlogging mysql servers [puppet] - 10https://gerrit.wikimedia.org/r/260339
[10:45:08] RECOVERY - check_mysql on db1008 is OK: Uptime: 240551 Threads: 114 Questions: 12038856 Slow queries: 2554 Opens: 4169 Flush tables: 2 Open tables: 402 Queries per second avg: 50.047 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:45:49] (03PS2) 10Alexandros Kosiaris: Correctly order MediaWiki servers in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/260232 (owner: 10Southparkfan)
[10:45:57] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Correctly order MediaWiki servers in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/260232 (owner: 10Southparkfan)
[10:46:29] (03PS3) 10Jcrespo: Reconfiguring eventlogging mysql servers [puppet] - 10https://gerrit.wikimedia.org/r/260339
[10:47:50] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[10:48:13] (03CR) 10Alexandros Kosiaris: [C: 032] bacula: rename mysql-bipe to mysql_bpipe [puppet] - 10https://gerrit.wikimedia.org/r/260191 (owner: 10Dzahn)
[10:48:19] (03PS2) 10Alexandros Kosiaris: bacula: rename mysql-bipe to mysql_bpipe [puppet] - 10https://gerrit.wikimedia.org/r/260191 (owner: 10Dzahn)
[10:49:21] (03CR) 10Alexandros Kosiaris: [C: 032] base: rename standard-packages to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/260196 (owner: 10Dzahn)
[10:49:40] (03PS2) 10Alexandros Kosiaris: base: rename standard-packages to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/260196 (owner: 10Dzahn)
[10:49:43] (03PS4) 10Jcrespo: Reconfiguring eventlogging mysql servers [puppet] - 10https://gerrit.wikimedia.org/r/260339
[10:49:59] (03CR) 10Hashar: [C: 031 V: 031] "Cherry picked on beta puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/260338 (https://phabricator.wikimedia.org/T122018) (owner: 10Hashar)
[10:50:22] (03PS1) 10Muehlenhoff: Enable ferm on mw1161-mw1169 [puppet] - 10https://gerrit.wikimedia.org/r/260340
[10:51:00] (03CR) 10Jcrespo: [C: 032] Reconfiguring eventlogging mysql servers [puppet] - 10https://gerrit.wikimedia.org/r/260339 (owner: 10Jcrespo)
[10:51:49] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[10:53:50] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[10:55:40] (03PS1) 10Jcrespo: Revert "bacula: rename mysql-bipe to mysql_bpipe" [puppet] - 10https://gerrit.wikimedia.org/r/260341
[10:55:58] ^akosiaris
[10:56:49] (03PS2) 10Jcrespo: Revert "bacula: rename mysql-bipe to mysql_bpipe" [puppet] - 10https://gerrit.wikimedia.org/r/260341
[10:58:49] (03CR) 10Jcrespo: [C: 032 V: 032] Revert "bacula: rename mysql-bipe to mysql_bpipe" [puppet] - 10https://gerrit.wikimedia.org/r/260341 (owner: 10Jcrespo)
[10:59:29] (03PS1) 10Jcrespo: Revert "Revert "bacula: rename mysql-bipe to mysql_bpipe"" [puppet] - 10https://gerrit.wikimedia.org/r/260343
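The db1008 alerts above are a replication-lag check: the replica threads are running (Slave IO/SQL: Yes) but Seconds Behind Master crossed the SLOW_SLAVE threshold before dropping back to 0. A minimal sketch of that kind of check follows; the hostname and thresholds are illustrative, not the production settings. The exit codes follow the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL), which is what turns the output into the PROBLEM/RECOVERY lines above.

```bash
#!/bin/bash
# Minimal replication-lag check in the spirit of check_mysql.
# Host and thresholds are illustrative, not the production values.
set -eu
HOST=${1:-db1008.eqiad.wmnet}
WARN=300  # seconds behind master
CRIT=600

LAG=$(mysql -h "$HOST" -e 'SHOW SLAVE STATUS\G' |
      awk '/Seconds_Behind_Master:/ {print $2}')

if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
    echo "CRITICAL: replication not running on $HOST"; exit 2
elif [ "$LAG" -ge "$CRIT" ]; then
    echo "SLOW_SLAVE CRITICAL: Seconds Behind Master: $LAG"; exit 2
elif [ "$LAG" -ge "$WARN" ]; then
    echo "SLOW_SLAVE WARNING: Seconds Behind Master: $LAG"; exit 1
else
    echo "OK: Seconds Behind Master: $LAG"; exit 0
fi
```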
mysql_bpipe"" [puppet] - 10https://gerrit.wikimedia.org/r/260343 [10:59:39] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [10:59:40] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:01:45] (03CR) 10Filippo Giunchedi: [C: 031] mw_rc_irc: rename irc-echo to irc_echo [puppet] - 10https://gerrit.wikimedia.org/r/260202 (owner: 10Dzahn) [11:02:58] !log emergency restart of db1047's mysql [11:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:01] (03PS3) 10Alexandros Kosiaris: base: rename standard-packages to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/260196 (owner: 10Dzahn) [11:06:22] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:06:35] (03CR) 10Alexandros Kosiaris: [V: 032] base: rename standard-packages to standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/260196 (owner: 10Dzahn) [11:07:33] (03PS2) 10Alexandros Kosiaris: Revert "Revert "bacula: rename mysql-bipe to mysql_bpipe"" [puppet] - 10https://gerrit.wikimedia.org/r/260343 (owner: 10Jcrespo) [11:07:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "Revert "bacula: rename mysql-bipe to mysql_bpipe"" [puppet] - 10https://gerrit.wikimedia.org/r/260343 (owner: 10Jcrespo) [11:10:21] (03CR) 10Faidon Liambotis: [C: 031] Add shinken module/roles [puppet] - 10https://gerrit.wikimedia.org/r/259008 (owner: 10Alexandros Kosiaris) [11:11:03] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:12:12] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:25:44] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:28:42] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:34:02] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [11:34:52] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:44:32] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:47:33] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [11:49:43] (03Abandoned) 10Lokal Profil: Make issued and modified typed [puppet] - 10https://gerrit.wikimedia.org/r/251492 (https://phabricator.wikimedia.org/T117533) (owner: 10Lokal Profil) [11:50:26] (03Abandoned) 10Lokal Profil: Localisation updates from translatewiki.net [puppet] - 10https://gerrit.wikimedia.org/r/251493 (owner: 10Lokal Profil) [11:52:23] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:01:32] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [12:03:23] 
[12:09:22] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[12:10:12] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[12:11:14] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[12:13:14] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:14:32] PROBLEM - puppet last run on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:17:12] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[12:18:23] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 33 minutes ago with 0 failures
[12:19:04] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed
[12:19:04] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[12:24:13] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 12 failures
[12:25:02] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[12:25:54] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api
[12:29:44] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[12:30:54] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[12:34:12] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73693 MB (3% inode=99%)
[12:47:52] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 7 failures
[12:49:36] (03PS1) 10Mobrovac: Monitoring script: Report failing x-ample title in the message [puppet] - 10https://gerrit.wikimedia.org/r/260354
[12:51:03] (03CR) 10Alexandros Kosiaris: [C: 032] OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem)
[12:51:10] (03PS16) 10Alexandros Kosiaris: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem)
[12:53:33] RECOVERY - Disk space on restbase1008 is OK: DISK OK
[13:08:14] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:08:24] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:43] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:08:44] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:09:22] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:09:32] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:11:03] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:11:13] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:11:33] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
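Most of the mw1006/mw1016 noise here is not the monitored services failing but NRPE itself: the host is so loaded that the agent cannot answer within the plugin's 10-second socket timeout, so every check on that host goes CRITICAL at once and later recovers together. Running a single probe by hand from the monitoring side makes that easy to confirm. A sketch; the plugin path and the remote command name are assumptions:

```bash
# Run one NRPE check by hand against a wedged host. The plugin path
# and remote command name are assumptions; -t is the same 10-second
# timeout the alerts above keep hitting.
/usr/lib/nagios/plugins/check_nrpe -H mw1006.eqiad.wmnet -t 10 -c check_disk
echo "exit code: $?"  # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
```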
[13:12:02] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:12:34] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 35 failures
[13:12:54] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:13:03] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[13:13:22] RECOVERY - Disk space on mw1006 is OK: DISK OK
[13:13:23] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:13:23] akosiaris, awesome! thank you!!! should we rebuild the osm db now?
[13:13:44] (03CR) 10Alexandros Kosiaris: [C: 032] Monitoring script: Report failing x-ample title in the message [puppet] - 10https://gerrit.wikimedia.org/r/260354 (owner: 10Mobrovac)
[13:14:30] yurik: yeah, not so awesome, it's not working. because of that flat-nodes parameter. I think you should rebuild the db indeed if that's what produces it
[13:14:52] MaxSem, ^
[13:15:22] !log disabled puppet on maps-test2001 and commented out osmupdater crontab entry until we fix the sync process
[13:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:18:53] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:19:12] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:19:23] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:21:37] weee
[13:22:26] yurik, wanna do the honor?
[13:23:00] @seen hoo
[13:23:00] Steinsplitter: Last time I saw hoo they were quitting the network with reason: Ping timeout: 260 seconds N/A at 12/21/2015 4:28:20 AM (8h54m40s ago)
[13:27:52] MaxSem: when you pick the file to update the db with, please let me know because the state.txt osmosis file needs to be manually created the very first time and it's dependent on what you initialize the DB with (dates are important basically)
[13:33:12] RECOVERY - Disk space on mw1006 is OK: DISK OK
[13:33:12] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 57 minutes ago with 0 failures
[13:33:44] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[13:33:53] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed
[13:34:02] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[13:34:13] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[13:34:23] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up
[13:34:43] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[13:34:54] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[13:35:02] RECOVERY - DPKG on mw1006 is OK: All packages OK
[13:36:10] yurik, ping me when you're back
[13:36:16] MaxSem, back
[13:36:16] sure thing akosiaris
[13:36:31] yurik, sooooo.... ;)
[13:36:33] thanks!
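The point akosiaris makes at 13:27:52 is that osmosis replication starts from a state file: when the database is imported from a planet dump, state.txt has to be seeded by hand with a sequence number and timestamp at (or slightly before) the dump date, so the first incremental update does not skip edits. A sketch of seeding it, assuming the usual Java-properties layout of replication state files; the sequence number, timestamp and path below are placeholders that must come from the dump actually imported. Osmosis's --read-replication-interval-init task can create the surrounding working directory (configuration.txt and friends) that this file lives in.

```bash
# Seed the osmosis replication state to match the planet import.
# Layout follows the usual replication state.txt (Java properties,
# colons escaped); sequenceNumber, timestamp and the path below are
# placeholders and must be taken from the dump actually imported.
mkdir -p /srv/osmosis
cat > /srv/osmosis/state.txt <<'EOF'
sequenceNumber=1234567
timestamp=2015-12-16T00\:00\:00Z
EOF
```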
[13:36:38] (03PS2) 10Alexandros Kosiaris: Monitoring script: Report failing x-ample title in the message [puppet] - 10https://gerrit.wikimedia.org/r/260354 (owner: 10Mobrovac)
[13:37:07] MaxSem, its better if you do the honors of re-importing into postgres )
[13:37:14] you know those params far better )
[13:37:20] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1894662 (10jcrespo) 3NEW
[13:37:24] akosiaris, will we need to suspend repl?
[13:37:31] yurik, then we'll have to wait till next year
[13:37:39] MaxSem, why?
[13:37:42] ah
[13:37:44] yes
[13:37:45] cuz no key
[13:38:00] MaxSem, hangout
[13:39:03] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:39:04] yurik: I have done that already
[13:39:14] (03:15:22 μμ) akosiaris: !log disabled puppet on maps-test2001 and commented out osmupdater crontab entry until we fix the sync process
[13:39:37] akosiaris, but did you stop the db replication?
[13:39:47] akosiaris, hmm - something's wrong then because it's ensure => absent in maps.pp
[13:39:54] yurik: why would I do that ?
[13:40:24] akosiaris, that's what i was asking - do we need to stop replication when we do a full re-import :))
[13:40:27] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1894662 (10jcrespo)
[13:40:35] yurik: no, not really
[13:40:39] ok
[13:40:48] well, at least that's the hope
[13:41:22] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:41:32] famous last words
[13:41:43] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:41:53] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:41:53] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:41:53] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:04] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:42:12] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:23] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:42:44] (03PS2) 10ArielGlenn: 2014.7.5 jessie, backport patches for singleton SAuth class [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/259674
[13:43:02] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:43:43] (03PS1) 10Alexandros Kosiaris: osm::planet_sync: Add ensure parameter as well in sync cron [puppet] - 10https://gerrit.wikimedia.org/r/260361
[13:45:33] RECOVERY - Disk space on mw1016 is OK: DISK OK
[13:45:42] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up
[13:45:53] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:46:02] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212
[13:46:02] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[13:46:12] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:46:33] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:46:52] !log extending online s2-master data disk by +100GB
[13:46:52] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:47:02] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:47:06] (03CR) 10Alexandros Kosiaris: [C: 032] osm::planet_sync: Add ensure parameter as well in sync cron [puppet] - 10https://gerrit.wikimedia.org/r/260361 (owner: 10Alexandros Kosiaris)
[13:47:12] RECOVERY - DPKG on mw1016 is OK: All packages OK
[13:47:30] (03PS1) 10ArielGlenn: Revert "2014.7.5 jessie, backport patches for singleton SAuth class" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260362
[13:47:38] (03PS2) 10ArielGlenn: Revert "2014.7.5 jessie, backport patches for singleton SAuth class" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260362
[13:47:52] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[13:48:02] (03CR) 10ArielGlenn: [C: 032 V: 032] Revert "2014.7.5 jessie, backport patches for singleton SAuth class" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260362 (owner: 10ArielGlenn)
[13:48:16] akosiaris, could you create a dir in maps2001 /srv with write access to max & i?
[13:48:44] (03PS1) 10ArielGlenn: Revert "make ping_on_rotate work without minion data cache" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260363
[13:48:47] (03PS2) 10ArielGlenn: Revert "make ping_on_rotate work without minion data cache" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260363
[13:48:56] (03CR) 10ArielGlenn: [C: 032 V: 032] Revert "make ping_on_rotate work without minion data cache" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260363 (owner: 10ArielGlenn)
[13:49:02] MaxSem: ok that part is fixed https://gerrit.wikimedia.org/r/260361
[13:49:19] (03PS1) 10ArielGlenn: Revert "jessie 2014.7.5 continue reading events even after getting one with wrong tag" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260364
[13:49:24] (03PS2) 10ArielGlenn: Revert "jessie 2014.7.5 continue reading events even after getting one with wrong tag" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260364
[13:49:33] (03CR) 10ArielGlenn: [C: 032 V: 032] Revert "jessie 2014.7.5 continue reading events even after getting one with wrong tag" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260364 (owner: 10ArielGlenn)
[13:49:34] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:50:00] (03PS1) 10ArielGlenn: Revert "jessie 2014.7.5 patch for batch cli returns with broken dict" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260366
[13:50:03] yurik: /srv/temp
[13:50:05] (03PS2) 10ArielGlenn: Revert "jessie 2014.7.5 patch for batch cli returns with broken dict" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260366
[13:50:12] thx
[13:50:16] (03CR) 10ArielGlenn: [C: 032 V: 032] Revert "jessie 2014.7.5 patch for batch cli returns with broken dict" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260366 (owner: 10ArielGlenn)
[13:50:52] RECOVERY - Disk space on mw1006 is OK: DISK OK
[13:51:33] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed
[13:52:19] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/259071 (https://phabricator.wikimedia.org/T121435) (owner: 10Chad)
[13:52:55] (03PS1) 10ArielGlenn: Revert "bump version number for wmf build, 2014.7.5+ds-1+wm1" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260368
[13:53:17] (03PS3) 10Alexandros Kosiaris: Monitoring script: Report failing x-ample title in the message [puppet] - 10https://gerrit.wikimedia.org/r/260354 (owner: 10Mobrovac)
[13:53:23] (03CR) 10Alexandros Kosiaris: [V: 032] Monitoring script: Report failing x-ample title in the message [puppet] - 10https://gerrit.wikimedia.org/r/260354 (owner: 10Mobrovac)
[13:54:25] 6operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#1894688 (10jcrespo) While extending the Logical volume was successful, extending the actual xfs partition fails with: ``` xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Cannot allocate me...
[13:56:01] Hey does anyone here have an IPv6 connection they could help me smoke test with?
[13:56:53] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:57:24] AndyRussG: I have one but I'm about to be in a phone call
[13:57:30] can it wait a few minutes?
[13:58:10] apergos: thx so much, paravoid is helping on -dev!
[13:58:16] ok great!
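jynus's "+100GB" log entry above and the T122048 follow-up describe the standard online grow for an XFS volume on LVM: extend the logical volume, then grow the mounted filesystem, which is the step where the xfsctl "Cannot allocate memory" error surfaced. A minimal sketch of the usual two commands, with placeholder volume-group, LV and mount-point names:

```bash
# Online grow of an XFS filesystem sitting on LVM. Volume group, LV
# and mount point are placeholders. XFS can only be grown while
# mounted, so xfs_growfs takes the mount point, not the device.
lvextend -L +100G /dev/vg0/data
xfs_growfs /srv
df -h /srv  # confirm the new size
```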
[13:59:43] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:23] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:04:33] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[14:04:43] RECOVERY - Disk space on mw1006 is OK: DISK OK
[14:05:32] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[14:05:43] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[14:05:53] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[14:06:03] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up
[14:06:43] RECOVERY - DPKG on mw1006 is OK: All packages OK
[14:07:45] !log me and yurik are nuking old maps data and reimporting planet
[14:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:53] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[14:09:32] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed
[14:12:32] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[14:18:24] (03PS1) 10ArielGlenn: Revert "Revert "jessie 2014.7.5 patch for batch cli returns with broken dict"" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260371
[14:18:27] (03PS1) 10ArielGlenn: Revert "Revert "jessie 2014.7.5 continue reading events even after getting one with wrong tag"" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260372
[14:18:29] (03PS1) 10ArielGlenn: Revert "Revert "make ping_on_rotate work without minion data cache"" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260373
[14:18:30] (03PS1) 10ArielGlenn: Revert "Revert "2014.7.5 jessie, backport patches for singleton SAuth class"" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260374
[14:18:32] (03PS1) 10ArielGlenn: Revert "Revert "bump version number for wmf build, 2014.7.5+ds-1+wm1"" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260375
[14:19:34] apergos: would you be around to test over IPv6 after this goes out on production, say around 16:15 or 16:30 UTC? (paravoid is likely to be in a meeting)
[14:19:54] that is... 7:30 our time, I'll be in that same meeting
[14:20:06] if it were 15:45 UTC I could do it
[14:20:12] And
[14:20:17] AndyRussG:
[14:20:32] 16:30 UTC is 18:15-18:30 EET (our time), so no
[14:20:39] how can there be three people in here with and for tab-complete?
[14:20:47] bah, it's two hours difference now
[14:20:52] in that case, yes
[14:21:41] AndyRussG: also, https://www.sixxs.net/main/ if you want to play more with all that :)
[14:22:13] !log disabling puppet on labnet1002 for dnsmasq tests
[14:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:23:01] paravoid: yeah I saw that, seemed you have to register to use a real tunnel?
[14:23:20] it's been so long I don't remember
[14:23:52] I did sometime back
[14:23:57] apergos: now you see how hard a time I had choosing an IRC nick!
[14:24:16] rg_andy not so much I guess
[14:24:39] actually I wanted "arg"
[14:24:43] :-D
[14:25:10] Then I wouldn't be able to tell if people were pinging me or complaining about my stupid mistakes
[14:25:40] same thing really :-P
[14:25:54] anyways yes, 16:xx UTC I am here
[14:26:06] 17:00 UTC I turn into a meeting pumpkin
[14:26:22] apergos: OK got it, much appreciated :)
[14:28:24] well, at least "arg" would have saved people from typing "arg" again
[14:28:47] They could have just said arg²
[14:29:44] (03CR) 10Hashar: [C: 031 V: 031] "Impacts production hosts gallium.wikimedia.org and scandium.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/260189 (owner: 10Dzahn)
[14:30:53] paravoid: do you have a few minutes to help me understand what’s happening with my labs-metal pxe boot? Here’s a dump from the dhcp side: https://dpaste.de/4awt
[14:35:13] (03CR) 10Hashar: [C: 04-1] "Needs reindentation. The file uses two spaces." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/259699 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[14:36:03] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 13 failures
[14:36:11] (03CR) 10Hashar: [C: 031] RuboCop: fixed Style/ColonMethodCall offence [puppet] - 10https://gerrit.wikimedia.org/r/259702 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin)
[14:38:12] 6operations: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1894746 (10MoritzMuehlenhoff) 3NEW
[14:44:22] andrewbogott: a bit later if you don't mind
[14:44:33] sure, at your convenience :)
[14:47:53] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 13 failures
[14:48:34] (03PS1) 10Jcrespo: Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377
[14:49:57] (03CR) 10Mark Bergsma: Disable critical errors for codfw slaves due to lag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo)
[14:50:44] (03PS2) 10Jcrespo: Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377
[14:52:21] (03CR) 10Jcrespo: [C: 04-1] "(aside from the technical flaws, which should be corrected, too) I think there is not strong decision yet about this." [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo)
[14:54:03] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail
[14:55:42] (03PS3) 10Jcrespo: Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377
[14:56:16] (03CR) 10Jcrespo: [C: 04-1] Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo)
[15:11:50] (03CR) 10Ottomata: "Thanks for the reminder, this is no longer used, and can be totally removed. I'll try to get to that today, so probably this patch won't " [puppet] - 10https://gerrit.wikimedia.org/r/260194 (owner: 10Dzahn)
[15:13:17] (03PS1) 10Faidon Liambotis: Revoke Coren's access [puppet] - 10https://gerrit.wikimedia.org/r/260379
[15:13:32] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:14:06] (03CR) 10Jcrespo: [C: 04-2] "This doesn't work: https://puppet-compiler.wmflabs.org/1523/db1073.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo)
[15:14:36] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revoke Coren's access [puppet] - 10https://gerrit.wikimedia.org/r/260379 (owner: 10Faidon Liambotis)
[15:23:58] (03PS2) 10ArielGlenn: Revert "Revert "2014.7.5 jessie, backport patches for singleton SAuth class"" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260374
[15:31:12] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:36:22] (03CR) 10ArielGlenn: [C: 032 V: 032] Revert "bump version number for wmf build, 2014.7.5+ds-1+wm1" [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260368 (owner: 10ArielGlenn)
[15:37:32] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:43] (03PS2) 10ArielGlenn: bump version number for wmf build, 2014.7.5+ds-1+wm1 [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260375
[15:37:45] (03PS3) 10ArielGlenn: 2014.7.5 jessie, backport patches for singleton SAuth class [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260374
[15:37:47] (03PS2) 10ArielGlenn: make ping_on_rotate work without minion data cache [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260373
[15:37:49] (03PS2) 10ArielGlenn: jessie 2014.7.5 continue reading events even after getting one with wrong tag [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260372
[15:37:51] (03PS2) 10ArielGlenn: jessie 2014.7.5 patch for batch cli returns with broken dict [debs/salt] (jessie) - 10https://gerrit.wikimedia.org/r/260371
[15:38:40] (03PS1) 10Faidon Liambotis: Remove mpelletier from paging [puppet] - 10https://gerrit.wikimedia.org/r/260385
[15:39:02] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Remove mpelletier from paging [puppet] - 10https://gerrit.wikimedia.org/r/260385 (owner: 10Faidon Liambotis)
[15:44:21] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1894881 (10akosiaris) >>! In T119598#1885991, @RobH wrote: > Ok, the cleanup of spares (via another task, they were migrated into a sheet for tracking and re-audited) has resulted in my finding a f...
[15:44:26] !log adding new 1TB disk to restbase1007
[15:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:51:24] (03PS4) 10Jcrespo: Disable critical errors for codfw slaves due to lag [puppet] - 10https://gerrit.wikimedia.org/r/260377
[15:54:14] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:55:58] the API is really showering me with timeouts today, on requests like https://commons.wikimedia.org/w/api.php?action=query&list=usercontribs&uclimit=max&ucdir=newer&ucprop=title|timestamp|comment|tags&ucuser=...
[15:56:27] i've been running similar queries a few days ago and they were going smoothly.
[15:56:59] (03CR) 10Jcrespo: "This works." [puppet] - 10https://gerrit.wikimedia.org/r/260377 (owner: 10Jcrespo)
[15:57:51] anomie: ostriches: thcipriani: marktraceur: Krenair: Hi! I'm here to be swatted!
[15:58:06] * marktraceur swats AndyRussG
[15:58:17] I should really take my name off the SWAT list
[15:58:46] So that was the last time?
[15:59:42] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:59:53] PROBLEM - dhclient process on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:59:54] (03Abandoned) 10Ottomata: Revert new kafka role inclusion on kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/260089 (owner: 10Ottomata)
[16:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151221T1600). Please do the needful.
[16:00:05] AndyRussG prtksxna: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[16:00:44] PROBLEM - salt-minion processes on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:00:55] AndyRussG: SWAT !== swat
[16:00:59] I can SWAT. prtksxna ping for SWAT.
[16:01:18] thcipriani: thanks! also marktraceur thanks (in another sense)
[16:01:28] !log jessie packages for salt with local patches deployed on restbase1001, looks fine but just in case.
[16:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:01:45] thcipriani: o/
[16:02:29] * apergos lurks
[16:04:24] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy
[16:05:55] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection timed out
[16:06:36] looking ^
[16:06:48] did any of you reboot/work with mw1085?
[16:06:55] thcipriani: Any thing you need me to do?
[16:07:32] prtksxna: nope, your patch looks fine to me. Just getting through CentralNotice update first.
[16:07:44] thcipriani: Cool :)
[16:08:31] !log depool restbase1007
[16:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:09:55] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 3.003 second response time on port 9042
[16:13:45] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:15:24] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1894961 (10fgiunchedi) started today to expand restbase1007: ``` sfdisk -d /dev/sda | sfdisk /dev/sdc mdadm --add /dev/md0 /dev/sdc1 mdadm --add /dev/md1 /dev/sdc2 mdadm --...
[16:16:01] gwicke urandom ^
[16:16:39] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1894963 (10RobH) a:3cwdent @cwdent: Since we have Katie's approval, we'll just need a few things from you. Do you know what groups you need to be added to to view the pageview data needed...
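The T119935 comment above (cut off by the bot) clones an existing member's partition table onto the new SSD and adds the partitions to the md arrays. A plausible sketch of how such an expansion continues; the /dev/md2 data array and the member count are assumptions, since the pasted commands are truncated. As the discussion just below notes, the filesystem grow is nearly instant while the array reshape is the slow part:

```bash
# Plausible continuation of the expansion quoted above; /dev/md2 as
# the data array and the raid-devices count are assumptions, the
# pasted commands being cut off. Growing the member count starts a
# reshape, which is slow; the filesystem grow afterwards is quick.
mdadm --grow /dev/md2 --raid-devices=3
cat /proc/mdstat   # watch reshape progress
xfs_growfs /srv    # grow the filesystem once the reshape is done
```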
[16:16:54] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[16:17:00] gwicke: I was incorrect in my estimation on expanding to a third ssd btw, resizing the fs is quick but resizing the raid0 isn't
[16:17:03] that's me ^
[16:17:12] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1894965 (10RobH) p:5Triage>3Normal
[16:17:15] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is deactivating
[16:18:22] hmm scap timeout to mw1085 and probably another mw server as well, still hung up.
[16:18:28] !log thcipriani@tin Synchronized php-1.27.0-wmf.9/extensions/CentralNotice/resources/subscribing/ext.centralNotice.geoIP.js: SWAT: Update CentralNotice [[gerrit:260316]] (duration: 03m 03s)
[16:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:18:43] nope, just mw1085 :\
[16:18:57] ^ AndyRussG check the centralnotice update please.
[16:19:19] thcipriani: K will do!
[16:19:47] !log reboot restbase1007, load through the roof
[16:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:19:55] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed
[16:19:58] AndyRussG: standing by for your tests when you want them
[16:22:12] !log mw1085.eqiad.wmnet times out on SSH connection
[16:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:22:29] apergos: K from here it looks like the new code is out! Try a private browsing window and make sure your network tab is capturing requests first, please!! :)
[16:22:53] AndyRussG: url please?
[16:23:04] apergos: en.wikipedia.org
[16:23:30] apergos: sorry, it's just to check that you get correctly geolocated on IPv6
[16:24:06] Once you're there, also check your cookies to see what your GeoIP cookies are (I think there'll be two, one for .wikipedia.org and another for en.wikipedia.org)
[16:24:35] Pls check in the network tab that there was a request to geoiplookup.wikimedia.org/
[16:24:51] thcipriani, management interface is also unresponsive
[16:24:58] And in the console, pls try to see what mw.centralNotice.data.country says
[16:25:04] apergos: and thx!!!!
[16:25:33] correction: the actual management is responsive, the serial console interface is not
[16:26:05] hm I don't see the geoiplookup request
[16:26:12] let me check about cookies
[16:26:44] apergos: private browsing? cleared cache/localstorage module store?
[16:26:48] What browser are you testing on?
[16:26:50] !log powercycling mw1085.eqiad.wmnet
[16:26:51] private window
[16:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:27:26] firefox 42.0
[16:27:35] Hi thcipriani
[16:27:39] apergos: I'm pretty sure I've had interference from existing locally stored content in FF private windows, can u try Chrome or Chromium?
[16:27:43] it is rebooting now
[16:27:45] Krenair: Hello
[16:27:47] sure
[16:27:49] let me do that
[16:27:50] thx!
[16:28:08] thcipriani, would I be able to squeeze in a backport of the patch for https://phabricator.wikimedia.org/T121596#1892700 ?
[16:28:15] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:28:25] PROBLEM - RAID on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:28:35] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:28:44] PROBLEM - SSH on mw1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:28:47] AndyRussG: you get to walk me through dev tools on chrome though
[16:28:52] I don't even know where they are
[16:28:55] PROBLEM - configured eth on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:29:05] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:29:06] ostriches: your call on https://phabricator.wikimedia.org/T121596#1892700
[16:29:15] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:29:15] apergos: K try ctrl-shift-I
[16:29:28] It's almost identical
[16:29:34] Private window first, again
[16:29:34] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:29:45] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[16:29:54] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:29:59] Then ctrl-shift-I, then click on network (maybe not necessary, but just in case) to make sure it captures network activity
[16:30:02] geoiplookup is in the list, yay
[16:30:08] ;)
[16:30:11] yeah I just couldn't find it in their silly menu
[16:30:11] nice!
[16:30:17] let's see
[16:30:22] Krenair: hmm, looks like ostriches is away :\
[16:31:14] RECOVERY - dhclient process on restbase1007 is OK: PROCS OK: 0 processes with command name dhclient
[16:31:35] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[16:31:55] RECOVERY - cassandra-a service on restbase1007 is OK: OK - cassandra-a is active
[16:32:01] thcipriani, is approval from ostriches needed because fundraising?
[16:32:24] RECOVERY - salt-minion processes on restbase1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:33:01] mw1016 is overloaded, but responding
[16:33:30] Krenair: yeah, this week we're trying to limit deploys because fundraising season and everyone is going to be not around fairly shortly. See https://wikitech.wikimedia.org/wiki/Deployments#Week_of_December_21st
[16:33:56] thcipriani, okay. this backport only touches JS files that are loaded by VE
[16:34:05] !log sync-common to mw1085
[16:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:34:30] well I couldn't get that element but I see that geo is set to city: athens and country:gr
[16:34:35] so thats good anyways
[16:34:36] dammit xchat
[16:34:44] thcipriani, okay, I guess this isn't security or data loss
[16:34:54] eh
[16:35:05] are we talking about the white-space thing that wrecked everything?
[16:35:21] i would totally backport that. it's a big UI regression in some cases
[16:36:15] AndyRussG: does that get it for you?
[16:36:49] apergos: which element couldn't you get?
[16:37:16] apergos: you should definitely get something from mw.centralNotice.data.country in the console
[16:37:23] well I was looking for a way to get at mw.centralNotice.data.country
[16:37:36] Open the JavaScript console
[16:37:39] And just type that
[16:37:47] jynus: thanks for the sync-common, appreciated :)
[16:37:48] (or paste)
[16:38:06] oh, I tried giving an actual javascript command
[16:38:07] silly me
[16:38:10] np!
[16:38:21] of course it did something bizarre with that
[16:38:36] It's the console "tab", along the top of the dev tools area
[16:38:44] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 60 failures
[16:38:47] that's where I was
[16:38:55] thcipriani, please feel free to give me feedback on things I should or should not do, sometimes it is the lack of knowledge/coordination
[16:38:57] 'GR'
[16:39:04] apergos: OK fantastic
[16:39:05] which matches with the geoip returns
[16:39:06] so good
[16:39:09] Yeah!
[16:39:22] apergos: and the cookies? There should be two GeoIP cookies
[16:40:08] jynus: sync-common was definitely needed in this instance.
[16:40:22] thanks
[16:40:29] jynus: hola, do you have couple mins?
[16:40:41] apergos: One nearly empty (on the .wikipedia.org domain) and another with the same info in window.Geo (on the en.wikipedia.org domain)
[16:40:49] nuria, please give me some minutes, I'm in the middle of something
[16:40:54] jynus: k
[16:41:32] yeah there's one that says v6
[16:41:56] and the other one says GR, Athens, stuff
[16:41:56] right
[16:42:00] so that looks ok
[16:42:05] apergos: OK fantastic, yea that's perfect
[16:42:35] anything else to check?
[16:42:47] apergos: one last thing to try? Just reload the page in the same tab, and look at the network requests, and check the same stuff in the JS console?
[16:42:52] sure
[16:43:00] The second time around there shouldn't be any requests to geoiplookup
[16:43:16] But the data in Geo and mw.centralNotice.data.country should still be good
[16:43:19] well there is
[16:43:51] apergos: still a request to geoiplookup?
[16:43:53] yep
[16:44:14] apergos: Hmmm... Which domains are your two GeoIP cookies for?
[16:45:04] oh yeah so
[16:45:09] they are both apparently for en.wi.o
[16:45:13] er en.wp.o
[16:45:22] BTW this is a smaller issue that I can try debugging by setting up a real IPv6 connection
[16:45:41] apergos: that's weird. If that's the case, one should have overwritten the other
[16:46:12] (03PS1) 10Ottomata: Remove misc/logging.pp and ::relay related classes that are no longer used [puppet] - 10https://gerrit.wikimedia.org/r/260391
[16:46:12] Isn't one just for .wikipedia.org without the "en"?
[16:46:19] well it's not listing that way
[16:46:33] maybe I need to expand some of these columns
[16:46:34] thcipriani: Should I be moving my patch for the next window?
[16:46:45] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[16:46:51] ah there it is
[16:46:58] yeah that was in a hidden column meh
[16:47:27] so yes. one is .wikipedia.org, one is en.wikipedia.org (the one with all the geo stuff)
[16:47:37] AndyRussG: is it ok if I go ahead and deploy the next patch in the window? Everything seem to be working ok for the time being?
[16:47:45] thcipriani: the CentralNotice patch is definitely good, we won't be reverting or anything, just taking advantage of apergos's help to look at some more minor details
[16:47:45] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:48:12] prtksxna: kk, now I can get yours out the door
[16:48:14] thcipriani: so yeah all good, thanks for waiting and asking!!! Sorry to have hogged the SWAT time prtksxna
[16:48:25] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212
[16:48:50] thcipriani: Thanks!
[16:49:03] 6operations: Move misc/udp2log.pp to a module, and role/logging.pp somewhere better - https://phabricator.wikimedia.org/T122058#1895064 (10Ottomata) 3NEW a:3Ottomata [16:49:11] apergos: K that's good. Hmm the extra call to geoiplookup might be from some other legacy code calling it [16:49:14] so AndyRussG I have 11 more minutes available [16:49:18] ah ha [16:49:27] (03CR) 10Ottomata: [C: 032] Remove misc/logging.pp and ::relay related classes that are no longer used [puppet] - 10https://gerrit.wikimedia.org/r/260391 (owner: 10Ottomata) [16:49:28] apergos: though it might also be a bug [16:50:39] apergos: we're certainly much better off than before and we can confirm the larger issue is gone by checking server and banner logs. If you feel like opening a debug session quickly and don't have more pressing stuff, that'd be fun, otherwise I can set up IPv6 to hack elsewise :) [16:51:09] I can be back here in a little over an hour if you like [16:51:26] apergos: OK sure! that'd be fantastic :) [16:51:33] registering might be faster (I can't remember whether they had to approve me in some fashion or if it was automated)... anyways, I'll peek in [16:51:43] Yeah I'll also register! [16:51:48] ttyl [16:51:53] apergos: cya thx again!! [16:53:30] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1895082 (10fgiunchedi) on `restbase1007` the last step didn't go according to plan, while resizing was ongoing the machine's load went through the roof and I've force reboot... [16:53:52] AndyRussG: No worries :) [16:54:06] prtksxna: thx! [16:54:45] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:55:05] PROBLEM - nutcracker process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:56:27] !log thcipriani@tin Synchronized php-1.27.0-wmf.9/extensions/Popups/Popups.hooks.php: SWAT: Use ExtensionRegistry to determine whether TextExtracts is installed [[gerrit:260346]] (duration: 02m 48s) [16:56:29] thcipriani: Looks fixed to me. [16:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:56:38] prtksxna: that was quick :) [16:56:58] thcipriani: Indeed. Thanks a lot! [16:57:08] prtksxna: thanks for checking, appreciated. [16:57:42] !log timeout on sync-file to mw1016.eqiad.wmnet [16:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:53] yep, same issue [16:57:54] RECOVERY - Disk space on mw1016 is OK: DISK OK [16:57:58] thcipriani: I just saw the logmsgbot's message :P I guess I refreshed at just the right time [16:58:19] but I am trying not to restart it as ^it is still alive [16:58:21] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1895112 (10GWicke) 1004 finished decommissioning yesterday. [16:58:38] prtksxna: yeah, I think most of the servers had been done syncing for a while. mw1016 took a while to timeout.
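The manual check apergos walks through above (watch for a geoiplookup request in the Network tab, read mw.centralNotice.data.country in the console, then inspect the two GeoIP cookies) can be roughly approximated outside a browser. A minimal sketch, not from the log, assuming the cache layer still sets a "GeoIP" cookie on ordinary page views as described later in this session:

    import requests

    # Fetch a page the way a first-time visitor would; the edge cache is
    # expected to attach a GeoIP cookie when it can geolocate the client IP.
    resp = requests.get("https://en.wikipedia.org/wiki/Main_Page",
                        headers={"User-Agent": "geoip-cookie-check/0.1"})

    geoip = resp.cookies.get("GeoIP")
    if geoip is None:
        print("no GeoIP cookie; the client-side geoiplookup fallback would fire")
    else:
        # Values observed in this session look like "GR, Athens, stuff" or a
        # near-empty "v6"-style placeholder.
        print("GeoIP cookie:", geoip)

A second request sent with that cookie attached should trigger no further geoiplookup fetch, which is the repeat-view behaviour being verified above.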
[16:59:10] there is a problem with the job runners, I am not sure if it has been reported [17:01:55] RECOVERY - DPKG on mw1016 is OK: All packages OK [17:03:38] 6operations, 6Analytics-Kanban: Move misc/udp2log.pp to a module, and role/logging.pp somewhere better - https://phabricator.wikimedia.org/T122058#1895141 (10Ottomata) [17:04:14] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 60 failures [17:04:14] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:04:45] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:04:54] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [17:05:05] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [17:05:57] robh: time to watch the job queue for a global rename with +50000 edits? [17:06:27] jynus: ^ you are aware of this i see from earlier comment? [17:06:44] wasn't sure if you were commenting or investigating so just checking [17:06:52] (we're also in ops meeting, sorry if we reply slowly!) [17:08:04] PROBLEM - DPKG on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:10:30] O_O [17:10:55] PROBLEM - salt-minion processes on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:10:55] PROBLEM - dhclient process on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:11:15] PROBLEM - nutcracker port on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:12:24] RECOVERY - Disk space on mw1016 is OK: DISK OK [17:14:15] robh: in case you talked to me: not started yet, not sure about the aforementioned issues. [17:14:25] not related [17:14:35] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:14:42] Sorry, it's mid ops meeting, I cannot really interrupt it unless it's an outage. [17:14:48] Otherwise I plan to run it down post meeting [17:14:57] (we have a one hour meeting weekly to sync up the team) [17:15:05] oh, sure. no problem. sorry for bothering. [17:15:10] no bother at all! [17:15:36] Folks letting us know about stuff is awesome =] I'm most def. going to get on it post meeting =] [17:16:34] (03PS1) 10Ottomata: Add udp2log module, make role::logging::mediawiki use it [puppet] - 10https://gerrit.wikimedia.org/r/260394 (https://phabricator.wikimedia.org/T122058) [17:16:47] 6operations, 10hardware-requests: eqiad: (2) servers request for ORES - https://phabricator.wikimedia.org/T119598#1895210 (10yuvipanda) >>! In T119598#1885991, @RobH wrote: > ** We went with ores100X, but Alex should approve and may want to use something different. These aren't machines for running ORES itsel... [17:18:25] PROBLEM - Disk space on mw1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:20:02] 6operations, 10Incident-Labs-NFS-20151216: Add step in start-nfs to ask operator to consider dropping some snapshots - https://phabricator.wikimedia.org/T121890#1895225 (10yuvipanda) Let's do this if there is more than one extra snapshot. [17:21:54] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [17:23:00] hey hashar, yt? are you the right person to ask about puppet catalog compiler? [17:24:41] <_joe_> ottomata: I would be, what's your problem? [17:24:50] _joe_: are you not on vacation!?
[17:24:52] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1895233 (10Eevans) [17:24:53] <_joe_> yes [17:24:56] was confused because this [17:24:57] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1525/ [17:24:59] says no changes [17:25:01] <_joe_> just passing by IRC [17:25:04] http://puppet-compiler.wmflabs.org/1525/fluorine.eqiad.wmnet/ [17:25:05] http://puppet-compiler.wmflabs.org/1525/fluorine.eqiad.wmnet/ [17:25:07] oops [17:25:11] sorry [17:25:11] this [17:25:13] says no changes [17:25:14] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1525/ [17:25:29] but this says there are? http://puppet-compiler.wmflabs.org/1525/fluorine.eqiad.wmnet/ or am i reading it wrong? [17:25:29] <_joe_> look at the console output [17:25:36] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1895241 (10Eevans) [17:25:37] 6operations, 10RESTBase, 10RESTBase-Cassandra: Perform cleanups to reclaim space from recent topology changes - https://phabricator.wikimedia.org/T121535#1895240 (10Eevans) 5Open>3Resolved [17:25:47] <_joe_> that "no changes" means "not associated to any vcs event via zuul" [17:26:02] <_joe_> https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1525/console tells you 1 diff [17:26:43] hmm, ok, cool [17:26:44] good to know [17:26:56] is there a way to see the diff? like what files/lines actually would change? [17:26:57] on the host? [17:26:58] <_joe_> andrew b had the same doubt [17:27:09] <_joe_> https://puppet-compiler.wmflabs.org/1525/fluorine.eqiad.wmnet/ here you have the diff [17:27:27] yeah, i see that, but i mostly see class renames, etc. [17:27:31] which is expected [17:27:35] i want to know what actually changes on the host [17:28:01] <_joe_> all the file[] resources that are only in old, will be unmanaged now [17:28:15] <_joe_> the nrpe::check is gone too [17:29:05] hmm [17:29:10] <_joe_> looks like you removed something important like base::firewall? I dunno [17:29:32] <_joe_> service[ferm] is only in old... weird [17:30:08] <_joe_> yep, you have removed the inclusion of base::firewall somehow [17:30:27] Ohh, yes that makes sense, need to move that to node instead of module firewall rules [17:30:47] right? that should just be in node declaration, ja? [17:30:48] include base::firewall [17:30:48] ? [17:30:57] <_joe_> or in the role, yes [17:31:10] k role is fine [17:31:19] <_joe_> k, gtg now [17:31:23] k thank you bye! [17:31:23] :) [17:31:29] <_joe_> netflix waits for me! [17:31:32] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1895250 (10cwdent) Thanks @RobH, I checked with @andyrussg and those are indeed the correct groups. I signed the production access agreement and will assign this back to you.
[17:31:43] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1895253 (10cwdent) a:5cwdent>3RobH [17:32:46] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:33:04] 6operations: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1895256 (10RobH) a:3RobH [17:33:18] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1894746 (10RobH) [17:33:48] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1894746 (10RobH) So to confirm no data from these needs to be retained? The reclaim process includes wiping disks, so just checking. [17:34:43] (03PS2) 10Ottomata: Add udp2log module, make role::logging::mediawiki use it [puppet] - 10https://gerrit.wikimedia.org/r/260394 (https://phabricator.wikimedia.org/T122058) [17:35:30] 6operations, 10ops-codfw, 5Patch-For-Review: power off Codfw-Cisco Servers - https://phabricator.wikimedia.org/T115372#1895265 (10Papaul) a:5Papaul>3RobH This is complete. Please let me know if you have any questions. [17:37:41] (03CR) 10Ottomata: [C: 032] Add udp2log module, make role::logging::mediawiki use it [puppet] - 10https://gerrit.wikimedia.org/r/260394 (https://phabricator.wikimedia.org/T122058) (owner: 10Ottomata) [17:37:55] PROBLEM - configured eth on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:38:06] PROBLEM - nutcracker port on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:38:25] PROBLEM - nutcracker process on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:38:45] PROBLEM - salt-minion processes on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:39:04] PROBLEM - DPKG on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:39:15] PROBLEM - RAID on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:39:26] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:26] PROBLEM - Disk space on mw1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:40:16] ottomata: woohoo [17:40:32] great to see all these manifests get cleaned up [17:40:35] 6operations, 6Analytics-Kanban, 5Patch-For-Review: Move misc/udp2log.pp to a module - https://phabricator.wikimedia.org/T122058#1895283 (10Ottomata) [17:40:41] 6operations, 6Analytics-Kanban, 5Patch-For-Review: Move misc/udp2log.pp to a module [3 pts] - https://phabricator.wikimedia.org/T122058#1895064 (10Ottomata) [17:40:44] paravoid: :) [17:40:50] yeah, only two more files in manifests/misc left! [17:42:12] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1895294 (10MoritzMuehlenhoff) We've exported the original LDAP data (which was used for the import to seaborgium/serpens) and Andrew Bogott and myself have kept a copy of these for a while, a...
[17:42:14] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [17:42:26] RECOVERY - nutcracker process on mw1012 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [17:42:27] 6operations, 10hardware-requests: Decomission/reclaim nembus/neptunium - https://phabricator.wikimedia.org/T122050#1895295 (10MoritzMuehlenhoff) [17:42:45] RECOVERY - salt-minion processes on mw1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:42:56] RECOVERY - DPKG on mw1012 is OK: All packages OK [17:43:13] 6operations: setup/deploy WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895299 (10RobH) p:5Normal>3High [17:43:14] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [17:43:25] RECOVERY - Disk space on mw1012 is OK: DISK OK [17:43:25] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:43:55] RECOVERY - configured eth on mw1012 is OK: OK - interfaces up [17:47:25] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 13 failures [17:48:26] !log restarted hhvm on mw1012.eqiad.wmnet [17:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:42] (03PS1) 10Ottomata: Replace burrow kafka config with new role [puppet] - 10https://gerrit.wikimedia.org/r/260396 (https://phabricator.wikimedia.org/T121659) [17:57:28] akosiaris: have time to give this a look over today? https://gerrit.wikimedia.org/r/#/c/260047/ [17:57:37] oh sorry (in meeting and you are talking! :? ) [17:58:30] 6operations: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895327 (10coren) 3NEW [17:58:39] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1895337 (10RobH) p:5Normal>3High [18:00:33] Steinsplitter: so meeting is over and we have three opsen looking into the job runner issue [18:00:37] thank you for reporting it =] [18:00:47] akosiaris: i can talk about the hadoop->es thing whenever you want [18:00:53] AndyRussG: still need an ipv6 monkey? [18:01:22] so what are the job runner issues? [18:01:35] paravoid: take a look at https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Jobrunners+eqiad&h=&tab=m&vn=&hide-hf=false&m=mem_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [18:01:44] seems like a slow memory exhaustion [18:01:45] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 60 failures [18:01:55] apergos: sure that'd be fantastic! Sorry I remembered I have a meeting just now, but it'll be done in 30 min [18:02:00] heh [18:02:06] no problem, I have packages to roll [18:02:13] ottomata: so the reasoning is that we got nothing right now that can do what discovery wants, right ? [18:02:24] pageview api does not cover their needs, eventbus either ? [18:02:29] 6operations: setup/deploy WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895343 (10RobH) [18:02:37] (03PS2) 10Ottomata: Replace burrow kafka config with new role [puppet] - 10https://gerrit.wikimedia.org/r/260396 (https://phabricator.wikimedia.org/T121659) [18:02:53] no, akosiaris: https://phabricator.wikimedia.org/T120281#1891092 [18:02:58] apergos: K thanks! [18:03:00] ugh [18:03:01] memleak?
[18:03:09] paravoid: seems like it [18:03:15] 6operations, 10Wikimedia-Site-Requests: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895345 (10Jdforrester-PERSONAL) [18:03:15] no task yet [18:03:54] akosiaris: even if the scores were in aqs (which is a little weird, the scores will eventually be specific to search results), there would still need to be something that pulled the scores and POSTed updates to elasticsearch. [18:04:05] there's no good place to do that now [18:04:18] hadoop is a good place to do that, since it is a distributed job framework [18:04:30] oh that would obviously be a small client in ES cluster boxes [18:04:44] it would do no calculations or something [18:04:50] akosiaris: apparently it isn't so small, it's a lot of batch updates, and there needs to be failed job handling [18:04:52] just get the results and put them in ES [18:05:19] yeah, but that is basically standard in any of those things [18:05:24] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:05:25] like, hundreds of batches of millions of document updates [18:05:52] so the problem is that the scores are not in pageview api so they need to be calculated anyway [18:05:55] which makes sense [18:06:03] it would be weird for AQS to provide that [18:06:10] it's way to service specific anyway [18:06:14] too* [18:06:21] ja, scores could go into kafka [18:06:28] and then something could subscribe and do the updates [18:06:29] 6operations, 10Incident-Labs-NFS-20151216: Reinstall labstore1002 to ensure consistency with labstore1001 - https://phabricator.wikimedia.org/T121905#1895353 (10chasemp) p:5Triage>3High a:3chasemp [18:06:30] i agree that's better [18:06:35] but that doesn't really exist [18:06:41] and it does exist in hadoop [18:07:02] AaronSchulz: ping? [18:07:45] the way discovery was describing it, it sounded like making a new service or job runner or something just for this is more hacky and error prone than opening the firewall. [18:07:50] ottomata: yeah obviously that something does not exist and should exist and it's something that should be built in the ES cluster [18:07:55] hadoop has all the needed parallelization and job handling already [18:08:01] and not in a cluster in a galaxy far far away [18:08:18] hadoop should however be the provider of all that data (scores and such) [18:08:22] this is very similar to the change propagation problem [18:08:32] so i agree with you [18:08:38] i just don't think we have a solution now [18:08:45] 6operations, 10Incident-Labs-NFS-20151216: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#1895362 (10mark) a:3faidon [18:08:56] tbh, when the next thing like that comes, we still won't [18:09:04] ? [18:09:05] cause we will have solved the previous problem adhoc [18:09:15] and we will just repeat this convo [18:09:28] oh, the next thing that needs to reach out from hadoop? [18:09:29] aye. [18:09:49] ori: hey [18:09:59] yeah, i think the new job queue is very new and probably won't be ready for quite a while [18:10:01] def not next quarter [18:10:13] what new job queue ?
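A sketch of the subscriber ottomata describes above (scores produced to kafka, with something subscribing and doing the ES updates). This is illustrative only: the topic, broker, index and event fields are hypothetical, and the log itself records that no such consumer existed at the time.

    import json
    from kafka import KafkaConsumer            # kafka-python
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    consumer = KafkaConsumer(
        "scores",                               # hypothetical topic name
        bootstrap_servers=["kafka1001.eqiad.wmnet:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    es = Elasticsearch(["elastic1001.eqiad.wmnet:9200"])

    def actions():
        for msg in consumer:
            score = msg.value                   # assumed fields: wiki, page_id, score
            yield {
                "_op_type": "update",
                "_index": score["wiki"],
                "_id": score["page_id"],
                "doc": {"score": score["score"]},
            }

    # bulk() sends the updates chunk by chunk and raises on failed items,
    # which is the "failed job handling" concern raised above; a production
    # consumer would also commit kafka offsets only after a successful write.
    bulk(es, actions())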
[18:10:13] paravoid: hey [18:10:14] and you are right, there will probably be more asks for sending analytics data to prod systems before then [18:10:27] ori: jobrunners appear to be OOMing, see https://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Jobrunners+eqiad&h=&tab=m&vn=&hide-hf=false&m=mem_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [18:10:40] akosiaris: one of the proposed eventbus use cases is a new job queue [18:10:47] ori: it seems new, but to predate by a few days the expansion of the job pool that you did [18:10:50] based on kafka [18:11:07] ori: so I'm guessing that expansion had something to do with this? [18:11:16] nope [18:11:17] not sure why it's tight to event bus but yes a new jobqueue would be very very good [18:11:20] but, they are doing restbase change propagation first, which is similar, but not so generic [18:11:29] akosiaris: it's not tied to eventbus [18:11:38] s/tight/tied/ sorry [18:11:45] it's more tied to kafka, but eventbus would be related [18:11:49] in that schema-ed events produced to kafka [18:11:54] could be a source for certain jobs [18:12:00] seems to coincide with 2015-12-15 02:03 ori: deployed I6ebffe559 to job runners [18:12:01] ok [18:12:09] i don't remember what that change was, looking it up [18:12:28] the expansion of the job runner pool was then too [18:12:45] but I think the memory increase predates it by a couple of days [18:12:48] ottomata: ok then, we should make it a priority though to undo that hadoop => ES change [18:13:01] akosiaris: i think that's fari [18:13:03] fair [18:13:19] alright [18:13:30] akosiaris: i am worried though, new generic job queue sounds like a hard problem, and one that is solved sorta by hadoop already >.> haha, maybe we need a prod hadoop cluster :D [18:13:30] ? [18:13:51] oh it is a hard problem indeed [18:14:01] hmm [18:14:04] so yeah, the worry is that it is hard, and we will try to invent our own, and it will take a long time. [18:14:04] but it's something we need like yesterday [18:14:04] or maybe not [18:14:15] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:14:24] ottomata: oh we should not do it in NIH syndrome way [18:14:26] it coincides perfectly with that deploy [18:14:26] there were various events on the 6th, 9th and 11th [18:14:32] so, yes, if we have a new job queue, then we can prioritize moving these ES update jobs to that [18:14:39] but the real increase above standard levels is on the 15th indeed [18:14:40] ? [18:14:56] ottomata: ok, wanna comment on that task with that then ? [18:15:02] k [18:15:03] I 'll amend the router acls [18:15:26] 2015-12-15T18:36:00+00:00 121182924560 0 47741205725 [18:15:26] 2015-12-15T19:18:00+00:00 176712015830 0 306616497270 [18:15:38] yeah [18:15:47] actually no, that's 18:37 ori: Depooled and drained mw1161-1169 app servers, now re-purposing as job runners, per T121549 [18:15:50] hrmmmmmmm [18:16:01] i still think it's the change [18:16:22] anyways, how bad is it?
we can restart the jobrunner to stave off any immediate crisis [18:16:50] jobrunners are dying because of OOM [18:16:56] damn [18:16:59] first swapping, then eventually OOMing [18:17:00] ok [18:17:05] investigating then [18:17:05] ori: I 've restarted hhvm on mw1003 and mw1012 already [18:17:13] AaronSchulz: please look as well [18:17:17] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1895374 (10Ottomata) > when we have the service, we c... [18:17:18] thanks guys [18:17:26] I pinged above already, he hasn't responded yet [18:17:30] still early :) [18:17:51] akosiaris: updated ticket, lemme know if i misrepresented you there [18:18:03] in other qs: akosiaris can you look this over? https://gerrit.wikimedia.org/r/#/c/260047/ [18:18:19] 6operations, 10ops-eqiad, 10Incident-Labs-NFS-20151216, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1895381 (10coren) a:5coren>3chasemp [18:18:24] ottomata: yes I 've the tab opened already [18:18:31] cool thank you :) [18:18:34] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:18:36] PROBLEM - Restbase root url on restbase1007 is CRITICAL: Connection refused [18:18:51] the ratio of mem used to total hasn't changed a ton [18:18:54] PROBLEM - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:19:31] that's me ^, known [18:19:32] so, the ones that you added have tons of memory more [18:19:34] 6operations, 10Incident-Labs-NFS-20151216, 6Labs: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#1895383 (10yuvipanda) Need to figure out if lvm snapshots need to be activated for COW to work [18:19:36] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused [18:19:55] currently 245 / 742.1 = ~0.33, minima 65.2 / 177.7 = ~0.37 [18:19:59] 12G vs.
64G [18:20:36] mw1001 - mw1016 have 12G of RAM (older generation), while mw1161-mw1169 have 64G of RAM [18:20:53] (03CR) 10Ottomata: [C: 032] Replace burrow kafka config with new role [puppet] - 10https://gerrit.wikimedia.org/r/260396 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [18:22:26] ACKNOWLEDGEMENT - Restbase root url on restbase1007 is CRITICAL: Connection refused Filippo Giunchedi ssd/raid expansion in progress T119935 [18:22:26] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: Connection refused Filippo Giunchedi ssd/raid expansion in progress T119935 [18:22:27] ACKNOWLEDGEMENT - cassandra-a service on restbase1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Filippo Giunchedi ssd/raid expansion in progress T119935 [18:22:27] ACKNOWLEDGEMENT - restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.223, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi ssd/raid expansion in progress T119935 [18:22:51] currently mw1016 is responding at <1KB/s on serial console, I would powercycle it, but that will not solve the problem, only send it somewhere else [18:23:45] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:20] (03PS1) 10Ottomata: Remove unused ganglia kafka views [puppet] - 10https://gerrit.wikimedia.org/r/260404 (https://phabricator.wikimedia.org/T121659) [18:25:30] ottomata: <3 [18:25:45] hehe ;) [18:25:45] PROBLEM - NTP on mw1016 is CRITICAL: NTP CRITICAL: No response from NTP server [18:26:27] I've honestly lost track of the jobrunner setup post-hhvm [18:26:58] where are the job limits per runner provisioned now? jobrunner.conf? [18:27:19] modules/mediawiki/templates/jobrunner/jobrunner.conf.erb [18:27:49] !log mw1017: enabled jemalloc profiling, restarted hhvm, now running hhvm-collect-heaps [18:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:28:19] (03Abandoned) 10Dzahn: tor: use tor::instance to set up both [puppet] - 10https://gerrit.wikimedia.org/r/260064 (owner: 10Dzahn) [18:30:15] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:30:24] (03CR) 10Ottomata: [C: 032] Remove unused ganglia kafka views [puppet] - 10https://gerrit.wikimedia.org/r/260404 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [18:32:07] (03PS1) 10Ottomata: eventlogging role now uses new kafka analytics role [puppet] - 10https://gerrit.wikimedia.org/r/260405 (https://phabricator.wikimedia.org/T121659) [18:35:39] !log correction: previous log message was for mw1015, not mw1017 [18:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:35:53] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1895422 (10akosiaris) >>! In T120281#1895374, @Ottomata wrote: >> when we have the paravoid, akosiaris: do you know about hhvm-collect-heaps and hhvm-diff-heaps, btw? I'm already running it on mw1015 but i'm mentioning it since it's a useful tool. i should document it on wikitech. [18:36:06] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 60 failures [18:36:17] ori: no I did not.
please do document it [18:36:45] I do [18:37:00] I've code reviewed it too :P [18:37:20] (03PS3) 10Dzahn: tor: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/260065 [18:37:39] (03CR) 10jenkins-bot: [V: 04-1] tor: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/260065 (owner: 10Dzahn) [18:37:49] maybe we can reproduce it, paravoid, and then get some metrics about what exactly is causing it? [18:38:12] i re-did it as a python script and got it upstreamed back in june https://github.com/facebook/hhvm/blob/master/hphp/tools/hhvm-leak-isolator [18:38:21] but we don't have that yet [18:40:01] hm, I'm not sure if the 3.11 packages would ship that [18:40:03] I'll check and make sure [18:40:16] good to know! [18:43:54] RECOVERY - NTP on mw1016 is OK: NTP OK: Offset -0.000744342804 secs [18:44:15] RECOVERY - RAID on mw1016 is OK: OK: no RAID installed [18:44:25] RECOVERY - SSH on mw1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:44:25] RECOVERY - Disk space on mw1016 is OK: DISK OK [18:44:45] RECOVERY - configured eth on mw1016 is OK: OK - interfaces up [18:44:45] RECOVERY - salt-minion processes on mw1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:44:45] RECOVERY - dhclient process on mw1016 is OK: PROCS OK: 0 processes with command name dhclient [18:44:58] did you do something? [18:45:06] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [18:45:20] (I did not) [18:45:25] probably the OOM killer did [18:45:31] load average: 29.78, 38.11, 43.71 [18:45:34] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:45:45] (03CR) 10Dzahn: [C: 04-1] ""git::clone" not just "git", right?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260247 (https://phabricator.wikimedia.org/T120932) (owner: 10Hoo man) [18:45:54] RECOVERY - DPKG on mw1016 is OK: All packages OK [18:46:10] I will resync it [18:47:06] !log common-sync: Copying to mw1016.eqiad.wmnet from tin.eqiad.wmnet [18:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:54] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:48:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "various inline comments. I see there is provision for codfw from the start. That's fine but I am not in love with the comment approach, mi" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [18:48:42] ottomata: ^ [18:49:45] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:50:05] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 60 failures [18:50:16] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:50:23] 6operations, 10Incident-Labs-NFS-20151216: Investigate need and candidate for labstore100(1|2) kernel upgrade - https://phabricator.wikimedia.org/T121903#1895448 (10faidon) a:5faidon>3None Have we seen those in labstore1002? The ones I see in labstore1001 right now all look like being I/O related (I/O bein... [18:50:45] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:45] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:51:05] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:51] it is a "hot potato" situation :-) [18:52:14] RECOVERY - DPKG on mw1011 is OK: All packages OK [18:52:25] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [18:52:35] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:52:36] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [18:52:56] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed [18:53:14] (03PS4) 10ArielGlenn: tor: move role to module/role [puppet] - 10https://gerrit.wikimedia.org/r/260065 (owner: 10Dzahn) [18:53:37] (03CR) 10Ottomata: "Cool, can remove the codfw comments. We would provision this now, except we are blocked on https://phabricator.wikimedia.org/T121882 ther" [puppet] - 10https://gerrit.wikimedia.org/r/260047 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [18:56:30] (03PS1) 10Filippo Giunchedi: cassandra: add restbase100[3456] instances [dns] - 10https://gerrit.wikimedia.org/r/260411 [18:59:04] PROBLEM - configured eth on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:04] PROBLEM - RAID on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:41] apergos: we could maybe debug over hangout/screenshare for a few minutes? Or Krinkle have you discovered why apergos (and maybe you) continued to get geoiplookup calls on every pageview after the fix? Just saw https://gerrit.wikimedia.org/r/#/c/260409/, I guess that's the explanation, that we don't know which of the cookies we'll get via $.cookie? [19:00:04] PROBLEM - Disk space on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:00:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase100[3456] instances [dns] - 10https://gerrit.wikimedia.org/r/260411 (owner: 10Filippo Giunchedi) [19:00:33] * apergos peeks in [19:00:54] PROBLEM - SSH on mw1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:25] PROBLEM - nutcracker port on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:01:46] AndyRussG: Remember that for the majority case, this patch didn't cause repeated access, it introduced access in the first place. Previously geoiplookup would not be consulted at all. [19:02:03] Now it's working again, and as normal does it for each page view if Ipv6/varnish didn't get it [19:02:07] that's always been that way [19:02:09] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895469 (10RobH) [19:02:16] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 60 failures [19:02:30] paravoid, ps aux while freezing: https://phabricator.wikimedia.org/P2446 [19:02:42] Krinkle: nope [19:02:55] Krinkle: but the (failed) intent of setting the cookie in JS is to prevent repeated lookups, no? [19:02:56] the cookie is deliberately settable from JS [19:02:56] PROBLEM - dhclient process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:03:00] AndyRussG is correct [19:03:05] ori: read my comment [19:03:08] https://gerrit.wikimedia.org/r/#/c/260409/ [19:03:18] settability is irrelevant at this point [19:03:24] not unless the code is changed to use it in that way [19:03:24] http://blog.jasoncust.com/2012/01/problem-with-documentcookie.html [19:03:35] PROBLEM - salt-minion processes on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:03:45] (^ from Krinkle's comment) [19:03:48] it'll need 1) To know the wildcard domain (can be configurable, e.g. for all non-wikimedia.org wikis), 2) to search document.cookie and pick one [19:03:56] PROBLEM - nutcracker process on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:04:12] 6operations, 10ops-codfw: tegmen/wmf6381: apply hostname label & update visible label field in racktables - https://phabricator.wikimedia.org/T122065#1895473 (10RobH) 3NEW a:3Papaul [19:04:16] that code never existed and so that part never worked. It's always been either cookie from varnish once, or geo on every view [19:04:52] Krinkle: at least you can also confirm that on your IPv6 connection you're also getting correctly geolocated now, right? [19:04:54] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895488 (10RobH) [19:05:10] we can (and should, if not already) allow the browser to cache geo for some time though. Or alternatively, we could put it in window.sessionStorage which would match cookie semantics and make it easier instead of searching through cookies. [19:05:31] AndyRussG: As of this patch I'm now actually seeing banners for the first time where I'm staying in SF. [19:05:42] because it's now correctly falling back to geo ip lookups [19:05:53] Krinkle: well that's a damn fine improvement! [19:05:59] I see it also setting the cookie (I can tell, from the url-percent encoded colon in the value) [19:06:08] * AndyRussG guzzles all the champagne [19:06:13] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1882900 (10RobH) [19:06:15] :-D [19:06:26] PROBLEM - DPKG on mw1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:36] but it's not using the cookie on subsequent requests. But code for that doesn't exist so is expected to not work. [19:06:39] the varnish code that determines the domain is https://github.com/wikimedia/operations-puppet/blob/production/templates/varnish/geoip.inc.vcl.erb#L102-L134 [19:07:03] sounds good [19:07:21] Krinkle: OK so at least I wasn't completely wrongheaded in the original commet [19:08:12] (03CR) 10Ottomata: [C: 032] eventlogging role now uses new kafka analytics role [puppet] - 10https://gerrit.wikimedia.org/r/260405 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [19:08:45] ori: AndyRussG: So, yeah, it'd be a good improvement to update the client code to match the domain setting. That way it actually will work. [19:09:07] Looks like the varnish implementation was written in such a way to intentionally allow JS to update it, but was never made use of. [19:09:19] yeah [19:09:21] i wrote it :) [19:09:34] RECOVERY - salt-minion processes on mw1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:09:44] RECOVERY - nutcracker process on mw1011 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:09:55] RECOVERY - Disk space on mw1011 is OK: DISK OK [19:09:59] (03PS1) 10RobH: setting tegmen dns entries [dns] - 10https://gerrit.wikimedia.org/r/260412 [19:10:15] RECOVERY - DPKG on mw1011 is OK: All packages OK [19:10:45] RECOVERY - SSH on mw1011 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [19:10:45] RECOVERY - dhclient process on mw1011 is OK: PROCS OK: 0 processes with command name dhclient [19:10:51] Krinkle: ori: thanks again 4 your help on this!
[19:10:55] (03PS1) 10Ottomata: Use new kafka role in role::logstash::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/260413 (https://phabricator.wikimedia.org/T121659) [19:10:58] don't thank me [19:11:04] !log rolling restart of hhvm on the eqiad jobrunners [19:11:05] RECOVERY - configured eth on mw1011 is OK: OK - interfaces up [19:11:05] RECOVERY - RAID on mw1011 is OK: OK: no RAID installed [19:11:05] Krinkle gets all the credit [19:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:16] RECOVERY - nutcracker port on mw1011 is OK: TCP OK - 0.000 second response time on port 11212 [19:11:18] (03CR) 10RobH: [C: 032] setting tegmen dns entries [dns] - 10https://gerrit.wikimedia.org/r/260412 (owner: 10RobH) [19:11:47] apergos: looks like this is all figured out, the behaviour you were seeing is the same as what Krinkle described and explained! Thanks in any case :) [19:11:50] AndyRussG: does this mean I'm off the hook? or are there other things I can help you check ? [19:11:51] ah [19:11:55] perfect! [19:11:59] yeah [19:12:02] then I shall go check out... dinner! [19:12:03] All good 4 now! [19:12:06] tegmen, eh [19:12:10] apergos: enjoy! [19:12:14] thanks! [19:12:23] mutante: star name. [19:12:35] :) just saw [19:12:43] could also be a video game [19:12:58] i admit i may be partial to easy to type star names ;] [19:13:32] TIL.."A structure that covers or roofs over a part. " [19:13:33] Krinkle: on the nitpicky inline comment aspect of life, did you see my gerrit comment? I don't see when we would expect a "" GeoIP cookie value (as described in the comment in the code in your patch) [19:13:50] (03PS1) 10Filippo Giunchedi: cassandra: add restbase100[3456] instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/260414 [19:13:54] PROBLEM - Disk space on restbase1003 is CRITICAL: DISK CRITICAL - free space: /var 110739 MB (3% inode=99%) [19:13:55] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:14:05] !log powercycling mw1011 [19:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:30] (03CR) 10Ottomata: [C: 032] Use new kafka role in role::logstash::eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/260413 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [19:14:39] (03PS2) 10Dzahn: contint: rename git-daemon to git_daemon [puppet] - 10https://gerrit.wikimedia.org/r/260189 [19:14:45] (03PS2) 10Filippo Giunchedi: cassandra: add restbase100[3456] instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/260414 [19:14:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase100[3456] instances to seeds [puppet] - 10https://gerrit.wikimedia.org/r/260414 (owner: 10Filippo Giunchedi) [19:14:55] (03CR) 10Dzahn: [C: 032] contint: rename git-daemon to git_daemon [puppet] - 10https://gerrit.wikimedia.org/r/260189 (owner: 10Dzahn) [19:15:19] (03PS1) 10Ottomata: Use new kafka role in role::cache::kafka [puppet] - 10https://gerrit.wikimedia.org/r/260415 (https://phabricator.wikimedia.org/T121659) [19:15:30] (03PS2) 10Ottomata: Use new kafka role in role::cache::kafka [puppet] - 10https://gerrit.wikimedia.org/r/260415 (https://phabricator.wikimedia.org/T121659) [19:16:11] AndyRussG: The "" is unlikely indeed, but I always account for it based on past experience. 
[19:16:18] (03PS3) 10Dzahn: contint: rename git-daemon to git_daemon [puppet] - 10https://gerrit.wikimedia.org/r/260189 [19:16:21] Cookies are a fragile mechanism [19:16:27] Krinkle: where might it come from? [19:16:52] AndyRussG: A better example might be "" rather than "" [19:17:11] which I think the code already accounts for as well [19:17:17] (03PS2) 10Dzahn: mw_rc_irc: rename irc-echo to irc_echo [puppet] - 10https://gerrit.wikimedia.org/r/260202 [19:17:19] Krinkle: right, that's a specific example of the general case of cookie corruption [19:17:24] K yes [19:17:32] AndyRussG: The main thing I want to avoid is any assumptions like "if not A, then it is B" [19:17:43] But rather, make sure code always checks for both. [19:17:58] (03PS1) 10RobH: setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 [19:17:59] and not assume there are only two possible types of values and be lazy about validating the latter if the former isn't the case [19:18:18] Krinkle: yeah [19:18:25] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 42 failures [19:19:05] My (very big) mistake here was thinking I was understanding and correctly refactoring the code without checking expected inputs carefully [19:19:15] 42 failures means it's likely real, looks [19:19:17] And also assuming IPv6 was a very minor edge case [19:19:43] Krinkle: you may be interested in https://phabricator.wikimedia.org/T59126 [19:19:44] AndyRussG: So yeah, if fixing the second cookie bug (it not being updated) is allowed within the freeze, that'd be good. It would cut out the geoiplookup delay-to-banner from Ipv6 repeat views. [19:19:50] ("Provide a JavaScript API for geoip lookups ") [19:19:53] (03PS1) 10Filippo Giunchedi: restbase: reprovision restbase1004 [puppet] - 10https://gerrit.wikimedia.org/r/260418 [19:19:55] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:06] (03PS3) 10Ottomata: Use new kafka role in role::cache::kafka [puppet] - 10https://gerrit.wikimedia.org/r/260415 (https://phabricator.wikimedia.org/T121659) [19:20:14] ori: hmm [19:20:15] it should be done correctly in one place, and all js code that needs geoip info should use some common api [19:20:25] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:20:40] (03PS2) 10RobH: setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 [19:20:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: reprovision restbase1004 [puppet] - 10https://gerrit.wikimedia.org/r/260418 (owner: 10Filippo Giunchedi) [19:21:33] (03PS3) 10RobH: setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 [19:21:41] Krinkle: ori: I think there was a patch to use a different database now that does deal with IPv6 geolocation... https://gerrit.wikimedia.org/r/#/c/253619/ [19:21:41] (03CR) 10RobH: [C: 032] setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 (owner: 10RobH) [19:22:07] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895531 (10RobH) [19:22:12] That would be even better but I figured that'd be too high risk at this point.
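Krinkle's point above about never assuming "if not A, then it is B" for cookie values can be made concrete. A sketch of that defensive parsing (written in Python for brevity; the real code is client-side JavaScript), treating the cookie as untrusted input with three shapes: absent, present but incomplete (the "" corruption case, or a bare v6-era placeholder), or fully populated. The exact field layout is an assumption; the log only shows values like "GR, Athens, stuff" and "v6".

    def parse_geoip_cookie(value):
        """Return geo fields, or None when a fresh geoiplookup is needed."""
        if not value:                  # cookie missing, or the "" corruption case
            return None
        fields = value.split(":")
        # Assumed colon-delimited layout with the country code first.
        if len(fields) < 2 or not fields[0]:
            return None                # incomplete: fall back to geoiplookup
        return {"country": fields[0], "city": fields[1]}

    assert parse_geoip_cookie("") is None
    assert parse_geoip_cookie("v6") is None
    assert parse_geoip_cookie("GR:Athens")["country"] == "GR"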
[19:22:17] (03PS4) 10Dzahn: contint: rename git-daemon to git_daemon [puppet] - 10https://gerrit.wikimedia.org/r/260189 [19:22:18] AndyRussG: that is total orthogonal [19:22:21] 6operations, 10Wikimedia-Site-Requests: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895533 (10Krenair) Has the venue been announced yet? [19:22:22] !start rebase race [19:22:25] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [19:22:27] if anything, it's another reason to have a single implementation [19:22:32] !log reimage restbase1004 [19:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:47] so that if the format of the data provided by varnish changes, there is only one place to update [19:22:49] (03PS4) 10Ottomata: Use new kafka role in role::cache::kafka [puppet] - 10https://gerrit.wikimedia.org/r/260415 (https://phabricator.wikimedia.org/T121659) [19:23:18] ori: I totally agree on that, yes of course [19:23:21] ori: Yeah, I think centralising the new geoIP module in CentralNotice would be a good start. Though I propose to change the interface from what it is currently in CentralNotice. [19:23:26] 6operations, 10Wikimedia-Site-Requests: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895535 (10Krenair) [19:23:37] (03CR) 10Ottomata: [C: 032 V: 032] Use new kafka role in role::cache::kafka [puppet] - 10https://gerrit.wikimedia.org/r/260415 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [19:24:04] Krinkle: that'd be fine of course. Our first (failed) task was to correctly separate CN from GeoIP stuff, and have an interface that worked for what we needed [19:24:42] The most we'd need is a promise or something to tell us when GeoIP data is ready [19:25:23] Krinkle: aaaand for even better CN performance, eventually getting geo data as an input into RL context [19:25:50] (potentially though there are cache fragmentation issues, hopefully not insurmountable) [19:25:51] A promise that is resolved with the geo object. As opposed to being resolved with undefined, to then look at window.Geo [19:26:05] tries to get something merged.. rebase number 4 in a row [19:26:13] Krinkle: sure that's also fine [19:26:14] (03PS5) 10Dzahn: contint: rename git-daemon to git_daemon [puppet] - 10https://gerrit.wikimedia.org/r/260189 [19:26:14] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:26:16] (03PS1) 10Ottomata: Use new kafka role in eventlogging alerts [puppet] - 10https://gerrit.wikimedia.org/r/260419 (https://phabricator.wikimedia.org/T121659) [19:26:25] (03PS4) 10RobH: setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 [19:26:29] arrg [19:26:32] endless rebasing in ff [19:26:39] yes [19:26:39] mutante: me too ;] [19:26:43] we are doing it to one another no doubt [19:26:47] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1895546 (10faidon) 3NEW [19:26:50] Krinkle: well, it would let us reduce the number of users that we run the client-side campaign selection code for, and cut down the size of that code [19:27:03] mutante: yours is first in testing [19:27:04] i shall wait [19:27:07] AndyRussG: There are 101 other ways we can do that.
[19:27:10] pls lemme know when yer merged [19:27:25] Fragmenting RL by Geo is both not scalable and also highly confusing and separates the concerns wrongly. [19:27:28] !log running checkLocalUser.php --delete=1 for real this time on terbium [19:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:35] mine was behind everyone else that last time [19:27:51] robh: thanks, one done. go ahead [19:27:52] (03PS2) 10Ottomata: Use new kafka role in eventlogging alerts [puppet] - 10https://gerrit.wikimedia.org/r/260419 (https://phabricator.wikimedia.org/T121659) [19:28:00] (03PS5) 10RobH: setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 [19:28:19] Krinkle: OK, happy to hear about other routes. True, piggybacking CN choiceData on RL is a bit of a hack, but it also allows us to bring in RL dependencies only as needed [19:28:21] * robh finds this amusing [19:28:40] AndyRussG: Like I said, there are other ways. We can talk about it in January [19:28:43] git was great at distributed team merging until we made it not so much ;] [19:28:54] Krinkle: sounds excellent! [19:29:15] (03CR) 10RobH: [V: 032] setting tegmen install parameters [puppet] - 10https://gerrit.wikimedia.org/r/260417 (owner: 10RobH) [19:29:19] robh: at one point we changed the policy, to fast-forward only [19:29:39] tired of waiting and it passed the 6 other rebases. [19:29:42] since then it's even more rebasing but it prevents that weird issue we ran into once [19:29:54] AndyRussG: Is there any metric like 'time to banner' at the moment? [19:30:04] ok, done [19:30:08] 6operations, 10Fundraising-Backlog, 10Traffic, 10Unplanned-Sprint-Work, 3Fundraising Sprint Zapp: Firefox SPDY is buggy and is causing geoip lookup errors for IPv6 users - https://phabricator.wikimedia.org/T121922#1895565 (10faidon) While this is definitely a real bug, it appears that its effect was grea... [19:30:22] something we could track and measure impact of when we improve e.g. Geo cookie updating. [19:30:25] (03PS3) 10Ottomata: Use new kafka role in eventlogging alerts [puppet] - 10https://gerrit.wikimedia.org/r/260419 (https://phabricator.wikimedia.org/T121659) [19:30:31] (03CR) 10Ottomata: [C: 032 V: 032] Use new kafka role in eventlogging alerts [puppet] - 10https://gerrit.wikimedia.org/r/260419 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [19:30:53] (03PS1) 10Ottomata: Remove role::analytics::kafka::* [puppet] - 10https://gerrit.wikimedia.org/r/260422 (https://phabricator.wikimedia.org/T121659) [19:30:54] Krinkle: no, but I'd like to make one! I was thinking of ways of joining webrequests related to unique pageviews [19:31:33] Krinkle: ori: BTW, the pageview impression discrepancy is all but gone now (just checked Hive data following the deploy) [19:32:55] ottomata: so what's the plan with limn.pp then? :) [19:33:55] (03PS1) 10Yuvipanda: toollabs: Remove NFS dependency from proxies [puppet] - 10https://gerrit.wikimedia.org/r/260424 [19:34:04] valhallasw`cloud: ^ [19:34:27] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:34:45] AndyRussG: Cool. Would love to maybe later hear a rough % increase for donations per hour or some other metric you guys have [19:34:56] $$$ [19:35:01] (03CR) 10Merlijn van Deen: [C: 04-1] toollabs: Remove NFS dependency from proxies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/260424 (owner: 10Yuvipanda) [19:35:20] Krinkle: you bet!
[19:35:47] (03PS2) 10Yuvipanda: toollabs: Remove NFS dependency from proxies [puppet] - 10https://gerrit.wikimedia.org/r/260424 [19:35:50] valhallasw`cloud: bah ^ [19:36:02] valhallasw`cloud: I made that change, but atom's ex-mode decided to stop working and ':w' did not work [19:36:05] * YuviPanda efiddles [19:36:43] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Remove NFS dependency from proxies [puppet] - 10https://gerrit.wikimedia.org/r/260424 (owner: 10Yuvipanda) [19:38:18] (03CR) 10Ottomata: [C: 032] Remove role::analytics::kafka::* [puppet] - 10https://gerrit.wikimedia.org/r/260422 (https://phabricator.wikimedia.org/T121659) (owner: 10Ottomata) [19:40:45] (03PS3) 10Yuvipanda: toollabs: Remove NFS dependency from proxies [puppet] - 10https://gerrit.wikimedia.org/r/260424 [19:40:53] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Remove NFS dependency from proxies [puppet] - 10https://gerrit.wikimedia.org/r/260424 (owner: 10Yuvipanda) [19:42:10] 6operations, 6Performance-Team, 5Patch-For-Review: Provision additional jobrunners - https://phabricator.wikimedia.org/T121549#1895613 (10ori) 5Open>3Resolved a:3ori [19:45:57] (03PS3) 10Dzahn: mw_rc_irc: rename irc-echo to irc_echo [puppet] - 10https://gerrit.wikimedia.org/r/260202 [19:46:21] (03CR) 10Dzahn: [C: 032] mw_rc_irc: rename irc-echo to irc_echo [puppet] - 10https://gerrit.wikimedia.org/r/260202 (owner: 10Dzahn) [19:47:33] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1895631 (10Krinkle) [19:48:42] (03PS1) 10Yuvipanda: tools: fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/260427 [19:48:58] ccccccevibtjiehdcrchjevjkuuceiiurcnuveljvuhe [19:48:59] (03PS2) 10Yuvipanda: tools: fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/260427 [19:49:05] aahhh nice [19:49:08] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: fix stupid typo [puppet] - 10https://gerrit.wikimedia.org/r/260427 (owner: 10Yuvipanda) [19:49:57] akosiaris: 2FA over irc! [19:50:05] :P [19:51:11] 6operations, 10Wikimedia-Mailing-lists: Need listadmin password reset for Wikitech-ambassadors mailing list - https://phabricator.wikimedia.org/T122070#1895644 (10Az1568) 3NEW a:3RobH [19:53:58] (03PS4) 10Dzahn: mw_rc_irc: rename irc-echo to irc_echo [puppet] - 10https://gerrit.wikimedia.org/r/260202 [19:54:12] 6operations, 10Wikimedia-Mailing-lists: Need listadmin password reset for Wikitech-ambassadors mailing list - https://phabricator.wikimedia.org/T122070#1895693 (10RobH) 5Open>3Resolved done and email sent! [19:55:09] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1895709 (10ori) a:5aaron>3Ottomata [19:55:11] chasemp: andrewbogott valhallasw`cloud ok, so NFS is gone from the proxies now! \o/ [19:55:32] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1881702 (10ori) @Ottomata can you indeed file a task for codfw zookeeper? [19:55:46] fwiw, you can https://wikitech.wikimedia.org/wiki/Hiera:Tools/host/tools-proxy-01 and unmount the mounts by hand and puppet won't mount them back [19:56:56] (03PS1) 10Dzahn: add wikimania2017, regular and .m. [dns] - 10https://gerrit.wikimedia.org/r/260431 (https://phabricator.wikimedia.org/T122062) [19:57:22] YuviPanda: cool. maybe do that and test if everything is OK? 
including puppet [19:57:32] I did [19:57:37] ah, good [19:57:40] :) [19:57:52] ok, now to go do halfak stuff, then check on worker-08 [19:57:54] hi halfak [19:58:04] ready to break staging again? :) [19:58:09] Yeah! [19:58:21] (03PS2) 10Dzahn: add wikimania2017, regular and .m. [dns] - 10https://gerrit.wikimedia.org/r/260431 (https://phabricator.wikimedia.org/T122062) [19:58:32] halfak: ok let's break it [19:58:36] (03CR) 10Dzahn: [C: 032] add wikimania2017, regular and .m. [dns] - 10https://gerrit.wikimedia.org/r/260431 (https://phabricator.wikimedia.org/T122062) (owner: 10Dzahn) [19:58:57] So last time, we had problems in our merged config, right? [19:59:13] halfak: yup. it didn't seem to 'take' [19:59:47] halfak: I'm going to undo my local changes now [19:59:48] mutante, aren't you going to wait for my question to be addressed? [20:00:11] Krenair: no, all i'm doing is adding DNS, not installing a wiki [20:00:15] ok... [20:00:17] and that's needed either way [20:00:22] as the first step [20:00:23] halfak: we also didn't do a pip upgrade, so yamlconf didn't get picked up [20:00:32] ops will also need to do the apache part [20:01:05] Oh yeah. Good point. [20:01:05] halfak: 44: ERROR/MainProcess] consumer: Cannot connect to redis://ores-redis-01:6379//: Error 111 connecting to ores-redis-01:6379. Connection [20:01:10] halfak: on -staging [20:01:11] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895775 (10Dzahn) added the DNS entry which is needed as the first step wikimania2017.wikimedia.org exists now and the wiki installation can go ahead [20:01:20] halfak: so I think the celery endpoint isn't doing the merging [20:01:21] (adding wikimania2017 to the wikimedia ServerAlias in modules/mediawiki/files/apache/sites/wikimania.conf) [20:01:40] YuviPanda, checking on that [20:02:06] YuviPanda, definitely is merging. [20:02:10] 6operations, 10ops-codfw, 10EventBus, 10MediaWiki-Cache, and 2 others: setup kafka2001 & kafka2002 - https://phabricator.wikimedia.org/T121558#1895776 (10Ottomata) https://phabricator.wikimedia.org/T121882 [20:02:19] halfak: can you point me to the code that's doing the merging for celery? [20:02:25] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895777 (10Krenair) a:3Krenair From the puppet side we'll also need to add wikimania2017 to the wikimania ServerAlias in modules/mediawiki/files/apache/sites/wikimania.conf [20:02:41] https://github.com/wiki-ai/ores-wikimedia-config/blob/master/ores_celery.py#L10 [20:02:59] load() will automatically merge multiple arguments [20:03:03] hmm right [20:03:16] Such that args on the right get precedence. [20:03:33] * halfak is proud of how short this script is. [20:03:35] so uwsgi seems to be able to merge them fine [20:03:38] and celery can't [20:03:40] which is strange [20:03:50] * halfak checks on the staging server [20:04:55] Hmmmm [20:05:03] ooo I wonder what their cwd is [20:05:11] Yeah [20:05:13] My thought [20:05:15] too [20:05:39] It was finding the config file before though [20:05:44] 695: /srv/ores/config [20:05:46] it's fine [20:05:48] Looks right to me [20:06:04] YuviPanda, shouldn't even start up since the main config is coming from that dir too. 
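For reference, the merge behaviour halfak describes above ("load() will automatically merge multiple arguments... args on the right get precedence") amounts to a right-biased recursive dict merge. A minimal illustration of those semantics, not yamlconf's actual implementation:

    def merge(left, right):
        """Recursively merge two config dicts; right-hand values win."""
        out = dict(left)
        for key, value in right.items():
            if isinstance(value, dict) and isinstance(out.get(key), dict):
                out[key] = merge(out[key], value)
            else:
                out[key] = value
        return out

    base = {"redis": {"host": "ores-redis-01", "port": 6379}}
    override = {"redis": {"host": "ores-redis-02"}}
    assert merge(base, override) == {"redis": {"host": "ores-redis-02", "port": 6379}}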
[20:06:08] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895784 (10RobH) [20:06:32] halfak: yup [20:07:22] * halfak restarts celery [20:07:58] so yeah... this is weird. [20:08:04] * YuviPanda straces [20:08:14] OH! [20:08:16] Can't read [20:08:18] Perms are wrong [20:08:26] :) [20:08:29] YuviPanda, ^ [20:08:49] halfak: no, they're right since www-data can read them [20:08:51] *you* can't read them [20:08:57] if you do sudo -u www-data it works [20:09:02] YuviPanda, perms are different for both files [20:09:10] I bet that's why it is reading one and not the other. [20:09:13] It just makes sense. [20:09:19] Maybe it isn't running as www-data [20:09:24] halfak: it is [20:09:31] I checked that too [20:09:33] ps auxf [20:09:41] also that entire directory is owned by www-data [20:10:23] Damn. Seemed so reasonable [20:11:05] So, we can be sure that the first config file *is* loaded. [20:11:20] am looking at the strace now [20:11:24] to see what's going on [20:11:40] (/tmp/strac if you wanna follow along) [20:12:26] Oh! Wait... what? For some reason, 99 comes before 00 in the glob [20:12:45] aah [20:12:52] yes [20:12:54] I see that [20:13:01] But why is that not true for the other? [20:13:04] it's opening 99 first [20:13:04] Is it just random? [20:13:11] that's possible but why randomly consistent? [20:13:18] halfak: or maybe the other one is failing but not at startup? [20:13:27] Could be. [20:13:28] since I was only watching startup logs for failure [20:13:34] Not using cache unless we can run the scorer. [20:13:47] So why does glob read in reverse alpha order? [20:14:04] I think glob doesn't guarantee order [20:14:11] That's fair I guess. [20:14:19] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895806 (10RobH) [20:14:23] http://stackoverflow.com/a/6773636 [20:14:25] * halfak works on [20:14:38] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895808 (10RobH) a:5RobH>3akosiaris Assigning to Alex for implementation. [20:14:39] halfak: apparently it's just 'order in which they appear in the underlying filesystem' [20:14:46] which for all our purposes is 'random' [20:15:15] yeah [20:16:49] halfak: ok, so I guess we should sort that in python before passing it on. [20:16:54] TIL [20:17:30] * halfak deploys to staging [20:17:54] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1895829 (10Krenair) @Coren: I assume you should still be mentioned in https://wikitech.wikimedia.org/wiki/Add_a_wiki#Start and that you've taken the necessary steps on the ops... [20:18:41] * halfak waits for uwsgi to restart [20:19:20] halfak: we gotta debug that someday soon, it's getting annoying [20:19:45] Yeah. [20:19:58] YuviPanda, seems to be working [20:20:07] halfak: \o/ [20:20:27] halfak: all hail strace! [20:21:29] halfak: can you run a precached at staging if it isn't already? [20:22:34] Will do. [20:23:38] YuviPanda, running [20:23:50] Looks like we're slowly getting behind. [20:23:56] Too many models for 1 machine, maybe [20:24:13] * YuviPanda nods [20:24:19] we should make it a real cluster soon [20:24:25] halfak: ok, so I see both redises are being used [20:24:32] halfak: do you think we should flip the switch for prod too? [20:24:56] YuviPanda, so, can you walk me through what will happen if all goes well and what our biggest risk is?
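The fix they settle on is worth spelling out: glob.glob() makes no ordering promise — per the Stack Overflow answer linked above, it returns entries in whatever order the underlying filesystem yields them — so numbered config fragments must be sorted explicitly. A short sketch; the filenames and pattern are hypothetical, only the 00/99 prefixes come from the log:

```python
import glob

# glob returns filesystem order, which is effectively arbitrary --
# "99-something.yaml" really can come back before "00-main.yaml".
# sorted() restores the intended numeric-prefix ordering.
config_paths = sorted(glob.glob("config/*.yaml"))
config = load(*config_paths)  # load() as sketched above
```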
[20:25:09] We have both redises available now, right? [20:25:13] yeah [20:25:30] all goes well, we see no service disruption anywhere [20:25:34] So, the only meaningful change is that we'll be making requests for cached scores from a new redis. [20:25:46] no, both the new redises are on one new machine [20:25:52] it's redis-multi-instance [20:26:02] they're just different so one crashing doesn't kill the other [20:26:15] Biggest risk, I suppose, is that we messed something up in the merged config. [20:26:19] the old-redis machine has puppet disabled to prevent redis thrashing about [20:26:27] halfak: well, for prod we aren't doing any config merging at all [20:26:34] Oh yeah. [20:27:06] halfak: so let's depool a web node and a worker node and deploy to them both? [20:27:10] +1 [20:27:32] halfak: ok, so I'll do the depooling now and ping you to do deploy? [20:27:34] Should I make an alternative fabfile.py for the two? [20:27:49] halfak: you can just edit the hostnames in the current fabfile for now? [20:28:03] halfak: since this shouldn't take long and we'll have completely switched over or reverted in less than 20-30min [20:28:12] Sure [20:28:22] How about web-01 and worker-01 [20:28:24] ? [20:28:28] yup [20:29:24] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1004-a instance [puppet] - 10https://gerrit.wikimedia.org/r/260438 [20:30:14] (03PS2) 10Filippo Giunchedi: cassandra: add restbase1004-a instance [puppet] - 10https://gerrit.wikimedia.org/r/260438 [20:30:19] OK. Ready [20:30:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1004-a instance [puppet] - 10https://gerrit.wikimedia.org/r/260438 (owner: 10Filippo Giunchedi) [20:30:46] halfak: ok, depooled. deploy go! :) [20:30:52] halfak: make sure it's only hitting web-01 and worker-01 [20:31:10] * halfak confirms very carefully [20:31:49] * halfak updates virtualenv on web-01 [20:32:09] "error: cannot fork() for fetch-pack: Cannot allocate memory" [20:32:14] Weird. [20:32:19] hmm [20:32:21] what host is that on? [20:32:25] web-01 [20:32:29] try again? [20:32:34] While running "sudo git fetch origin" [20:32:34] could be a transient condition [20:32:35] kk [20:32:40] could be too many uwsgi processes :) [20:32:49] Same error [20:34:24] halfak: I'm killing uwsgi on that host [20:34:47] OK [20:34:59] This could be a problem if we need to do it for deploys though. :/ [20:35:29] yeah [20:35:45] there isn't enough memory to stop uwsgi hehe [20:35:48] let me force kill it [20:35:53] we should reduce the number of workers [20:36:07] jynus: still busy? [20:36:20] halfak: try now? [20:36:21] PROBLEM - Restbase root url on restbase1004 is CRITICAL: Connection refused [20:37:02] PROBLEM - cassandra CQL 10.64.32.160:9042 on restbase1004 is CRITICAL: Connection refused [20:37:21] PROBLEM - cassandra service on restbase1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive [20:37:53] * halfak tries and succeeds. [20:38:05] worker-01 went without issue. [20:38:11] Running the deploy now. [20:38:31] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.160, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [20:38:43] worker-01 is ready [20:39:00] web-01 is ready. Somehow the restart was REALLY fast [20:39:11] Maybe because it was already stopped [20:39:13] YuviPanda, ^ [20:39:24] halfak: curl some stuff at worker-01?
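The "cannot fork() for fetch-pack" error above is git failing at the fork() syscall itself: the kernel refuses to duplicate a process when it cannot commit enough memory, which fits the theory that a crowd of uwsgi workers had eaten the headroom on web-01. A minimal sketch of how the same failure surfaces from Python on a Unix host:

```python
import os
import sys

try:
    pid = os.fork()  # asks the kernel to duplicate this process
except OSError as e:
    # Under memory pressure this raises errno 12 (ENOMEM) -- the same
    # condition git reports as "cannot fork() for fetch-pack".
    sys.exit(f"cannot fork: {e}")

if pid == 0:
    os._exit(0)        # child: exit immediately in this sketch
os.waitpid(pid, 0)     # parent: reap the child
```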
:) [20:39:39] halfak: or rather, set a precached at it? [20:39:47] * YuviPanda monitors redis [20:40:41] 6operations, 10Salt: salt minions need 'wake up' test.ping after idle period before they respond properly to commands - https://phabricator.wikimedia.org/T120831#1895906 (10ArielGlenn) bleah, I had broken packages and not sure how I managed it. But piles of code from the singleton auth change were missing in... [20:40:46] nuria, send me a mail (or preferably, a ticket). I will handle it tomorrow morning [20:41:49] YuviPanda, port 8080 on web-01? [20:42:03] halfak: ya [20:42:14] 500 [20:42:18] looking at log [20:42:25] hmm [20:42:32] 6operations: setup/deploy tegmen/WMF6381 as monitoring host - https://phabricator.wikimedia.org/T121583#1895918 (10Papaul) [20:42:33] 6operations, 10ops-codfw: tegmen/wmf6381: apply hostname label & update visible label field in racktables - https://phabricator.wikimedia.org/T122065#1895916 (10Papaul) 5Open>3Resolved Complete . [20:43:00] halfak: aha, the redises aren't listening to traffic from elsewhere [20:43:02] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail [20:43:02] * YuviPanda fixes [20:43:18] Cool. Also, the log seems to end on Nov 15th :S [20:43:18] jynus: np, it was the eventlogging db work, you can update the ticket as need be. [20:45:14] (03PS1) 10Yuvipanda: ores: Have the redises listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/260440 [20:45:16] halfak: ^ should fix it [20:45:18] * YuviPanda merges [20:45:32] (03PS2) 10Yuvipanda: ores: Have the redises listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/260440 [20:45:34] (03CR) 10jenkins-bot: [V: 04-1] ores: Have the redises listen on 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/260440 (owner: 10Yuvipanda) [20:45:52] YuviPanda: btw, what's the status of migrating tools redis to redis::instance? [20:46:22] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [20:46:30] ori: I haven't given any thought to it given all the other raging fires, tbh. [20:46:43] hopefully sometime this week [20:47:18] ori: I will ping you when it happens and give you the satisfaction of rming redis::legacy [20:47:21] :) [20:47:24] it probably won't survive the year [20:47:34] it's not just that, i have some follow-ups [20:47:36] no biggie tho [20:47:43] ori: do you want me to ping you before I do the move? [20:47:58] (03PS1) 10RobH: setting up shell access for Casey Dentinger [puppet] - 10https://gerrit.wikimedia.org/r/260442 [20:48:09] ori: I was going to use the opportunity to also move them to jessie and untangle them from the terrible puppet inheritance tree that forces NFS on the redises [20:48:22] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [20:48:37] ori: followups specific to tools or? [20:48:42] (03PS2) 10RobH: setting up shell access for Casey Dentinger [puppet] - 10https://gerrit.wikimedia.org/r/260442 [20:48:53] no, redis stuff in general [20:49:02] PROBLEM - Disk space on restbase1008 is CRITICAL: DISK CRITICAL - free space: /srv 73400 MB (3% inode=99%) [20:49:02] (03CR) 10Yuvipanda: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/260440 (owner: 10Yuvipanda) [20:49:08] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1895956 (10RobH) 5Open>3stalled Sounds good. All access requests additionally have a 3 business day wait.
So this access can go live on Wednesday. I'll keep it assigned to me and merg... [20:49:13] ori: ah, ok. [20:49:17] !log restbase1004 bootstrap failed, restbase1007-a is down java.lang.RuntimeException: A node required to move the data consistently is down (/10.64.0.230). [20:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:49:35] urandom gwicke ^ [20:49:56] I have to go now though [20:50:07] (03CR) 10Yuvipanda: [C: 032 V: 032] "FINE JENKINS" [puppet] - 10https://gerrit.wikimedia.org/r/260440 (owner: 10Yuvipanda) [20:51:43] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for cwdent - https://phabricator.wikimedia.org/T121916#1895965 (10RobH) I also want to note (and I've sent @cwdent a PM regarding this), if this access is needed as part of the fundraiser, and the 3 day wait cannot be observed, we can ask @mark... [20:52:45] godog: oh. [20:52:48] YuviPanda, should I be waiting or trying again? [20:52:56] halfak: yeah, in about 30s [20:52:59] kk [20:53:12] halfak: ok try now? [20:53:24] godog: perhaps you should suffix logs like that with FML [20:53:40] YuviPanda, all we really need is a restart on the services, right? [20:53:43] halfak: yeah [20:53:44] urandom: indeed! I've disabled puppet too [20:53:48] kk. [20:54:28] * halfak waits for uwsgi [20:54:56] godog: well, unless your comfort level in that array has increased on the last couple of hours, maybe we should do 1007 next [20:55:08] s/increased on/increased in/ [20:55:21] 6operations, 10ops-codfw, 10fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1895987 (10Jgreen) root@bellatrix:~# /usr/sbin/hpssacli ctrl slot=0 pd 1I:1:12 show detail Smart Array P420i in Slot 0 (Embedded) physicaldrive 1I:1:12 Port... [20:55:30] halfak: ok I see traffic from celery now [20:55:42] halfak: but nothing from precached [20:55:52] * halfak kicks off precached [20:55:59] Timings look right. [20:56:14] apergos: try again now, if you want to? [20:56:25] YuviPanda, check now? [20:56:26] sure [20:56:40] halfak: yup, I see it! [20:56:43] halfak: wooo! [20:56:47] halfak: move another worker over? [20:56:51] Yeah. [20:56:53] -02? [20:56:54] halfak: then I can flip the lb to use this, and then we can move the rest [20:56:56] halfak: yup [20:57:07] andrewbogott: "failed to create instance." [20:57:21] over quota? [20:59:12] that's all it said [20:59:14] I have 4 instances, all small and all with not much in 'em [20:59:16] shall I delete it again? [20:59:16] YuviPanda, ready when you are [20:59:16] halfak: ok! [20:59:16] halfak: did you do the deploy? [20:59:16] 6operations, 6Performance-Team: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1895995 (10aaron) The first change only affected JobChron. From http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1450730658.495&target=jobrunner.memory.*.count&from=-7days jobchron and jobrunner... [20:59:26] YuviPanda, running [20:59:33] halfak: ok! [20:59:37] halfak: let me know when done [20:59:53] and I'll flip over the lb to web-01 [20:59:56] and we can do the rest [21:00:03] apergos: hang on, I'll check again [21:00:05] k [21:01:00] Weird. The fabfile script is hanging and I can't ^C [21:01:35] OK. Trying again [21:01:46] YuviPanda, done [21:01:50] halfak: okkk [21:02:04] halfak: I'm going to flip the lb, then we run precached on the lb to make sure stuff's ok, and then we deploy to the rest. [21:02:23] flipping [21:02:23] OK [21:02:42] halfak: run precached against lb?
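Once the 0.0.0.0 bind change from r/260440 is in, a quick way to confirm the redises actually accept remote traffic is a bare TCP connect from another host. A sketch, assuming the ores-redis-01 hostname and port 6379 from the connection error earlier; the second port is a guess at the multi-instance layout:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 6379 appears in the earlier connection error; 6380 is a hypothetical
# second port for the redis-multi-instance setup.
for port in (6379, 6380):
    state = "open" if can_connect("ores-redis-01", port) else "refused"
    print(f"ores-redis-01:{port} {state}")
```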
[21:02:53] oh there is one running [21:02:55] looks good [21:03:04] halfak: deploy the rest? [21:03:11] Should already be running. Let me confirm [21:03:37] OK. Running. [21:03:40] Looks good [21:03:54] Looks like we're getting a little bit behind though. [21:03:57] With just two workers. [21:04:07] apergos: it is working for me. I don't know what's happening, unless you're re-using a name for an instance that already exists, or over quota [21:04:07] yeah, should be fine when we flip 'em all [21:04:23] OK. I'll shut it down while we flip [21:04:50] * halfak gets ready to deploy to the rest of the workers. [21:04:58] well I have 3 instances in this project [21:05:04] so I can see their names [21:05:14] this fourth one definitely has a different name [21:05:27] unless quotas have gotten microscopic I don't see how I can be over [21:05:35] what project? [21:05:38] salt [21:05:44] I deleted the 4th bad one again [21:05:45] YuviPanda, should I do web-02 as well? [21:05:57] halfak: yeah [21:06:01] I want salt-precise with a precise image [21:06:02] halfak: they're all out of rotation now [21:06:54] unless somehow the delete didn't really clean up everything? [21:08:08] PROBLEM - cassandra-a CQL 10.64.32.192:9042 on restbase1004 is CRITICAL: Connection refused [21:08:19] PROBLEM - cassandra-a service on restbase1004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [21:08:54] apergos: works for me, sorry :( [21:09:10] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:09:17] wow [21:09:18] that sucks [21:09:26] I mean thanks for the image but that seriously sucks [21:09:32] Well, one thing -- [21:09:40] The wikitech gui removes an instance as soon as you request to delete it [21:09:46] but of course it actually takes time to delete [21:09:50] right [21:09:53] so if you click delete and then quickly recreate it probably collides [21:09:59] I thought I had waited long enough [21:10:02] that's my only guess :/ [21:10:05] but maybe not. I'll take that as a possibility [21:10:16] anyways, there's an instance now and that's the main thing [21:10:17] thanks! [21:10:32] TAKES SO LONG TO RESTART ARG [21:11:54] * halfak waits in web-02 to restart [21:11:57] *on [21:12:49] YuviPanda, should be done [21:12:52] Testing out -02 [21:13:33] Looks good [21:13:35] \o/ [21:13:40] I'll repool -02 [21:15:13] halfak: all done [21:15:23] * halfak starts up precached again [21:17:26] OK. Looks good. [21:17:40] halfak: \o/ [21:17:48] halfak: 0 downtime deploy I guess :) [21:18:06] \o/ [21:18:20] Thanks YuviPanda. :) [21:18:44] halfak: want to be the one to ceremoniously delete ores-redis-02? :) [21:18:47] (not -01! :D) [21:19:09] Heh. Sure :) [21:19:16] How did it get named -02? [21:19:26] halfak: since we had ores-redis-01 back when it was in the revscoring project [21:19:31] halfak: that's also why our lb is ores-lb-02 [21:19:34] Oh! Gotcha [21:19:51] (03PS12) 10Andrew Bogott: nova-network: have dnsmasq advertise pxe-boot options [puppet] - 10https://gerrit.wikimedia.org/r/259788 [21:20:36] * halfak confirms that redis-02 was built on 07-18 [21:20:49] has served us well for 5 months!
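andrewbogott's guess about the failed instance creation suggests a simple guard: after requesting deletion, poll until the name is really gone before recreating, since the wikitech GUI hides the instance immediately while the backend is still tearing it down. A sketch; list_instances is a hypothetical callable standing in for whatever API enumerates the project's instances:

```python
import time

def wait_until_deleted(name, list_instances, timeout=300, interval=10):
    """Poll until no instance called `name` remains; then recreating is safe.

    `list_instances` is a hypothetical callable returning the current
    instance names for the project. Returns False if the instance still
    exists when `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if name not in list_instances():
            return True
        time.sleep(interval)
    return False
```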
[21:21:05] To Valhalla [21:21:17] halfak: ores-redis-01 ha [21:21:19] s [21:21:21] > /dev/mapper/vd-second--local--disk 21G 48M 19G 1% /srv [21:21:28] while the old one only had about 8G of space in / [21:21:29] no /sv [21:21:32] *srv [21:21:53] (03CR) 10Andrew Bogott: "A dhcp dump from this latest patch is here: https://phabricator.wikimedia.org/P2448" [puppet] - 10https://gerrit.wikimedia.org/r/259788 (owner: 10Andrew Bogott) [21:22:08] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896101 (10coren) @Krenair: I hereby notify myself. :-) More seriously, that doc should probably be amended as I should no longer be a point of contact starting Jan 1. [21:22:09] halfak: alright, I think I can go get nutrition now [21:22:13] Yeah [21:22:13] halfak: sorry it took longer! [21:22:26] No worries. Thanks for your help and zero downtime :) [21:22:42] \o/ [21:22:58] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896111 (10Krenair) I would really prefer we remove both people from it and document exactly which steps need to be taken instead. Previous question still blocks this. [21:25:19] RECOVERY - Disk space on restbase1008 is OK: DISK OK [21:26:53] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896120 (10coren) I can tell you what the Labs side of things is: run maintain-replicas after the database exists to create the views. AFAIK, the prod side involves adding the... [21:27:10] 6operations, 10ops-codfw, 10fundraising-tech-ops: bellatrix hardware RAID predictive failure - https://phabricator.wikimedia.org/T122026#1896122 (10Papaul) Hi Papaul Tshibamba, This is regarding Case Number 4765319505 for Proliant Server DL380P GEN8 Issue : Failed HDD **********PART DETAILS********** Par... [21:31:34] halfak: anyway, thanks for moving the ORES meeting time 'up' :) [21:31:37] * YuviPanda goes afk for real now. [21:31:50] No prob. o/ [21:40:35] MatmaRex, thcipriani: got approval in -releng earlier, here's the proposed patches: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=237668&oldid=237319 [21:42:21] right. i'll be around to verify the deployment [21:44:02] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896194 (10Krenair) Have clarified with @coren what the status of this is (there should be an announcement out which I can verify this from before we even have a chance to add... [21:44:13] puppet master vs puppet server... what are we using in prod? [21:45:18] master [21:45:35] I've only heard puppetmaster [21:46:09] same for labs, and I'd assume frack is the same [21:46:17] does corp have some sort of puppet setup then, cajoel? [21:46:31] Krenair: https://puppetlabs.com/blog/puppet-server-bringing-soa-to-a-puppet-master-near-you [21:46:40] yeah, corp has a small puppet deployment [21:46:50] and we're contemplating upgrading... [21:47:26] And this seems to be the most recent hotness WRT puppet master performance.. [21:47:48] (now using Clojure!)
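On the disk-space comparison earlier in this stretch: shutil.disk_usage reports the same numbers as the quoted df line without shelling out, which is handy for an old-box/new-box check like the one YuviPanda is doing. A minimal sketch; the mount points are the ones mentioned in the log:

```python
import shutil

# Compare free space on / and /srv -- roughly what df showed for the
# new ores-redis-01 (a ~21G volume mounted at /srv).
for mount in ("/", "/srv"):
    usage = shutil.disk_usage(mount)
    print(f"{mount}: {usage.free / 2**30:.1f} GiB free "
          f"of {usage.total / 2**30:.1f} GiB")
```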
[21:50:53] (03PS1) 10Alex Monk: mediawiki: Add wikimania2017 to the wikimania apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/260494 (https://phabricator.wikimedia.org/T122062) [22:02:46] (03PS1) 10Yuvipanda: Revert "Revert "base: Allow https debian repositories!"" [puppet] - 10https://gerrit.wikimedia.org/r/260497 [22:03:06] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "base: Allow https debian repositories!"" [puppet] - 10https://gerrit.wikimedia.org/r/260497 (owner: 10Yuvipanda) [22:05:44] (03Abandoned) 10Yuvipanda: Revert "Revert "base: Allow https debian repositories!"" [puppet] - 10https://gerrit.wikimedia.org/r/260497 (owner: 10Yuvipanda) [22:07:15] (03PS1) 10Yuvipanda: base: Add support for https apt repos [puppet] - 10https://gerrit.wikimedia.org/r/260498 [22:08:39] (03CR) 10Yuvipanda: [C: 032] base: Add support for https apt repos [puppet] - 10https://gerrit.wikimedia.org/r/260498 (owner: 10Yuvipanda) [22:10:37] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1896288 (10Smalyshev) We talked on the meeting about this - once we have Kafka setup that can: - Have parallel consumers... [22:18:06] (03PS1) 10Yuvipanda: k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 [22:19:44] (03CR) 10jenkins-bot: [V: 04-1] k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 (owner: 10Yuvipanda) [22:21:29] (03PS2) 10Yuvipanda: k8s: Have docker service start after the flannel one [puppet] - 10https://gerrit.wikimedia.org/r/260501 [22:36:08] (03PS1) 10Aaron Schulz: Remove $wgMaxSquidPurgeTitles non-setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260506 [22:38:34] (03PS1) 10Aaron Schulz: [WIP] Set initial $wgMaxUserDBWriteDuration value [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260507 [22:40:33] (03PS2) 10Krinkle: Remove unused $wgMaxSquidPurgeTitles setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260506 (owner: 10Aaron Schulz) [22:40:45] (03CR) 10Krinkle: [C: 031] Remove unused $wgMaxSquidPurgeTitles setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260506 (owner: 10Aaron Schulz) [23:00:04] (03PS1) 10Andrew Bogott: Promethium is in the labs subnet now. [puppet] - 10https://gerrit.wikimedia.org/r/260508 [23:13:37] (03PS13) 10Andrew Bogott: nova-network: have dnsmasq advertise pxe-boot options [puppet] - 10https://gerrit.wikimedia.org/r/259788 [23:13:38] (03PS2) 10Andrew Bogott: Promethium is in the labs subnet now. [puppet] - 10https://gerrit.wikimedia.org/r/260508 [23:14:40] (03PS2) 10Dzahn: mediawiki: Add wikimania2017 to the wikimania apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/260494 (https://phabricator.wikimedia.org/T122062) (owner: 10Alex Monk) [23:15:22] (03CR) 10Andrew Bogott: [C: 032] Promethium is in the labs subnet now. 
[puppet] - 10https://gerrit.wikimedia.org/r/260508 (owner: 10Andrew Bogott) [23:15:32] 6operations, 10Traffic: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#1896477 (10AndyRussG) [23:16:16] (03PS3) 10Dzahn: mediawiki: Add wikimania2017 to the wikimania apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/260494 (https://phabricator.wikimedia.org/T122062) (owner: 10Alex Monk) [23:21:15] 6operations, 6WMF-Legal, 7domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1896513 (10Dzahn) @Mschon we want to in the future but we are not there yet for production. (for which one of the issues is that with letsencrypt you can't have SANs and add multiple domains in one cert). The first us... [23:21:36] 6operations, 6WMF-Legal, 7domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#1896515 (10Dzahn) 5Open>3Resolved [23:23:26] (03CR) 10Dzahn: [C: 032] mediawiki: Add wikimania2017 to the wikimania apache ServerAlias [puppet] - 10https://gerrit.wikimedia.org/r/260494 (https://phabricator.wikimedia.org/T122062) (owner: 10Alex Monk) [23:34:58] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896571 (10Dzahn) do you want this -> T96564 for 2017 wiki as well or no? [23:37:18] (03PS1) 10Dzahn: add wikimania2017 wiki to InitSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260514 (https://phabricator.wikimedia.org/T122062) [23:38:59] 6operations, 10Wikimedia-Site-Requests, 5Patch-For-Review: Create the wikimania2017 wiki - https://phabricator.wikimedia.org/T122062#1896584 (10Dzahn) https://bits.wikimedia.org/static/images/project-logos/wikimania2017wiki.png is needed for the logo setting [23:44:19] (03PS1) 10Dzahn: add wikimania2017 wiki to db lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/260515 (https://phabricator.wikimedia.org/T122062)