[00:00:02] Aw, crap. The puppet compiler won't work with labs instances. :-(
[00:02:01] (PS4) coren: Labs: puppetize gridengine [puppet] - https://gerrit.wikimedia.org/r/167126
[00:03:21] (CR) coren: [C: 2] "First debugging step; this should be a (functional) no-op at this time as the actual configuration commands are commented out." [puppet] - https://gerrit.wikimedia.org/r/167126 (owner: coren)
[00:03:25] Coren: yes, i have been wishing for that too
[00:04:01] (PS1) Dzahn: Bugzilla - disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167129
[00:04:24] mutante: I can understand why; having to get the node info from the labs LDAP may not be wise as a dependency from jenkins.
[00:05:06] yes
[00:06:00] It's still sad.
[00:09:29] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:13:38] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 5.481 second response time
[00:15:08] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:16:08] Please keep mw1189 pooled, I'm taking a look right now
[00:16:08] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.368 second response time
[00:20:28] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:20:58] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:21:28] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 6.825 second response time
[00:21:52] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 0.292 second response time
[00:25:01] (PS1) coren: Labs: decouple exec_host and submit_host [puppet] - https://gerrit.wikimedia.org/r/167135
[00:25:18] PROBLEM - puppet last run on analytics1024 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:25:36] ori: I take it you found the issue?
[00:26:08] (CR) coren: [C: 2] "Part of the WIP puppetizing gridengine." [puppet] - https://gerrit.wikimedia.org/r/167135 (owner: coren)
[00:28:35] Coren: nope
[00:29:08] Well, it work up; do you trust it enough I should keep it pooled?
[00:30:04] woke*
[00:38:39] RECOVERY - puppet last run on analytics1024 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[00:38:51] (PS1) coren: Labs: remove $configsource from gridengine::master [puppet] - https://gerrit.wikimedia.org/r/167138
[00:39:42] (CR) coren: [C: 2] "+trivial" [puppet] - https://gerrit.wikimedia.org/r/167138 (owner: coren)
[00:41:48] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:59] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:43:38] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time
[00:43:49] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 0.175 second response time
[00:45:46] (PS1) coren: Labs: fix err in toollabs module structure [puppet] - https://gerrit.wikimedia.org/r/167139
[00:46:39] (CR) coren: [C: 2] "+trivial" [puppet] - https://gerrit.wikimedia.org/r/167139 (owner: coren)
[00:47:23] !log hoo Synchronized php-1.25wmf4/extensions/Wikidata/: Fix ORMTable usage, IE 11 freeze bug and adopt to further core changes (duration: 00m 14s)
[00:47:32] Logged the message, Master
[00:48:36] ... least useful error message evar. "Cannot reassign variable name on [server]". Doesn't say which variable, or where in the manifest.
[00:49:08] aude: Ok, exceptions seem to have gone :)
[00:49:20] Going to be around for a bit longer and then go to bed
[00:50:08] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:50:58] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 0.200 second response time
[00:51:49] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 333 seconds
[00:52:19] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 358 seconds
[00:53:19] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[00:53:49] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:56] (PS1) coren: Labs: minor fixes to toollabs module [puppet] - https://gerrit.wikimedia.org/r/167140
[00:55:45] thanks hoo
[00:55:47] (CR) coren: [C: 2] "I really do." [puppet] - https://gerrit.wikimedia.org/r/167140 (owner: coren)
[00:58:36] (PS1) coren: Revert "Labs: minor fixes to toollabs module" [puppet] - https://gerrit.wikimedia.org/r/167141
[01:07:06] (PS2) coren: Labs: further fixes to toollabs and gridengine [puppet] - https://gerrit.wikimedia.org/r/167141
[01:08:27] (CR) coren: [C: 2] "Moar betta." [puppet] - https://gerrit.wikimedia.org/r/167141 (owner: coren)
[01:09:10] (PS1) Dzahn: add mgmt for radium (formerly cp1001) [dns] - https://gerrit.wikimedia.org/r/167142
[01:11:10] (PS1) coren: Tool Labs: fix c&p fail in hostgroup::collector [puppet] - https://gerrit.wikimedia.org/r/167143
[01:12:55] (PS2) Dzahn: add mgmt for radium (formerly cp1001) [dns] - https://gerrit.wikimedia.org/r/167142
[01:13:33] (CR) coren: [C: 2] "More in a long series of tiny fixes."
[puppet] - https://gerrit.wikimedia.org/r/167143 (owner: coren)
[01:14:40] (PS1) coren: Labs: typo fix [puppet] - https://gerrit.wikimedia.org/r/167144
[01:15:10] The lack of puppet compiler for labs really hurtz.
[01:15:41] (CR) coren: [C: 2] Labs: typo fix [puppet] - https://gerrit.wikimedia.org/r/167144 (owner: coren)
[01:16:44] !log powering up server formerly known as cp1001
[01:16:52] Logged the message, Master
[01:21:22] (PS1) coren: Labs: more fixes to gridengine class [puppet] - https://gerrit.wikimedia.org/r/167146
[01:22:13] (PS1) Dzahn: add public IP for radium [dns] - https://gerrit.wikimedia.org/r/167147
[01:23:23] (PS2) Dzahn: add public IP for radium [dns] - https://gerrit.wikimedia.org/r/167147
[01:24:10] (CR) coren: [C: 2] Labs: more fixes to gridengine class [puppet] - https://gerrit.wikimedia.org/r/167146 (owner: coren)
[01:28:15] (PS1) coren: Labs: further fixes to gridengine puppetization [puppet] - https://gerrit.wikimedia.org/r/167148
[01:29:14] (CR) coren: [C: 2] Labs: further fixes to gridengine puppetization [puppet] - https://gerrit.wikimedia.org/r/167148 (owner: coren)
[01:30:14] (PS1) Dzahn: add radium to DHCP and netboot [puppet] - https://gerrit.wikimedia.org/r/167149
[01:31:17] (PS1) coren: Labs: more explicit paths [puppet] - https://gerrit.wikimedia.org/r/167150
[01:34:27] (CR) coren: [C: 2] Labs: more explicit paths [puppet] - https://gerrit.wikimedia.org/r/167150 (owner: coren)
[01:34:50] (PS1) Dzahn: add node radium to site.pp [puppet] - https://gerrit.wikimedia.org/r/167151
[01:37:13] (PS1) coren: Labs: more gridengine class fixes [puppet] - https://gerrit.wikimedia.org/r/167152
[01:37:33] (PS1) Dzahn: add IPv6 interface to radium [puppet] - https://gerrit.wikimedia.org/r/167153
[01:38:17] (CR) coren: "This is getting old real fast."
[puppet] - https://gerrit.wikimedia.org/r/167152 (owner: coren)
[01:38:32] (CR) coren: [C: 2] Labs: more gridengine class fixes [puppet] - https://gerrit.wikimedia.org/r/167152 (owner: coren)
[01:44:54] (PS1) Dzahn: CNAME tor-eqiad-1 -> radium [dns] - https://gerrit.wikimedia.org/r/167155
[01:49:27] (PS1) coren: Labs: more gridengine puppetness [puppet] - https://gerrit.wikimedia.org/r/167156
[01:50:04] (CR) jenkins-bot: [V: -1] Labs: more gridengine puppetness [puppet] - https://gerrit.wikimedia.org/r/167156 (owner: coren)
[01:52:41] (PS2) coren: Labs: more gridengine puppetness [puppet] - https://gerrit.wikimedia.org/r/167156
[01:53:31] (CR) coren: [C: 2] Labs: more gridengine puppetness [puppet] - https://gerrit.wikimedia.org/r/167156 (owner: coren)
[02:09:44] (PS1) coren: Labs: Tweak for toollabs gridengine config [puppet] - https://gerrit.wikimedia.org/r/167158
[02:20:35] (CR) Krinkle: "The rule for oojs-ui is pointless since that's the default. Gerrit is pushing these twice now (one which will do a no-op sync to a Git rep" [puppet] - https://gerrit.wikimedia.org/r/94063 (owner: Catrope)
[02:21:09] (CR) Jforrester: "Cool follow-up, Timo. :-)" [puppet] - https://gerrit.wikimedia.org/r/94063 (owner: Catrope)
[02:21:55] (CR) coren: [C: 2] "Last one for tonight."
[puppet] - https://gerrit.wikimedia.org/r/167158 (owner: coren)
[02:23:39] (PS1) Dzahn: tor-relay - add firewalling [puppet] - https://gerrit.wikimedia.org/r/167159
[02:24:19] (CR) jenkins-bot: [V: -1] tor-relay - add firewalling [puppet] - https://gerrit.wikimedia.org/r/167159 (owner: Dzahn)
[02:25:48] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:25:53] (PS2) Dzahn: tor-relay - add firewalling [puppet] - https://gerrit.wikimedia.org/r/167159
[02:27:53] (CR) Dzahn: "i realize it has been said before to add ferm rule to roles, but isn't it nicer this way when i don't hardcode any ports and just use the " [puppet] - https://gerrit.wikimedia.org/r/167159 (owner: Dzahn)
[02:33:22] (PS1) coren: Labs: Further tweaks to class gridengine [puppet] - https://gerrit.wikimedia.org/r/167160
[02:33:53] (PS1) Dzahn: apply role::tor on radium [puppet] - https://gerrit.wikimedia.org/r/167161
[02:33:59] (PS1) Krinkle: gerrit: Remove duplicate mirrors [puppet] - https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054)
[02:34:15] (PS2) Krinkle: gerrit: Remove duplicate mirrors [puppet] - https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054)
[02:34:23] (CR) coren: [C: 2] Labs: Further tweaks to class gridengine [puppet] - https://gerrit.wikimedia.org/r/167160 (owner: coren)
[02:39:26] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-17 02:39:26+00:00
[02:39:34] Logged the message, Master
[02:42:04] (PS1) coren: Labs: quoting and explicit bash -c in gridengine [puppet] - https://gerrit.wikimedia.org/r/167163
[02:42:52] (CR) coren: [C: 2] Labs: quoting and explicit bash -c in gridengine [puppet] - https://gerrit.wikimedia.org/r/167163 (owner: coren)
[02:44:09] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:51:53] (CR) Dzahn: [C: 2] Bugzilla - disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167129 (owner: Dzahn)
[02:56:14] (CR) Dzahn: "test rating back to A-. This server is not vulnerable to the POODLE attack because it doesn't support SSL 3." [puppet] - https://gerrit.wikimedia.org/r/167129 (owner: Dzahn)
[02:59:13] (PS1) Dzahn: add annual.wm to misc varnish config [puppet] - https://gerrit.wikimedia.org/r/167165
[02:59:44] (PS2) Dzahn: add annual.wm to misc varnish config [puppet] - https://gerrit.wikimedia.org/r/167165
[03:13:34] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-17 03:13:33+00:00
[03:13:40] Logged the message, Master
[03:33:16] (PS1) Springle: depool es1007 and es1010 while cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/167166
[03:35:08] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 274658 msg: ocg_render_job_queue 547 msg (=500 critical)
[03:36:39] (CR) Springle: [C: 2] depool es1007 and es1010 while cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/167166 (owner: Springle)
[03:36:46] (Merged) jenkins-bot: depool es1007 and es1010 while cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/167166 (owner: Springle)
[03:37:19] PROBLEM - Disk space on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:37:19] PROBLEM - OCG health on ocg1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:37:29] PROBLEM - OCG health on ocg1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:37:39] !log springle Synchronized wmf-config/db-eqiad.php: depool es1007 and es1010 (duration: 00m 09s)
[03:37:48] Logged the message, Master
[03:40:24] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 275058 msg: ocg_render_job_queue 0 msg
[03:40:30] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 275081 msg: ocg_render_job_queue 0 msg
[03:43:30] PROBLEM - puppet last run on es2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:44:30] RECOVERY - puppet last run on es2008 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:48:42] (PS1) Springle: prepare es2006 and es2008 [puppet] - https://gerrit.wikimedia.org/r/167167
[03:49:30] (CR) Springle: [C: 2] prepare es2006 and es2008 [puppet] - https://gerrit.wikimedia.org/r/167167 (owner: Springle)
[03:53:16] !log xtrabackup clone es1007 to es2006
[03:53:21] Logged the message, Master
[03:54:02] !log xtrabackup clone es1010 to es2008
[03:54:07] Logged the message, Master
[04:05:26] !log upgrade es1010 to trusty (clone failed, needs trusty)
[04:05:34] Logged the message, Master
[04:11:37] PROBLEM - MySQL Processlist on db1059 is CRITICAL: CRIT 112 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[04:17:12] !log springle Synchronized wmf-config/db-eqiad.php: reduce load on db1059 (duration: 00m 21s)
[04:17:20] Logged the message, Master
[04:17:38] RECOVERY - MySQL Processlist on db1059 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 1 statistics
[04:18:55] (PS2) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/166393
[04:41:08] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:42:57] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.697 second response time
[04:43:47] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:44:38]
RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 4.268 second response time
[04:49:58] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:51:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 17 04:51:02 UTC 2014 (duration 51m 1s)
[04:51:07] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:51:09] Logged the message, Master
[04:51:57] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 8.670 second response time
[04:52:08] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.721 second response time
[04:52:10] !log xtrabackup clone es1010 to es2008
[04:55:14] morebots: ?
[04:55:14] I am a logbot running on tools-exec-14.
[04:55:14] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[04:55:14] To log a message, type !log .
[05:06:38] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:07:28] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.873 second response time
[05:15:53] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:18:50] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 70599 bytes in 6.060 second response time
[05:22:12] goddamn txstatsd, re-creating dead instances
[05:32:36] How do I make sure that en_US.UTF-8 is available on my instance. Any way via Puppet?
[05:33:16] akosiaris: We may need this for Apertium service (else, result will have ????
instead of UTF-8 chars) :)
[05:37:41] <_joe_> !log depooling both hhvm api appservers
[05:37:48] Logged the message, Master
[06:00:09] (PS1) Chmarkine: Wikitech - disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167169
[06:26:28] RECOVERY - Disk space on ocg1002 is OK: DISK OK
[06:28:57] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:17] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:28] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:37] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:38] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:48] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:48] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:58] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:19] (PS1) Chmarkine: OTRS - disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167170
[06:32:49] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:18] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:58] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:58] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:59] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:17] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:18] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:29] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:07] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:08] PROBLEM - puppet
last run on ms-be2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:20] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:58] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:19] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:29] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:28] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:45:55] (PS1) Chmarkine: RT - Disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167171
[06:46:00] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:46:01] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:31] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:38] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on
mw1208 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:48:08] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:48:38] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:48:49] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:49:21] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:49:28] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:49:45] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:50:08] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:50:18] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:50:20] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:51:18] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:52:25] kart_: good morning, what are you referring to ? backlog doesn't help much...
[06:52:50] heh, sorry, my morning ... I suppose it is already noon over there ?
[06:54:02] akosiaris: yeah, 12:24 :)
[06:55:39] PROBLEM - puppet last run on db1038 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:01:53] akosiaris: બદહ you will see this :)
[07:02:31] ah. irssi+konsole is okay with Gujarati. Wanted to show some junk chars :)
[07:02:42] UTF8 ftw. btw I can do the same. Βλέπεις;
[07:02:48] :)
[07:02:58] Lunch.
Lets talk in PM later.
[07:03:28] ok. ping me when you want. My lunch is in about 3 hours btw
[07:03:48] (CR) Alexandros Kosiaris: [C: 2] tor-relay - add firewalling [puppet] - https://gerrit.wikimedia.org/r/167159 (owner: Dzahn)
[07:04:30] heh
[07:06:18] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[07:10:57] (PS1) Chmarkine: tendril - Disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167172
[07:13:18] RECOVERY - puppet last run on db1038 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[07:46:29] (CR) Alexandros Kosiaris: [C: 2 V: 2] Added initial Debian packaging [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/163577 (owner: KartikMistry)
[07:52:27] (PS3) KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/166393
[07:55:17] (PS2) Yuvipanda: graphite: Add labs archiver script [puppet] - https://gerrit.wikimedia.org/r/166902
[08:04:00] (PS9) KartikMistry: Apertium service configuration for Beta [puppet] - https://gerrit.wikimedia.org/r/165485
[08:06:07] (PS2) KartikMistry: Add .gitreview [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/163614
[08:09:14] (PS3) Yuvipanda: graphite: Add labs archiver script [puppet] - https://gerrit.wikimedia.org/r/166902
[08:10:39] (Abandoned) Giuseppe Lavagetto: labmon: Collect metrics from Toolsbeta [puppet] - https://gerrit.wikimedia.org/r/157620 (owner: Tim Landscheidt)
[08:23:50] (CR) Giuseppe Lavagetto: [C: -1] "A few remarks."
(3 comments) [puppet] - https://gerrit.wikimedia.org/r/166902 (owner: Yuvipanda)
[08:27:52] (CR) Yuvipanda: graphite: Add labs archiver script (3 comments) [puppet] - https://gerrit.wikimedia.org/r/166902 (owner: Yuvipanda)
[08:28:31] (PS4) Yuvipanda: graphite: Add labs archiver script [puppet] - https://gerrit.wikimedia.org/r/166902
[08:42:57] paravoid: no explosions in sight with the thumb prerendering now deployed on all wikis except commons?
[08:44:46] !log uploaded cg3 and lttoolbox on apt.wikimedia.org
[08:44:58] Logged the message, Master
[08:51:07] akosiaris: should apt.wikimedia.org page have description about packages there?
[08:51:19] blank page now.
[08:51:30] no it shouldn't
[08:51:50] http://apt.wikimedia.org/wikimedia/
[08:51:59] akosiaris: okay!
[08:53:14] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 274269 msg: ocg_render_job_queue 1755 msg (=500 critical)
[08:53:31] <_joe_> and ocg we restart again, hey-ho
[08:53:32] oh the joy, ocg malfunctioned again
[08:53:34] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 274316 msg: ocg_render_job_queue 1696 msg (=500 critical)
[08:53:54] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 274339 msg: ocg_render_job_queue 1635 msg (=500 critical)
[08:54:21] _joe_: any news on solving that issue ?
[08:56:10] <_joe_> akosiaris: they are working on it I guess
[08:56:16] <_joe_> !log restarted the ocg cluster
[08:56:23] Logged the message, Master
[09:14:52] morebots: ping
[09:14:52] I am a logbot running on tools-exec-14.
[09:14:53] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[09:14:53] To log a message, type !log .
[09:17:23] <_joe_> !log manually killed long-running stuck processes on ocg1001, moving to the rest of the cluster
[09:17:30] Logged the message, Master
[09:17:32] <_joe_> I was way too optimistic earlier
[09:17:41] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 286316 msg: ocg_render_job_queue 1859 msg (=500 critical)
[09:20:31] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 286620 msg: ocg_render_job_queue 832 msg (=500 critical)
[09:22:21] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 286797 msg: ocg_render_job_queue 0 msg
[09:22:31] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 286827 msg: ocg_render_job_queue 0 msg
[09:22:51] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 286850 msg: ocg_render_job_queue 0 msg
[09:32:32] (CR) JanZerebecki: [C: 1] Wikitech - disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167169 (owner: Chmarkine)
[09:34:03] (CR) JanZerebecki: [C: -1] "This changes wikitech instead of what the commit message says."
[puppet] - https://gerrit.wikimedia.org/r/167170 (owner: Chmarkine)
[09:34:58] (CR) JanZerebecki: [C: 1] RT - Disable SSL3 [puppet] - https://gerrit.wikimedia.org/r/167171 (owner: Chmarkine)
[09:39:16] (CR) Yuvipanda: [C: 2] Make extraction more tolerant of HTML changes [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/164992 (owner: BryanDavis)
[09:39:23] (Merged) jenkins-bot: Make extraction more tolerant of HTML changes [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/164992 (owner: BryanDavis)
[09:51:40] RECOVERY - MySQL Replication Heartbeat on db1061 is OK: OK replication delay 93 seconds
[09:51:58] (PS1) Giuseppe Lavagetto: sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183
[09:52:02] RECOVERY - MySQL Slave Delay on db1061 is OK: OK replication delay 0 seconds
[09:52:36] (CR) jenkins-bot: [V: -1] sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183 (owner: Giuseppe Lavagetto)
[09:55:42] <_joe_> puppet-tabs??? for REALS?
[09:56:48] (PS2) Giuseppe Lavagetto: sudo: create module, remove old files [puppet] - https://gerrit.wikimedia.org/r/167183
[10:02:24] (PS2) Yuvipanda: Include ganglia in standard only for production [puppet] - https://gerrit.wikimedia.org/r/165360
[10:02:49] (CR) Yuvipanda: [C: 1] "Find. Me. Somebody to merge. Find me somebody to merge. Merge. Meeeeeerrrrrggggeeeeeeeee!"
[puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[10:03:43] <_joe_> YuviPanda: I do agree with andrewb
[10:04:04] <_joe_> but let's say this is good enough
[10:04:13] yeah
[10:04:13] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Include ganglia in standard only for production [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[10:04:17] _joe_: ty
[10:16:09] (CR) ArielGlenn: "it might be that '@repo' in the provider is what's borking it; that may be turning into the empty string since there's no instance var def" [puppet] - https://gerrit.wikimedia.org/r/166736 (owner: Ori.livneh)
[10:41:50] !log uploaded apertium 3.3 on apt.wikimedia.org (trusty-wikimedia)
[10:41:58] Logged the message, Master
[13:07:32] (CR) Mark Bergsma: "It seems Bugzilla is using compatnossl as well now" [puppet] - https://gerrit.wikimedia.org/r/167015 (owner: BBlack)
[13:16:24] (PS1) Yuvipanda: androidsdk: Add class to set up wikipedia app build [puppet] - https://gerrit.wikimedia.org/r/167198
[13:35:44] (PS1) coren: Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200
[13:36:27] (CR) jenkins-bot: [V: -1] Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200 (owner: coren)
[13:37:12] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0]
[13:41:05] (PS2) coren: Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200
[13:41:44] (CR) jenkins-bot: [V: -1] Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200 (owner: coren)
[13:45:32] (PS3) coren: Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200
[13:46:18] (CR) jenkins-bot: [V: -1] Labs: clean exec resources in gridengine class [puppet] -
https://gerrit.wikimedia.org/r/167200 (owner: coren)
[13:46:58] (PS4) coren: Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200
[13:48:39] (CR) coren: [C: 2] Labs: clean exec resources in gridengine class [puppet] - https://gerrit.wikimedia.org/r/167200 (owner: coren)
[13:50:40] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[13:51:26] (CR) Alexandros Kosiaris: [C: -1] Added initial Debian packaging (5 comments) [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/166393 (owner: KartikMistry)
[14:06:47] (CR) Alexandros Kosiaris: [C: 2] Beta: Add missing link to init with upstart-job [puppet] - https://gerrit.wikimedia.org/r/166535 (owner: KartikMistry)
[14:10:41] (CR) Alexandros Kosiaris: "3 final comments and this is done" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/165485 (owner: KartikMistry)
[14:11:30] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add .gitreview [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/163614 (owner: KartikMistry)
[14:12:05] (CR) Alexandros Kosiaris: [C: -1] Apertium service configuration for Beta [puppet] - https://gerrit.wikimedia.org/r/165485 (owner: KartikMistry)
[14:13:27] (CR) Alexandros Kosiaris: [C: 2] "Coordinate with me please on when we should merge this" [puppet] - https://gerrit.wikimedia.org/r/163841 (owner: KartikMistry)
[14:15:25] (PS1) Andrew Bogott: /fully/qualify/path to mwscript [puppet] - https://gerrit.wikimedia.org/r/167202
[14:17:55] Coren: ^
[14:18:41] (CR) coren: [C: 2] "More quality."
[puppet] - https://gerrit.wikimedia.org/r/167202 (owner: Andrew Bogott)
[14:21:04] thx
[14:22:00] (PS1) coren: Labs: more fixes to gridengine puppetization [puppet] - https://gerrit.wikimedia.org/r/167203
[14:22:09] (PS2) Yuvipanda: androidsdk: Add class to set up wikipedia app build [puppet] - https://gerrit.wikimedia.org/r/167198
[14:22:59] (CR) coren: [C: 2] Labs: more fixes to gridengine puppetization [puppet] - https://gerrit.wikimedia.org/r/167203 (owner: coren)
[14:26:46] <_joe_> !log load test on the hhvm cluster
[14:26:53] Logged the message, Master
[14:28:10] (CR) Andrew Bogott: [C: 1] graphite: Add labs archiver script [puppet] - https://gerrit.wikimedia.org/r/166902 (owner: Yuvipanda)
[14:39:00] andrewbogott: y u no merge
[14:39:51] Since _joe_ reviewed previously, would be polite to get his sign-off as well
[14:40:11] andrewbogott: ok!
[14:42:29] <_joe_> oh right
[14:42:46] <_joe_> yuvi: I think you should catch the exception one level deeper
[14:42:56] <_joe_> one dir not being copied is not a real issue
[14:43:09] <_joe_> blocking the archive process repeatedly can be
[14:44:16] <_joe_> !log load test on hhvm done
[14:44:23] Logged the message, Master
[14:44:41] (PS1) coren: Labs: more tweaks to the gridengine class [puppet] - https://gerrit.wikimedia.org/r/167209
[14:45:05] (PS3) Yuvipanda: androidsdk: Add class to set up wikipedia app build [puppet] - https://gerrit.wikimedia.org/r/167198
[14:45:37] (CR) coren: [C: 2] Labs: more tweaks to the gridengine class [puppet] - https://gerrit.wikimedia.org/r/167209 (owner: coren)
[14:46:29] (PS3) BBlack: Disable SSLv3 completely [puppet] - https://gerrit.wikimedia.org/r/167015
[14:48:48] (CR) BBlack: [C: -1] "We're switching "compat" to -SSLv3 shortly in I17d41e7208051cf8501b354a0f254f1669c0059a" [puppet] - https://gerrit.wikimedia.org/r/167170 (owner: Chmarkine)
[14:49:13] there's a bunch of these that need -2
[14:49:27] (CR)
10BBlack: [C: 04-1] "We're switching "compat" to -SSLv3 shortly in I17d41e7208051cf8501b354a0f254f1669c0059a" [puppet] - 10https://gerrit.wikimedia.org/r/167171 (owner: 10Chmarkine) [14:49:28] I left them to just abandon them after you merge it :) [14:49:55] well I figure being informative and polite is nice instead of mass-abandon :) [14:50:11] meh, I'm a grumpy old man [14:50:14] ;) [14:50:18] (03CR) 10BBlack: [C: 04-1] "We're switching "compat" to -SSLv3 shortly in I17d41e7208051cf8501b354a0f254f1669c0059a" [puppet] - 10https://gerrit.wikimedia.org/r/167169 (owner: 10Chmarkine) [14:50:30] (03CR) 10BBlack: "We're switching "compat" to -SSLv3 shortly in I17d41e7208051cf8501b354a0f254f1669c0059a" [puppet] - 10https://gerrit.wikimedia.org/r/167172 (owner: 10Chmarkine) [14:50:58] (03CR) 10BBlack: [C: 04-1] tendril - Disable SSL3 [puppet] - 10https://gerrit.wikimedia.org/r/167172 (owner: 10Chmarkine) [14:51:05] (03PS4) 10Yuvipanda: androidsdk: Add class to set up wikipedia app build [puppet] - 10https://gerrit.wikimedia.org/r/167198 [14:51:50] (03PS6) 10Faidon Liambotis: Enable mobile redirect for old Wikisource (http://wikisource.org) [puppet] - 10https://gerrit.wikimedia.org/r/167032 (https://bugzilla.wikimedia.org/69765) (owner: 10MaxSem) [14:51:57] (03CR) 10Faidon Liambotis: [C: 032] Enable mobile redirect for old Wikisource (http://wikisource.org) [puppet] - 10https://gerrit.wikimedia.org/r/167032 (https://bugzilla.wikimedia.org/69765) (owner: 10MaxSem) [14:53:56] (03CR) 10JanZerebecki: [C: 031] Disable SSLv3 completely [puppet] - 10https://gerrit.wikimedia.org/r/167015 (owner: 10BBlack) [14:54:37] (03PS1) 10coren: Labs: more tweaks to gridengine puppetization [puppet] - 10https://gerrit.wikimedia.org/r/167211 [14:55:23] Coren: you should have more descriptive commit messages :P [14:56:06] YuviPanda: There's a point, after the 50th patch or so, where anything more than "more small fixes and tweaks" becomes pointless. 
:-) [14:56:28] heh :) [14:56:28] ok [14:56:46] With luck, though, this will be pretty much the last tweak to the basic skeleton so I'll be able to get more substantive patches in. :_) [14:57:10] (03CR) 10coren: [C: 032] Labs: more tweaks to gridengine puppetization [puppet] - 10https://gerrit.wikimedia.org/r/167211 (owner: 10coren) [14:59:21] Puppetizing gridengine involves working around no resource collection via NFS, constructing config files from bits and pieces, and invoking the actual gridengine configuration commands in at least three different ways. [14:59:43] Once the basic mechanisms for that are in place, the rest becomes "easy". :-) [14:59:44] perhaps some initial testing first and then reviews by others would be good [15:00:02] yeah, disabling puppet on tools and then doing lots of patches that are hard to revert makes me feel a bit scared [15:00:06] toolsbeta! [15:00:34] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet last ran 761581 seconds ago, expected 14400 [15:00:39] mark: Part of the problem is that it's basically impossible to actually test this without having instances actually push stuff onto an NFS directory and actually trying to construct the configuration from it. [15:00:45] Oh, nothing for SWAT? :P [15:01:03] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet last ran 761131 seconds ago, expected 14400 [15:01:09] good thing instances are basically free, Coren ;) [15:01:11] Also, the puppet compiler don't work for labs. :-( [15:01:35] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [15:01:38] * Coren would have saved at least 30 patches if it did. 
[15:02:06] (03PS1) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [15:02:06] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:02:08] (03PS5) 10Yuvipanda: androidsdk: Add class to set up wikipedia app build [puppet] - 10https://gerrit.wikimedia.org/r/167198 [15:03:13] Coren: you've 30+ merged patches?! :( [15:03:39] that's going to be fun to revert if we need to. testing in production on this massive a scale feels bad, Coren. Are we in *that* much of a hurry we can't test at all? [15:04:09] (03PS4) 10BBlack: Disable SSLv3 completely [puppet] - 10https://gerrit.wikimedia.org/r/167015 [15:04:23] YuviPanda: It's literally impossible that this needs to be reverted - absolutely everything right now is a functional noop because it constructs configuration but applies none of it. [15:04:44] (03CR) 10BBlack: [C: 032 V: 032] Disable SSLv3 completely [puppet] - 10https://gerrit.wikimedia.org/r/167015 (owner: 10BBlack) [15:04:59] Disabling puppet wasn't to prevent breaking the running config, but to prevent "omg puppet is breaking randomly" alerts. [15:05:00] still, are we in that much a hurry we can't construct a few well done commits, test / amend them till they work fine, and then apply/merge them? [15:06:33] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Puppet last ran 174705 seconds ago, expected 14400 [15:07:29] Coren: as mark said, instances are free. I strongly think we should do it slowly and test 'em somewhere else... [15:07:57] marktraceur: I think the POODLE fixes qualify as SWAT on friday ;-) [15:08:33] mark: I'm just glad I don't have to do it today [15:08:34] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:08:36] Three days straight [15:09:29] Well, yes. 
It would be possible to construct a new tool labs, I suppose, puppetmaster::self it, configure it with enough of a configuration that all the edge cases are replicated (~20 instances), test to make sure that it does what's expected, then apply on the real tool labs and do this all over again. [15:10:10] toolsbeta! [15:10:31] this won't be the last time we'll need to test things that extensively change things, and if toolsbeta isn't up to the task we should fix it so it is. [15:10:37] Toolsbeta doesn't even resemble tools anymore. (Yes, that's also technical debt) [15:11:33] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 63, down: 0, dormant: 0, excluded: 1, unused: 0 [15:11:43] (that's me adding no-mon to Telia transit) [15:11:50] ah thanks [15:11:58] codfw too, it's coming up [15:12:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 108, down: 0, dormant: 0, excluded: 1, unused: 0 [15:12:13] But also, even if it did, we'd be in the same situation. [15:12:27] The only effective difference would be the number of gerrit patches. [15:12:37] s/patches/changesets/ [15:12:46] and better teamwork [15:12:59] mark: I'm not sure I follow. [15:13:22] the way you're doing it now, doesn't really allow anyone to help you or review your changes [15:13:34] * Coren hm. [15:15:29] (03CR) 10Nemo bis: "Updated: https://wikitech.wikimedia.org/w/index.php?title=HTTPS&diff=131342&oldid=130024" [puppet] - 10https://gerrit.wikimedia.org/r/167015 (owner: 10BBlack) [15:16:27] That's true, but it's not clear how doing it with a self-hosted puppetmaster would help /that/. Well, except that I could emerge with a massive changeset at the end (but which would be known to not break the trial area) [15:17:17] Allright, if you see a benefit to it which I just might not - I'm one patch away from a point where I can reenable puppet with no errors; this is a good time to switch the work approach if any. 
[15:18:09] (03PS1) 10Giuseppe Lavagetto: gerrit: move to module [puppet] - 10https://gerrit.wikimedia.org/r/167215 [15:18:25] (03Abandoned) 10Chmarkine: tendril - Disable SSL3 [puppet] - 10https://gerrit.wikimedia.org/r/167172 (owner: 10Chmarkine) [15:19:18] (03Abandoned) 10Chmarkine: Wikitech - disable SSL3 [puppet] - 10https://gerrit.wikimedia.org/r/167169 (owner: 10Chmarkine) [15:19:55] (03PS1) 10coren: Labs: final fix of gridengine class [puppet] - 10https://gerrit.wikimedia.org/r/167216 [15:20:04] (03Abandoned) 10Chmarkine: OTRS - disable SSL3 [puppet] - 10https://gerrit.wikimedia.org/r/167170 (owner: 10Chmarkine) [15:20:18] YuviPanda: ^^ this makes puppet have no errors anymore, so we can reenable it. [15:20:26] yay to enabled puppt :) [15:20:34] *puppt [15:20:50] Now, if you have the bandwidth and you see a better approach to doing this, I'm all ears. [15:21:22] (03CR) 10coren: [C: 032] "typo fix" [puppet] - 10https://gerrit.wikimedia.org/r/167216 (owner: 10coren) [15:21:30] (03Abandoned) 10Chmarkine: RT - Disable SSL3 [puppet] - 10https://gerrit.wikimedia.org/r/167171 (owner: 10Chmarkine) [15:21:32] write a series of biggish patches, review by a couple of people to catch errors, and then test / merge? [15:21:58] if testing on toolsbeta is too hard, we can just directly test on toollabs. the icinga errors are ok there in that case, since that means we'll have to be extra careful (good!) [15:22:11] (and one of us should set aside some time in the near future to fix toolsbeta) [15:23:05] To be honest, I was hoping that the finalized gridengine/toollabs classes would be used to recreate it. That was a planned side benefit. :-) [15:24:21] Especially since deploying it on tools will only show us that it /maintains/ a config updated, not that it successfully creates it - which yes - is actually done differently in effing gridengine. [15:24:52] yeah, so doesn't that make even more sense to just test it from scratch on toolsbeta to begin with? 
[15:25:04] since otherwise deploying on tools will actually not tell us too much... [15:25:11] since it already is a hand-done config [15:25:29] and we won't know if the reason it works is because the puppetization worked or if some hand-done optimization was masking it not working [15:25:40] if we delete the current toolsbeta exec instances and do it properly there, then we'll know for sure [15:26:16] We'll know "it works to create a new grid" not "It won't break the current grid" - which IMO is the bigger immediate requirement. [15:26:59] ah, hmm. we can't fully do a rolling roll out of the new puppet stuff to tools because it depends on the master [15:27:24] Exactly. [15:27:43] In fact, that's the biggest problem: the master holds the configuration of the nodes. [15:28:09] Hence my careful, careful testing on tools to make sure that generated configuration exactly matches the live one. :-) [15:29:03] hmm [15:29:07] Once I know that the class cannot break the current config, making sure it can also create it from scratch is step 2. :-) [15:29:23] That's why my approach may have seemed backwards. [15:29:27] see this is where emailing out the plan beforehand would've been useful :) [15:29:46] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: puppet fail [15:30:06] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Puppet has 1 failures [15:30:07] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: puppet fail [15:30:13] There's still a (Very minor) error on tools-master, but nothing that justifies not turning puppet back on. [15:30:15] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: puppet fail [15:30:16] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: puppet fail [15:30:16] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: Puppet has 45 failures [15:30:22] I can't think of a simple way to do a rolling restart of the grid. 
[15:30:25] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: puppet fail [15:30:39] so I guess we've to do it this way. [15:30:50] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: puppet fail [15:30:51] !log restarted puppetmasters [15:30:56] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: puppet fail [15:30:57] Logged the message, Master [15:31:00] There isn't one. Gridengine can't do that at all. I hates sun engineering. ;-) [15:31:01] ^(which is probably the source of the random puppetfails above) [15:31:06] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: Puppet has 4 failures [15:31:06] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 36 failures [15:31:06] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Puppet has 8 failures [15:31:11] Coren: so I guess current approach is fine, but still do have more detailed commit messages. Doing a git blame and finding a 'fix stuff' commit message isn't fun. [15:31:15] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Puppet has 10 failures [15:31:15] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: Puppet has 2 failures [15:31:16] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: Puppet has 2 failures [15:31:16] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 3 failures [15:31:26] Coren: an email to ops@ describing what you told me would also be nice :) [15:31:35] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Puppet has 4 failures [15:31:36] PROBLEM - puppet last run on analytics1022 is CRITICAL: CRITICAL: Puppet has 1 failures [15:31:36] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet has 4 failures [15:31:45] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: Puppet has 2 failures [15:31:52] Coren: however, this will still cause problems when we rebuild toolsbeta [15:32:06] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 3 
failures [15:32:08] YuviPanda: That's a nasty side effect of working out exactly how to do some of the really basic stuff. I'm already in a good position to do more substantive patches now - like I told you some time ago. [15:32:22] Coren: since we don't know if the change to re-create from scratch will fuck up the current grid [15:32:26] Well no, because toolsbeta we can simply scrap and rebuild from fresh instances. [15:32:27] and no way to test it. [15:32:39] no, but something that worked on toolsbeta might fuck up tools [15:33:26] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:33:30] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:33:30] Oh, thankfully, we know that can't happen - so long as we don't touch the 'maintain the config' bits, anything we do to 'create it in the first place' can't mess tools up. [15:33:41] hmmm [15:34:12] Yeah, gridengine actually has different mechanisms for both actions. [15:34:38] I should investigate alternatives at some point [15:34:48] To what, gridengine? [15:34:58] but with better commit messages, the current approach seems the easiest way [15:35:02] yeah [15:35:21] That really wouldn't be plausible. It's a bitch to configure, but it works fine and the users have all coded against it. [15:35:38] heh. anyway let's see how this ends way [15:35:50] I'm off for dinner now [15:35:55] Good eats. 
[15:36:21] !log killed tampa config remnants on all cr1/cr2s [15:36:27] Logged the message, Master [15:40:35] <_joe_> I'm away for now, bbl [15:40:41] * _joe_ away [15:40:59] (03PS3) 10Chad: Decom deployment-elastic01 from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 [15:41:01] (03PS3) 10Chad: Configure Elasticsearch for statsd [puppet] - 10https://gerrit.wikimedia.org/r/166690 [15:47:04] (03CR) 10Manybubbles: [C: 031] Configure Elasticsearch for statsd [puppet] - 10https://gerrit.wikimedia.org/r/166690 (owner: 10Chad) [15:47:36] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:47:45] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:47:55] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:48:16] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:48:26] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:48:45] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:48:49] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:48:55] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:48:55] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:48:55] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:48:55] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:49:16] RECOVERY - puppet last run on 
analytics1022 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:49:16] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:49:16] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:49:25] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:49:35] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:49:55] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:49:55] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures [15:50:06] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:15:31] (03CR) 10Jforrester: [C: 031] gerrit: Remove duplicate mirrors [puppet] - 10https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054) (owner: 10Krinkle) [16:33:47] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 281730 msg: ocg_render_job_queue 630 msg (=500 critical) [16:33:58] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 281950 msg: ocg_render_job_queue 694 msg (=500 critical) [16:34:28] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 282373 msg: ocg_render_job_queue 842 msg (=500 critical) [16:58:27] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 295326 msg: ocg_render_job_queue 93 msg [16:58:38] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 295352 msg: ocg_render_job_queue 0 msg [16:59:07] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 295388 msg: ocg_render_job_queue 0 msg [17:12:00] (03PS10) 10KartikMistry: Apertium service configuration for Beta 
[puppet] - 10https://gerrit.wikimedia.org/r/165485 [17:13:03] (03PS1) 10Ori.livneh: HHVM: provision debug symbols for libraries used by HHVM [puppet] - 10https://gerrit.wikimedia.org/r/167230 [17:13:35] (03PS2) 10Ori.livneh: HHVM: provision debug symbols for libraries used by HHVM [puppet] - 10https://gerrit.wikimedia.org/r/167230 [17:16:39] (03CR) 10Ori.livneh: [C: 032] HHVM: provision debug symbols for libraries used by HHVM [puppet] - 10https://gerrit.wikimedia.org/r/167230 (owner: 10Ori.livneh) [17:32:39] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 1 failures [17:42:40] PROBLEM - DPKG on osmium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:43:41] * ori looks at osmium [17:44:22] (03PS1) 10Ori.livneh: hhvm: make hhvm-dump-debug only dump core with '--core' [puppet] - 10https://gerrit.wikimedia.org/r/167236 [17:44:48] RECOVERY - DPKG on osmium is OK: All packages OK [17:49:47] (03CR) 10Faidon Liambotis: HHVM: provision debug symbols for libraries used by HHVM (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167230 (owner: 10Ori.livneh) [17:50:08] ori: I was too late, but you were too fast :) [17:50:35] i can easily amend [17:51:46] (03CR) 10Faidon Liambotis: "Policy says "all but unusual configurations". 
Some packages abuse Recommends but I don't think we should disable them just because of this" [puppet] - 10https://gerrit.wikimedia.org/r/167020 (owner: 10Ori.livneh) [17:51:49] (03PS2) 10Ori.livneh: hhvm: make hhvm-dump-debug only dump core with '--core' [puppet] - 10https://gerrit.wikimedia.org/r/167236 [17:55:24] (03CR) 10KartikMistry: Added initial Debian packaging (033 comments) [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/166393 (owner: 10KartikMistry) [17:55:52] (03PS4) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/166393 [17:59:33] (03PS1) 10Ori.livneh: hhvm: include .::packages rather than require [puppet] - 10https://gerrit.wikimedia.org/r/167239 [18:02:17] are you debugging it in realtime? [18:03:49] paravoid: what do you mean? am i looking at it right this second? no, but i'm about to. or do you mean: is it useful to get bts, etc. from a running, pooled instance? if so: yes. [18:05:08] k [18:41:00] (03CR) 10Ori.livneh: [C: 032] hhvm: include .::packages rather than require [puppet] - 10https://gerrit.wikimedia.org/r/167239 (owner: 10Ori.livneh) [18:43:38] (03PS8) 10Chad: More elasticsearch tools [puppet] - 10https://gerrit.wikimedia.org/r/164270 [18:45:31] (03CR) 10Dzahn: [C: 031] gerrit: Remove duplicate mirrors [puppet] - 10https://gerrit.wikimedia.org/r/167162 (https://bugzilla.wikimedia.org/68054) (owner: 10Krinkle) [18:49:55] (03CR) 10Dzahn: [C: 031] "looks reasonable. did you want this now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [18:50:17] (03CR) 10Manybubbles: [C: 031] More elasticsearch tools (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [18:52:38] (03PS9) 10Chad: More elasticsearch tools [puppet] - 10https://gerrit.wikimedia.org/r/164270 [18:53:22] (03PS3) 10Ori.livneh: hhvm: make hhvm-dump-debug only dump core with '--core' [puppet] - 10https://gerrit.wikimedia.org/r/167236 [18:53:29] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: make hhvm-dump-debug only dump core with '--core' [puppet] - 10https://gerrit.wikimedia.org/r/167236 (owner: 10Ori.livneh) [18:54:39] (03CR) 10Chad: "Not needed urgently, just trying to wrap this stuff up :)" [puppet] - 10https://gerrit.wikimedia.org/r/164270 (owner: 10Chad) [18:54:43] <^demon|away> mutante: ^ [18:55:02] ok:) [19:15:25] (03PS1) 10John F. Lewis: Vanadium access for milimetric [puppet] - 10https://gerrit.wikimedia.org/r/167269 [19:15:43] (03PS2) 10John F. Lewis: Vanadium access for milimetric [puppet] - 10https://gerrit.wikimedia.org/r/167269 [19:17:54] (03CR) 10Dzahn: [C: 031] "technically just fine, pending some kind of approval" [puppet] - 10https://gerrit.wikimedia.org/r/167269 (owner: 10John F. Lewis) [19:20:26] thanks JohnLewis [19:20:51] milimetric: welcome. Just get your manager to +1 it and all is good :) [19:21:57] (03CR) 10Dzahn: "the change in manifests/webserver.pp is related and needed? just cause it might touch a bunch of other things (still?) using it" [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [19:27:29] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:43] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [19:42:24] (03CR) 10Tnegrin: [C: 031] "Approved adding Milimetric" [puppet] - 10https://gerrit.wikimedia.org/r/167269 (owner: 10John F. 
Lewis) [19:43:10] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:54:20] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:01:06] (03PS1) 10Giuseppe Lavagetto: Add https://github.com/facebook/hhvm/pull/4000 [debs/hhvm] - 10https://gerrit.wikimedia.org/r/167302 [20:01:46] <_joe_> paravoid: forgive the format of the patch in my change here ^^ just building a new package for ori in a hurry [20:02:14] <_joe_> (well, not because of him, because of being very late here) [20:02:17] are you pulling my patch too? [20:02:23] <_joe_> tomorrow [20:02:34] <_joe_> I'll pull your patch and fix a few other details [20:02:48] k [20:02:50] <_joe_> now ori wanted to test the package with this patch [20:03:17] <_joe_> which seem to be quite relevant, if you read https://github.com/facebook/hhvm/issues/3999 [20:03:40] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add https://github.com/facebook/hhvm/pull/4000 [debs/hhvm] - 10https://gerrit.wikimedia.org/r/167302 (owner: 10Giuseppe Lavagetto) [20:05:34] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [20:06:10] <_joe_> can someone take a look? [20:06:22] <_joe_> ocg ain't happy ^^ [20:12:48] _joe_: I'm looking [20:13:20] there's a gig of logs in /var/log/upstart. 
no idea if that's normal or not, I'll compare with other ocg boxes [20:15:24] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures [20:16:26] !log disabled puppet on osmium to debug hhvm [20:16:37] Logged the message, Master [20:17:25] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:24:05] (03CR) 10Dzahn: [C: 032] add mgmt for radium (formerly cp1001) [dns] - 10https://gerrit.wikimedia.org/r/167142 (owner: 10Dzahn) [20:26:46] (03CR) 10Dzahn: "eqiad, row A, rack A3, RU 1" [dns] - 10https://gerrit.wikimedia.org/r/167142 (owner: 10Dzahn) [20:27:13] (03PS4) 10Hashar: Get betalabs localsettings.js file from deploy repo (just like prod) [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [20:28:54] (03CR) 10Hashar: [C: 031] "Seems fine for beta cluster. No clue about production parsoid servers though." [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [20:29:39] (03CR) 10Ori.livneh: "@Dzahn: yes, it's needed, but it's safe." 
[puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [20:32:27] chasemp or bblack, can one of you look in /tmp on ocg1002 [20:32:35] This is more 'curiousity' than 'serious problem' [20:32:59] but I'm confused by how there can be a 5G file in a dir when du tells me that the dir only takes up 700M [20:33:07] (03CR) 10Hashar: [V: 031] "On beta deployment-parsoid04.eqiad.wmflabs:" [puppet] - 10https://gerrit.wikimedia.org/r/166610 (owner: 10Subramanya Sastry) [20:34:05] (03PS1) 10Aaron Schulz: Remove temp zone rewrite logic since that zone should be private [puppet] - 10https://gerrit.wikimedia.org/r/167310 [20:34:46] !log removed some stray .zip files from /tmp on ocg1002 [20:34:54] Logged the message, Master [20:35:24] RECOVERY - Disk space on ocg1002 is OK: DISK OK [20:53:06] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 320301 msg: ocg_render_job_queue 764 msg (=500 critical) [20:53:25] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 320329 msg: ocg_render_job_queue 503 msg (=500 critical) [20:54:05] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 320369 msg: ocg_render_job_queue 25 msg [20:54:25] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 320406 msg: ocg_render_job_queue 0 msg [21:09:34] (03CR) 10Dzahn: [C: 032] puppetmaster - use ssl_ciphersuite [puppet] - 10https://gerrit.wikimedia.org/r/153986 (owner: 10Dzahn) [21:13:33] !log graceful Apache on puppetmaster [21:13:39] Logged the message, Master [21:15:55] (03CR) 10Dzahn: [C: 032] "should fix all these: smokeping, librenms, statistics, etherpad, OTRS, torrus, releases" [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [21:18:46] (03PS1) 10Ori.livneh: webserver::php5-mysql: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/167315 [21:19:02] (03CR) 10Dzahn: "nah, more has been converted meanwhile. but it did affect librenms for sure.. 
watching it" [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [21:19:47] !log graceful Apache on netmon1001 [21:19:53] Logged the message, Master [21:20:36] all fine except minor "[warn] Useless use of AllowOverride" [21:23:31] (03PS1) 10Springle: repool es1004, es1007, es1010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167316 [21:23:33] (03PS2) 10Ori.livneh: webserver::php5-mysql: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/167315 [21:23:39] (03CR) 10Ori.livneh: [C: 032 V: 032] webserver::php5-mysql: use require_package() [puppet] - 10https://gerrit.wikimedia.org/r/167315 (owner: 10Ori.livneh) [21:24:04] (03CR) 10Springle: [C: 032] repool es1004, es1007, es1010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167316 (owner: 10Springle) [21:24:12] (03Merged) 10jenkins-bot: repool es1004, es1007, es1010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167316 (owner: 10Springle) [21:25:22] !log springle Synchronized wmf-config/db-eqiad.php: repool es1004, es1007, es1010 (duration: 00m 07s) [21:25:27] Logged the message, Master [21:27:30] (03CR) 10Dzahn: "all apache sites on netmon1001: servermon, librenms, observium, smokeping, torrus. 
either they now have "-SSLv3" or meanwhile they don't h" [puppet] - 10https://gerrit.wikimedia.org/r/153971 (owner: 10Dzahn) [21:40:08] (03PS1) 10Cmjohnson: remove this line [puppet] - 10https://gerrit.wikimedia.org/r/167318 [21:40:59] (03Abandoned) 10Cmjohnson: remove this line [puppet] - 10https://gerrit.wikimedia.org/r/167318 (owner: 10Cmjohnson) [21:41:22] (03PS1) 10Dzahn: stats.wm.org - disabled SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167319 [21:41:38] (03PS2) 10Dzahn: stats.wikimedia.org - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167319 [21:42:36] (03CR) 10Dzahn: [C: 032] "this is the quick fix, should also use ssl_ciphersuite and new Apache module" [puppet] - 10https://gerrit.wikimedia.org/r/167319 (owner: 10Dzahn) [21:46:57] !log graceful Apache on stat1001 [21:47:03] Logged the message, Master [21:51:26] (03PS1) 10Dzahn: icinga - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167323 [21:52:49] (03CR) 10Dzahn: [C: 032] "the other virtual hosts on neon are behind misc-web, icinga is not" [puppet] - 10https://gerrit.wikimedia.org/r/167323 (owner: 10Dzahn) [21:59:51] (03PS1) 10Dzahn: icinga-admin - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167324 [22:00:22] (03CR) 10Dzahn: [C: 032] icinga-admin - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167324 (owner: 10Dzahn) [22:02:11] (03PS1) 10Dzahn: delete blog apache site [puppet] - 10https://gerrit.wikimedia.org/r/167325 [22:05:09] (03PS9) 10Ori.livneh: iegreview: Create module and role for deployment [puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [22:05:57] (03CR) 10Ori.livneh: "@mutante: I applied the webserver::php5-mysql change in a separate patch, so it's no longer part of this change." 
[puppet] - 10https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: 10BryanDavis) [22:09:07] !log graceful Apache on neon - icinga and tendril done, ishmael = misc-web [22:09:13] Logged the message, Master [22:12:37] (03PS1) 10Dzahn: svn - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167327 [22:13:40] (03CR) 10Dzahn: [C: 032] svn - disable SSLv3 [puppet] - 10https://gerrit.wikimedia.org/r/167327 (owner: 10Dzahn) [22:13:48] (03PS1) 10Dzahn: svn - retab Apache template [puppet] - 10https://gerrit.wikimedia.org/r/167328 [22:14:20] (03CR) 10Dzahn: [C: 032] svn - retab Apache template [puppet] - 10https://gerrit.wikimedia.org/r/167328 (owner: 10Dzahn) [22:17:47] (03PS1) 10Dzahn: tendril - set SSL/TLS protocol versions [puppet] - 10https://gerrit.wikimedia.org/r/167330 [22:18:46] !log graceful Apache on antimony - svn fixed, gitblit behind varnish [22:18:47] hello [22:18:54] Logged the message, Master [22:19:35] I don't understand, speak to me in Spanish [22:21:07] hello carmela [22:22:08] (03Abandoned) 10Dzahn: tendril - set SSL/TLS protocol versions [puppet] - 10https://gerrit.wikimedia.org/r/167330 (owner: 10Dzahn) [22:27:33] (03PS1) 10Dzahn: Revert "svn - disable SSLv3" [puppet] - 10https://gerrit.wikimedia.org/r/167332 [22:27:38] (03CR) 10jenkins-bot: [V: 04-1] Revert "svn - disable SSLv3" [puppet] - 10https://gerrit.wikimedia.org/r/167332 (owner: 10Dzahn) [22:32:34] why does 'git commit' suddenly open nano instead of vim? [22:32:38] (03PS2) 10Dzahn: Revert "svn - disable SSLv3" [puppet] - 10https://gerrit.wikimedia.org/r/167332 [22:32:58] andrewbogott: echo $VISUAL [22:33:05] on which host? [22:33:18] mutante: well, just now, on two new labs instances [22:33:27] andrewbogott: export VISUAL=vim [22:33:31] then try again [22:33:58] hm, yes. [22:34:03] So, I wonder what changed... [22:34:07] but why it changed.. i dunno [22:34:23] lost the environment variables? 
[22:35:53] Maybe something to do with the new labs base image :(
[22:36:01] (CR) Dzahn: [C: 2] "this made it duplicate. the protocol versions are already added here by @ssl_settings.join further below" [puppet] - https://gerrit.wikimedia.org/r/167332 (owner: Dzahn)
[22:36:12] andrewbogott: that would make sense, yea, different image, different defaults
[22:36:35] maybe let puppet fix it?
[22:36:57] I'll have to test more to see where it's happening and for who
[22:38:35] update-alternatives --config editor
[22:39:34] or you can put the export command into the user's .profile file
[22:41:00] yeah, easy enough to fix… but mysterious
[22:50:57] (CR) Dzahn: [C: 2] add public IP for radium [dns] - https://gerrit.wikimedia.org/r/167147 (owner: Dzahn)
[22:52:55] (CR) Dzahn: [C: 2] add radium to DHCP and netboot [puppet] - https://gerrit.wikimedia.org/r/167149 (owner: Dzahn)
[22:54:09] (PS2) Dzahn: add node radium to site.pp [puppet] - https://gerrit.wikimedia.org/r/167151
[22:54:51] (CR) Dzahn: [C: 2] add node radium to site.pp [puppet] - https://gerrit.wikimedia.org/r/167151 (owner: Dzahn)
[22:56:04] (PS2) Dzahn: add IPv6 interface to radium [puppet] - https://gerrit.wikimedia.org/r/167153
[22:56:09] (CR) jenkins-bot: [V: -1] add IPv6 interface to radium [puppet] - https://gerrit.wikimedia.org/r/167153 (owner: Dzahn)
[22:56:31] (PS3) Dzahn: add IPv6 interface to radium [puppet] - https://gerrit.wikimedia.org/r/167153
[23:02:12] (PS5) Dzahn: puppetmaster Apache template - retab [puppet] - https://gerrit.wikimedia.org/r/153987
[23:05:15] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[23:05:35] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
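Note on the 22:32 to 22:41 editor exchange: the behaviour is consistent with git's editor lookup order, which is roughly GIT_EDITOR, then the core.editor config value, then VISUAL (skipped on a dumb terminal), then EDITOR, then a compiled-in default. The sketch below models that order in Python for illustration only. The function name `pick_git_editor` and its parameters are hypothetical, and the claim that Debian-derived builds use the `editor` alternative (the thing `update-alternatives --config editor` changes) as the compiled-in default is an assumption that would explain nano appearing on a fresh image without VISUAL set.

```python
import os


def pick_git_editor(env, core_editor=None, compiled_default="vi"):
    """Model (not git itself) of which editor `git commit` would launch.

    env              -- mapping of environment variables
    core_editor      -- value of git's core.editor setting, if any
    compiled_default -- build-time fallback; "vi" upstream, assumed to be
                        "editor" on Debian-derived systems
    Returns the editor command, or None when git would refuse to start one.
    """
    terminal = env.get("TERM")
    terminal_is_dumb = terminal is None or terminal == "dumb"

    editor = env.get("GIT_EDITOR")
    if editor is None and core_editor is not None:
        editor = core_editor
    if editor is None and not terminal_is_dumb:
        # VISUAL is only consulted on a capable terminal.
        editor = env.get("VISUAL")
    if editor is None:
        editor = env.get("EDITOR")
    if editor is None and terminal_is_dumb:
        # No usable editor on a dumb terminal.
        return None
    if editor is None:
        editor = compiled_default
    return editor


if __name__ == "__main__":
    # Show what this model predicts for the current shell.
    print(pick_git_editor(dict(os.environ)))
```

Under this model, a new instance with neither VISUAL nor EDITOR exported falls through to the compiled-in default, which is why `export VISUAL=vim` (per session or in `.profile`) and `update-alternatives --config editor` (system-wide) both change the outcome.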
[23:07:44] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[23:08:15] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[23:09:17] (PS6) Dzahn: puppetmaster Apache template - retab [puppet] - https://gerrit.wikimedia.org/r/153987
[23:09:47] (Abandoned) Dzahn: puppetmaster Apache template - retab [puppet] - https://gerrit.wikimedia.org/r/153987 (owner: Dzahn)
[23:18:47] (PS1) Dzahn: puppetmaster - retab Apache template [puppet] - https://gerrit.wikimedia.org/r/167340
[23:23:30] (CR) Dzahn: [C: 2] "..while touching it because i introduced spaces with earlier change.. noop" [puppet] - https://gerrit.wikimedia.org/r/167340 (owner: Dzahn)
[23:25:05] (CR) Dzahn: [C: 2] add IPv6 interface to radium [puppet] - https://gerrit.wikimedia.org/r/167153 (owner: Dzahn)
[23:28:56] (CR) Dzahn: [C: 2] CNAME tor-eqiad-1 -> radium [dns] - https://gerrit.wikimedia.org/r/167155 (owner: Dzahn)
[23:28:57] !log updating hhvm app servers to 3.3.0-20140925+wmf3
[23:29:03] Logged the message, Master
[23:29:22] (PS2) Dzahn: CNAME tor-eqiad-1 -> radium [dns] - https://gerrit.wikimedia.org/r/167155
[23:33:49] (CR) Dzahn: [C: 2] "ori: thanks" [puppet] - https://gerrit.wikimedia.org/r/165231 (https://bugzilla.wikimedia.org/71597) (owner: BryanDavis)
[23:35:08] !log pooled mw1114 (hhvm api server) to test whether new package resolves overload behavior
[23:35:13] Logged the message, Master
[23:37:59] (PS6) Dzahn: iegreview: Provision iegreview application [puppet] - https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) (owner: BryanDavis)
[23:40:02] (CR) Dzahn: [C: 2] iegreview: Provision iegreview application [puppet] - https://gerrit.wikimedia.org/r/165232 (https://bugzilla.wikimedia.org/71597) (owner: BryanDavis)
[23:45:05] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet has 1 failures
[23:46:36] !log Ran trebuchet to create initial tag for iegreview/iegreview
[23:46:42] Logged the message, Master
[23:46:48] we should see RECOVER
[23:46:54] puppet fine after that
[23:47:04] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[23:47:35] (CR) Dzahn: [C: 2] iegreview: Put iegreview.wikimedia.org behind misc-web-lb.eqiad [dns] - https://gerrit.wikimedia.org/r/165236 (https://bugzilla.wikimedia.org/71597) (owner: BryanDavis)