[00:01:28] RECOVERY - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 745 bytes in 0.012 second response time
[00:06:58] RoanKattouw: ^ see the last icinga-wm :)
[00:07:05] Yay :)
[00:07:06] mutante: Yay.
[00:07:07] that was your monitoring fix
[00:07:44] <^d> Ugh, stupid python.
[00:07:45] (CR) Dzahn: "<+icinga-wm> RECOVERY - LVS HTTP IPv4 on citoid.svc.eqiad.wmnet is OK: HTTP" [puppet] - https://gerrit.wikimedia.org/r/165731 (owner: Catrope)
[00:08:16] <^d> mutante: Did you end up finding that thing about gmond you mentioned?
[00:08:58] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:09:08] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:11:47] (CR) Dzahn: [C: 1] Kill old (skins|live)-1.5 stuff [puppet] - https://gerrit.wikimedia.org/r/162768 (owner: MaxSem)
[00:13:08] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 67713 bytes in 2.422 second response time
[00:13:48] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time
[00:15:28] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=72%):
[00:20:09] (PS2) Ori.livneh: Turn off spammy message cache log [mediawiki-config] - https://gerrit.wikimedia.org/r/167731 (owner: Aaron Schulz)
[00:20:13] (CR) Ori.livneh: [C: 2] Turn off spammy message cache log [mediawiki-config] - https://gerrit.wikimedia.org/r/167731 (owner: Aaron Schulz)
[00:20:58] (Merged) jenkins-bot: Turn off spammy message cache log [mediawiki-config] - https://gerrit.wikimedia.org/r/167731 (owner: Aaron Schulz)
[00:25:26] (CR) Dzahn: "php5-memcached:" [puppet] - https://gerrit.wikimedia.org/r/158023 (owner: Reedy)
[00:30:59] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[00:31:59] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[00:32:09] !log ori Synchronized wmf-config: Id01fe7aac: Turn off spammy message cache log (duration: 00m 05s)
[00:32:14] Logged the message, Master
[00:33:41] (PS1) Ori.livneh: hhvm-dump-debug: use quickstack [puppet] - https://gerrit.wikimedia.org/r/167733
[00:34:20] hi guys, I've got a user in the help channel reporting an issue with the Atom feed on Google Chrome. Any known issues or bug IDs to file the issue under ?
[00:35:55] NotASpy: somewhere under https://bugzilla.wikimedia.org/show_bug.cgi?id=3646 seems reasonable to me
[00:36:32] Product: MediaWiki Component: Special pages
[00:36:51] thanks
[00:37:39] yw
[00:38:35] Hey, any of you guys know anything about a payments logrotate issue about seven hours ago?
[00:39:53] It's just that it's still going on.
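The Apache HTTP, HHVM rendering, and LVS lines above are icinga HTTP checks: each fetches a URL with a timeout and reports status, byte count, and response time. A minimal sketch of that logic (the target URL is illustrative; this is not the production plugin):

```python
import time
import urllib.request

def check_http(url, timeout=10):
    """Minimal check_http-style probe: OK when the fetch succeeds,
    CRITICAL on error or timeout."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
        elapsed = time.time() - start
        return "HTTP OK: HTTP/1.1 %d %s - %d bytes in %.3f second response time" % (
            resp.status, resp.reason, len(body), elapsed)
    except Exception:
        # Connect/read failures; icinga reports these as a socket timeout.
        return "CRITICAL - Socket timeout after %d seconds" % timeout

# Illustrative target, mirroring the citoid LVS check above.
print(check_http("http://citoid.svc.eqiad.wmnet/"))
```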
[00:40:59] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:08] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:45:08] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 67713 bytes in 0.165 second response time
[00:45:58] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time
[00:57:18] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:08] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67695 bytes in 0.251 second response time
[01:05:05] (PS2) Ori.livneh: hhvm-dump-debug: use quickstack [puppet] - https://gerrit.wikimedia.org/r/167733
[01:05:16] (CR) Ori.livneh: [C: 2 V: 2] hhvm-dump-debug: use quickstack [puppet] - https://gerrit.wikimedia.org/r/167733 (owner: Ori.livneh)
[01:07:46] (PS1) Ori.livneh: osmium: add appserver role [puppet] - https://gerrit.wikimedia.org/r/167735
[01:10:01] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet has 2 failures
[01:11:10] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[01:16:41] (PS20) Dzahn: turn RT from misc/* into puppet module [puppet] - https://gerrit.wikimedia.org/r/116064
[01:18:14] (CR) Ori.livneh: [C: 2] osmium: add appserver role [puppet] - https://gerrit.wikimedia.org/r/167735 (owner: Ori.livneh)
[01:27:49] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[01:28:32] springle, you saw the labs-l@ thread on DB issues with the enwiki replicas?
[01:28:52] Eloquence: Coren pointed it out, yes
[01:30:07] (CR) MZMcBride: "This seems fine." [mediawiki-config] - https://gerrit.wikimedia.org/r/167555 (https://bugzilla.wikimedia.org/72239) (owner: Glaisher)
[01:30:48] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: puppet fail
[01:31:58] (PS21) Dzahn: turn RT from misc/* into puppet module [puppet] - https://gerrit.wikimedia.org/r/116064
[01:36:11] springle: I'm guessing that's what you were concerned about earlier about replication missing beats when the server OOM'ed?
[01:37:19] Coren: right, i'm wondering about https://mariadb.atlassian.net/browse/MDEV-6551
[01:38:00] we've already upgraded, so theoretically it's recent past data affected
[01:38:21] presently running sync on labsdb1001, after which we'll know more.
[01:38:48] * Coren nods.
[01:39:03] Yeah, that bug report sounds like what we're observing, though we've no way to be sure.
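The sync springle mentions is about detecting replica drift of the kind MDEV-6551 can cause. One simple way to spot drift, shown here as a rough sketch (hostnames and credentials hypothetical; this is not the tool actually used), is to compare server-side table checksums between master and replica:

```python
import pymysql  # assumed client library; any MySQL driver works the same way

def table_checksum(host, table):
    """Return the server-computed checksum of a table.
    CHECKSUM TABLE is native to MySQL/MariaDB."""
    conn = pymysql.connect(host=host, user="repl_check",
                           password="...", database="enwiki")  # hypothetical
    try:
        with conn.cursor() as cur:
            cur.execute("CHECKSUM TABLE " + table)
            return cur.fetchone()[1]
    finally:
        conn.close()

# Note this races with live writes; real drift checkers (e.g.
# pt-table-checksum) chunk and lock to get a consistent view.
for table in ("page", "revision"):
    if table_checksum("db-master.example", table) != table_checksum("labsdb1001.example", table):
        print("possible drift in table", table)
```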
[01:49:47] (PS22) Dzahn: turn RT from misc/* into puppet module [puppet] - https://gerrit.wikimedia.org/r/116064
[01:56:08] PROBLEM - puppet last run on wtp1010 is CRITICAL: CRITICAL: puppet fail
[01:56:18] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 37 failures
[01:56:18] PROBLEM - puppetmaster backend https on strontium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error
[01:56:39] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 81 failures
[01:56:39] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 9 failures
[01:56:48] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: puppet fail
[01:56:48] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 62 failures
[01:56:48] PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: puppet fail
[01:56:49] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail
[01:56:49] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 38 failures
[01:56:49] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 19 failures
[01:56:49] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: puppet fail
[01:56:49] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 52 failures
[01:56:58] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 42 failures
[01:56:58] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: puppet fail
[01:56:58] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: Puppet has 47 failures
[01:56:58] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 29 failures
[01:56:58] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 32 failures
[01:56:59] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: puppet fail
[01:56:59] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: puppet fail
[01:56:59] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 25 failures
[01:56:59] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Puppet has 20 failures
[01:57:00] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: puppet fail
[01:57:00] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: puppet fail
[01:57:01] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 60 failures
[01:57:01] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 74 failures
[01:57:02] PROBLEM - puppet last run on mw1064 is CRITICAL: CRITICAL: Puppet has 37 failures
[01:57:02] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Puppet has 12 failures
[01:57:03] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:57:03] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 25 failures
[01:57:08] PROBLEM - puppet last run on search1012 is CRITICAL: CRITICAL: Puppet has 50 failures
[01:57:11] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: puppet fail
[01:57:11] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: puppet fail
[01:57:11] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 49 failures
[01:57:11] PROBLEM - puppet last run on cp1040 is CRITICAL: CRITICAL: Puppet has 17 failures
[01:57:11] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 50 failures
[01:57:18] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet has 9 failures
[01:57:18] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail
[01:57:19] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: puppet fail
[01:57:19] PROBLEM - puppet last run on search1004 is CRITICAL: CRITICAL: Puppet has 43 failures
[01:57:19] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 61 failures
[01:57:19] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 82 failures
[01:57:19] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 27 failures
[01:57:19] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 24 failures
[01:57:20] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 22 failures
[01:57:31] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail
[01:57:32] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: puppet fail
[01:57:33] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail
[01:57:33] PROBLEM - puppet last run on mw1066 is CRITICAL: CRITICAL: Puppet has 55 failures
[01:57:33] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: puppet fail
[01:57:34] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[01:57:34] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 19 failures
[01:57:39] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 59 failures
[01:57:48] PROBLEM - puppet last run on virt1008 is CRITICAL: CRITICAL: Puppet has 29 failures
[01:57:48] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: puppet fail
[01:57:50] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 21 failures
[01:57:50] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: puppet fail
[01:57:50] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: puppet fail
[01:57:50] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
[01:57:50] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: puppet fail
[01:57:50] PROBLEM - puppet last run on analytics1021 is CRITICAL: CRITICAL: puppet fail
[01:57:50] PROBLEM - puppet last run on ssl1003 is CRITICAL: CRITICAL: puppet fail
[01:57:51] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Puppet has 19 failures
[01:57:58] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 23 failures
[01:57:59] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: puppet fail
[01:57:59] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 75 failures
[01:58:08] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 73 failures
[01:58:14] PROBLEM - puppet last run on osm-cp1001 is CRITICAL: CRITICAL: Puppet has 23 failures
[01:58:14] PROBLEM - puppet last run on es1004 is CRITICAL: CRITICAL: puppet fail
[01:58:15] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail
[01:58:15] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: puppet fail
[01:58:15] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 63 failures
[01:58:15] PROBLEM - puppet last run on db1037 is CRITICAL: CRITICAL: puppet fail
[01:58:15] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: puppet fail
[01:58:15] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail
[01:58:16] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[01:58:18] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail
[01:58:18] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 70 failures
[01:58:24] !log restarting apache on puppetmaster, temp. stopping icinga-wm
[01:58:33] Logged the message, Master
[01:58:47] :(
[01:59:10] Unexpected error in mod_passenger: Could not send data to the ApplicationPool server: write() failed: Broken pipe (32)
[01:59:26] in 'virtual Passenger::Application::SessionPtr Passenger::ApplicationPoolServer::Client::get(const Passenger::PoolOptions&)' (ApplicationPoolServer.h:402)
[01:59:31] in 'int Hooks::handleRequest(request_rec*)' (Hooks.cpp:523)
[02:07:26] jgage: arr.. 500 Internal Server Error
[02:07:28] still?
[02:07:58] at the same time i see it compiling catalogs
[02:09:34] " resuming normal operations"
[02:09:44] should start to recover soonish
[02:10:34] oh, it's strontium
[02:12:08] !log restarting Apache on strontium
[02:12:17] Logged the message, Master
[02:20:03] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[02:20:03] RECOVERY - puppet last run on search1014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:20:03] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures
[02:20:04] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[02:20:13] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[02:20:13] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[02:20:13] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[02:20:13] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[02:20:21] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[02:20:21] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[02:20:22] RECOVERY - puppet last run on es1009 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[02:20:22] RECOVERY - puppet last run on analytics1019 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[02:20:26] (PS23) Dzahn: turn RT from misc/* into puppet module [puppet] - https://gerrit.wikimedia.org/r/116064
[02:20:31] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[02:20:41] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[02:20:41] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 65 seconds ago with 0 failures
[02:20:42] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures
[02:20:42] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[02:20:42] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[02:20:42] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:20:42] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[02:20:43] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[02:23:58] (CR) Dzahn: [C: 2] "for fun before we switch to phab :) the only differences are expected ones, the pathes to renamed files and role name. compiler link: ht" [puppet] - https://gerrit.wikimedia.org/r/116064 (owner: Dzahn)
[02:26:52] !log rebooting iron for upgrades
[02:27:00] Logged the message, Master
[02:33:35] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-21 02:33:35+00:00
[02:33:42] Logged the message, Master
[02:44:24] PROBLEM - NTP on iron is CRITICAL: NTP CRITICAL: Offset unknown
[02:47:33] RECOVERY - NTP on iron is OK: NTP OK: Offset 0.0004962682724 secs
[02:55:33] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqiad:xe-4/2/1 (Giglinx/Zayo, ETYX/084858//ZYO) {#1062} [10Gbps MPLS]
[03:06:47] !log LocalisationUpdate completed (1.25wmf4) at 2014-10-21 03:06:47+00:00
[03:06:57] Logged the message, Master
[04:10:20] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:21:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Oct 21 04:21:19 UTC 2014 (duration 21m 18s)
[04:21:29] Logged the message, Master
[04:24:34] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[04:49:20] (PS1) Chad: Don't enable ganglia config on beta [puppet] - https://gerrit.wikimedia.org/r/167744
[04:57:52] <^d> mutante: It was https://gerrit.wikimedia.org/r/#/c/165360/ :)
[04:57:52] <^d> elasticsearch::ganglia notifies gmond, which breaks beta instances.
[04:57:52] <^d> 167744 should work around it, perhaps.
[05:19:59] (PS7) Chad: Adding tools for banning/unbanning an ES node [puppet] - https://gerrit.wikimedia.org/r/164617
[05:20:01] (PS7) Chad: Another es-tool function: restart a node the fast & easy way [puppet] - https://gerrit.wikimedia.org/r/164401
[05:20:03] (PS1) Chad: Improve error handling [puppet] - https://gerrit.wikimedia.org/r/167745
[05:30:17] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:30:55] (PS8) Chad: Adding tools for banning/unbanning an ES node [puppet] - https://gerrit.wikimedia.org/r/164617
[05:42:11] (PS1) KartikMistry: Beta: Change Apertium-APY port to 80 [puppet] - https://gerrit.wikimedia.org/r/167747
[05:42:54] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1
[05:42:54] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1
[05:42:54] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1
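The ElasticSearch health checks above read the standard cluster health endpoint; the status (red/yellow/green) and shard counters in the alert text map directly onto fields of its JSON response. A rough sketch of such a check (host, port, and the OK/WARNING/CRITICAL mapping are illustrative):

```python
import json
import urllib.request

def check_es_health(host="localhost", port=9200):
    # /_cluster/health is the standard Elasticsearch cluster health API.
    url = "http://%s:%d/_cluster/health" % (host, port)
    with urllib.request.urlopen(url, timeout=10) as resp:
        health = json.load(resp)
    summary = ": ".join("%s: %s" % (k, health[k]) for k in (
        "status", "timed_out", "number_of_nodes", "number_of_data_nodes",
        "active_primary_shards", "active_shards", "relocating_shards",
        "initializing_shards", "unassigned_shards"))
    # green -> OK, yellow -> WARNING, red -> CRITICAL, as in the output above.
    state = {"green": "OK", "yellow": "WARNING"}.get(health["status"], "CRITICAL")
    return "%s - elasticsearch is running. %s" % (state, summary)

print(check_es_health())
```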
[05:43:54] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:43:54] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:43:54] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[05:48:05] (PS2) KartikMistry: Beta: Change Apertium-APY port to 80 [puppet] - https://gerrit.wikimedia.org/r/167747
[05:52:04] (PS1) KartikMistry: Beta: No CI for Apertium [puppet] - https://gerrit.wikimedia.org/r/167748
[06:04:34] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:13:24] (PS1) KartikMistry: Use port 80 for apertium-apy service [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167749
[06:26:45] RECOVERY - Disk space on ocg1002 is OK: DISK OK
[06:28:26] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail
[06:28:44] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: puppet fail
[06:28:44] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: puppet fail
[06:28:45] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: puppet fail
[06:28:45] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail
[06:29:44] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:44] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:05] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:17] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:35] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:35] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:35] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:44] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:44] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:45] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:14] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:26] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:45:14] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:45:14] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:34] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:45:44] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:48] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:14] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:46:24] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:46:25] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:34] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:44] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:55] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:47:04] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:27] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1
[06:49:25] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[06:50:20] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:55:08] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:59:09] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[07:00:08] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:02:27] (PS5) Giuseppe Lavagetto: graphite: Add labs archiver script [puppet] - https://gerrit.wikimedia.org/r/166902 (owner: Yuvipanda)
[07:02:37] (CR) Giuseppe Lavagetto: [C: 2 V: 2] graphite: Add labs archiver script [puppet] - https://gerrit.wikimedia.org/r/166902 (owner: Yuvipanda)
[07:03:19] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 350829 msg: ocg_render_job_queue 1077 msg (=500 critical)
[07:03:38] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 350846 msg: ocg_render_job_queue 986 msg (=500 critical)
[07:04:10] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 351025 msg: ocg_render_job_queue 751 msg (=500 critical)
[07:04:24] <_joe_> so, it's clear this limit is way too low
[07:05:08] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 286 seconds ago with 0 failures
[07:06:51] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 352099 msg: ocg_render_job_queue 51 msg
[07:07:20] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 352180 msg: ocg_render_job_queue 0 msg
[07:07:41] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 352212 msg: ocg_render_job_queue 0 msg
[07:10:10] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail
[07:12:50] (PS1) Yuvipanda: graphite: Fix location of archiver class [puppet] - https://gerrit.wikimedia.org/r/167750
[07:13:13] _joe_: if you're around, can you merge ^? Puppet fail in the last patch...
[07:13:34] <_joe_> oh lol
[07:13:40] <_joe_> didn't notice, sorry
[07:13:48] (CR) Giuseppe Lavagetto: [C: 2 V: 2] graphite: Fix location of archiver class [puppet] - https://gerrit.wikimedia.org/r/167750 (owner: Yuvipanda)
[07:13:50] _joe_: me neither :)
[07:14:16] <_joe_> yeah I usually take a better look at consequences of my merges
[07:16:11] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[07:18:59] _joe_: is fine now. thanks!
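The labs archiver script merged above presumably prunes Graphite data for labs instances that no longer exist. A toy sketch of that idea (the whisper paths, layout, and source of the active-instance list are all assumptions, not the actual script):

```python
import os
import shutil

WHISPER_ROOT = "/var/lib/carbon/whisper/instances"  # assumed layout
ARCHIVE_ROOT = "/srv/whisper-archive"

def archive_stale(active_hosts):
    """Move per-host whisper metric trees whose host is no longer active."""
    os.makedirs(ARCHIVE_ROOT, exist_ok=True)
    for host in os.listdir(WHISPER_ROOT):
        if host in active_hosts:
            continue
        shutil.move(os.path.join(WHISPER_ROOT, host),
                    os.path.join(ARCHIVE_ROOT, host))

# The real script would obtain this list from the infrastructure;
# hardcoded here purely for illustration.
archive_stale({"deployment-logstash1", "deployment-elastic06"})
```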
[07:20:08] <_joe_> YuviPanda: I am saying, in this case, I am merely a proxy for you, as I trust your judgement and operating ability on this
[07:20:20] <_joe_> YuviPanda: we really do need to be able to write hiera data for labs
[07:20:28] yeah...
[07:20:51] <_joe_> if we want to use a mysql db instead of the yaml files as it's easier to manage from php
[07:20:52] _joe_: I need to check how far Horizon is, and if it is very far, consider a hiera solution...
[07:20:54] <_joe_> fine
[07:21:01] <_joe_> it is very far
[07:21:02] what's Horizon?
[07:21:09] <_joe_> if it's farther than 1 week
[07:21:11] <_joe_> :)
[07:21:24] ori: https://github.com/openstack/horizon
[07:21:28] _joe_: hah :)
[07:22:13] _joe_: well, if we just need a canonical place for appropriate people to edit YAML files and have that be available from puppet, I guess a simple Namespace on wikitech wiki with appropriate access rights + Content Model would be enough...
[07:22:17] <_joe_> YuviPanda: I'm pretty serious, I can bake some mysql backend
[07:22:31] <_joe_> YuviPanda: what about locking?
[07:22:41] 'locking' as in?
[07:22:55] <_joe_> YuviPanda: my idea is - people get the right to write hiera files just for their own hosts and for their projects
[07:23:11] _joe_: yeah, we can do ACLs on the MediaWiki side.
[07:23:14] <_joe_> else, people would possibly be editing the same yaml file in concurrency
[07:23:28] <_joe_> if mysql is much much easier, no problem
[07:23:32] _joe_: it would be one YAML file per project, I would think.
[07:23:58] i wouldn't do mysql
[07:24:01] use the mediawiki api
[07:24:06] no, if there's to be an interface in wikitech where people are editing, I'd rather not have it be mysql. we already have infrastructure for editing JSON files in mediawiki.
[07:24:12] and JSON is valid YAML
[07:24:26] so we only need to add a namespace, and then figure out the access control
[07:24:40] and then use the API...
[07:24:45] <_joe_> ori: mysql as a hiera backend, I guessed mysql would be easier to manage from mediawiki
[07:24:55] YuviPanda: but hand-writing JSON is awful
[07:25:03] _joe_: yeah, mediawiki api as a backend
[07:25:11] <_joe_> oh ok
[07:25:25] ori: shouldn't be too hard to have a YAMLContentHandler too
[07:25:26] <_joe_> good idea maybe, what scares me is that it's gonna be slow
[07:25:35] why would it?
[07:25:58] it'd add <100ms per request
[07:26:11] not even, because it caches
[07:27:23] yeah...
[07:27:25] YuviPanda: i think a YAMLContentHandler would be good. It's true that JSON is a subset of YAML, but if you close your eyes and think 'YAML' you don't see a JSON blob
[07:27:31] heh
[07:27:33] YuviPanda: and Hiera is that kind of YAML
[07:27:37] true
[07:27:47] YAMLContentHandler wouldn't be too hard either
[07:27:51] and can just live in wikitech
[07:28:10] I wonder how to add php-yaml as a dependency though.
[07:28:33] might probably just require it in php and error out, and ensure it exists via puppet
[07:29:05] well, the maintainer of that library is bd808, and he's working on dependency management / librarization of core
[07:29:14] heh :D
[07:29:15] but really, i'd use a pure php implementation
[07:29:23] <_joe_> +1
[07:29:31] <_joe_> performance is not the issue here
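The proposal in this exchange is to keep per-project hiera data as wiki pages and have the puppetmaster read them through MediaWiki instead of a MySQL backend. A rough sketch of the lookup half (real hiera backends are Ruby; Python is used here only to illustrate the flow, and the Hiera: page-title scheme is a hypothetical naming choice, not something decided in the discussion):

```python
import urllib.parse
import urllib.request

import yaml  # PyYAML; JSON pages parse too, since JSON is a subset of YAML

WIKITECH = "https://wikitech.wikimedia.org/w/index.php"

def project_hiera(project, key):
    # action=raw returns the raw page source; one YAML page per project.
    query = urllib.parse.urlencode({"title": "Hiera:%s" % project,
                                    "action": "raw"})
    with urllib.request.urlopen("%s?%s" % (WIKITECH, query), timeout=10) as resp:
        data = yaml.safe_load(resp.read()) or {}
    return data.get(key)

# Key name is hypothetical, purely for illustration.
print(project_hiera("deployment-prep", "role::logstash::enabled"))
```

Access control would then live on the MediaWiki side (per-namespace ACLs), which is the "locking" question _joe_ raises above.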
[07:32:40] _joe_: I could do it, but I'm in the middle of writing shinkengen...
[07:32:42] does anyone know how I can add a security group to an instance in labs ?
[07:32:48] akosiaris: aaah, you can't
[07:32:56] you are killing me ...
[07:33:00] akosiaris: only allowed at instance creation time
[07:33:12] why ?
[07:33:13] <_joe_> akosiaris: wait for horizon!
[07:33:22] haha
[07:33:28] _joe_: I was told it is a OpenStack limitation
[07:33:32] and not a wikitech one
[07:33:35] ah
[07:33:42] <_joe_> YuviPanda: oh maybe
[07:33:45] I was about to ask if there is a cli I can do
[07:33:55] <_joe_> maybe of the version of openstack we are using
[07:34:06] that's possible.
[07:41:28] (CR) Alexandros Kosiaris: [C: -2] "I 'd rather we kept apertium-apy at its own port and not port 80" [puppet] - https://gerrit.wikimedia.org/r/167747 (owner: KartikMistry)
[07:42:19] (CR) Alexandros Kosiaris: [C: 2] Beta: No CI for Apertium [puppet] - https://gerrit.wikimedia.org/r/167748 (owner: KartikMistry)
[07:44:43] (CR) Alexandros Kosiaris: [C: -1] "I have no problem with actually supporting in init changing the port apertium-apy listens on, quite the contrary, but I do not like listen" [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167749 (owner: KartikMistry)
[08:03:20] (PS5) Alexandros Kosiaris: remove 10.0.0.0/16 Tampa subnet [puppet] - https://gerrit.wikimedia.org/r/164241 (owner: Dzahn)
[08:04:10] (PS1) Giuseppe Lavagetto: ocg: sanitize role::ocg::production, change alerts [puppet] - https://gerrit.wikimedia.org/r/167751
[08:13:42] akosiaris: the apertium-apy packages fails for me :(
[08:13:43] 08:11:42 dpkg-source: info: local changes detected, the modified files are:
[08:13:43] 08:11:42 source/.gitreview
[08:13:54] https://integration.wikimedia.org/ci/job/operations-debs-apertium-apy-debian-glue/1/console
[08:14:58] hashar: yeah I know. I am thinking of how to fix that
[08:15:06] I had to bypass it to build the package
[08:15:06] jenkins-debian-glue/debian/source/options:extend-diff-ignore = '^\.gitreview$'
[08:15:11] python-voluptuous/debian/source/options:extend-diff-ignore = '^\.gitreview$'
[08:15:16] oh
[08:15:25] hmm thanks
[08:15:26] both in packages I worked with
[08:15:30] probably devised by faidon
[08:15:45] cause there is exactly ZERO % chance I could have figured it out by myself hehe
[08:23:52] hashar: thanks, it works like a charm
[08:27:19] (Abandoned) KartikMistry: Beta: Change Apertium-APY port to 80 [puppet] - https://gerrit.wikimedia.org/r/167747 (owner: KartikMistry)
[08:27:58] (Restored) KartikMistry: Beta: Change Apertium-APY port to 80 [puppet] - https://gerrit.wikimedia.org/r/167747 (owner: KartikMistry)
[08:28:06] (PS3) KartikMistry: Beta: Change Apertium-APY port to 80 [puppet] - https://gerrit.wikimedia.org/r/167747
[08:29:02] hashar: thanks!
[08:32:24] (PS4) KartikMistry: Beta: Apertium: Remove unused comments for parameters [puppet] - https://gerrit.wikimedia.org/r/167747
[08:33:11] akosiaris: https://gerrit.wikimedia.org/r/#/c/167747/ is now probably useful :)
[08:33:19] (not urgent at all!)
[08:33:47] (CR) Alexandros Kosiaris: [C: 2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/167747 (owner: KartikMistry)
[08:34:24] (Abandoned) KartikMistry: Use port 80 for apertium-apy service [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167749 (owner: KartikMistry)
[08:38:32] (PS1) Alexandros Kosiaris: Ignore .gitreview when building source [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756
[08:42:02] (CR) Hashar: "recheck" [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[08:42:31] akosiaris: kart_ progress!!!! https://integration.wikimedia.org/ci/job/operations-debs-apertium-apy-debian-glue/2/console :D
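For reference, the fix hashar points akosiaris to above is a two-line source options file that tells dpkg-source to ignore local changes to .gitreview when building, exactly as quoted from the jenkins-debian-glue and python-voluptuous packages:

```
# debian/source/options
extend-diff-ignore = '^\.gitreview$'
```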
[08:43:24] (CR) Hashar: [C: 1] "Works for me :-)" [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[08:44:10] (CR) Alexandros Kosiaris: [C: 2] remove 10.0.0.0/16 Tampa subnet [puppet] - https://gerrit.wikimedia.org/r/164241 (owner: Dzahn)
[08:47:20] (CR) Hashar: "The Jenkins job now complains with lintian error:" [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[08:47:32] commute time
[08:47:44] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail
[08:47:44] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail
[08:47:45] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail
[08:47:54] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail
[08:47:55] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: puppet fail
[08:47:55] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail
[08:48:03] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail
[08:48:03] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail
[08:48:04] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail
[08:48:13] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: puppet fail
[08:48:14] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail
[08:48:45] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[08:48:45] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail
[08:48:59] PROBLEM - puppet last run on polonium is CRITICAL: CRITICAL: puppet fail
[08:49:14] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: puppet fail
[08:49:24] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail
[08:49:25] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: puppet fail
[08:49:35] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail
[08:49:36] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail
[08:49:36] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: puppet fail
[08:50:04] !log temporarily killed icinga-wm
[08:50:12] Logged the message, Master
[08:50:41] ok my bad
[08:51:01] Error 400 on SERVER: $all_network_subnets["production"]["pmtpa"]["private"]["private"] is :undef, not a hash or array at /etc/puppet/manifests/network.pp:252
[08:51:02] fixing
[08:54:31] (PS2) KartikMistry: Ignore .gitreview when building source [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[08:54:36] (PS1) Alexandros Kosiaris: Fix regression introduced in 9b8ead0 [puppet] - https://gerrit.wikimedia.org/r/167758
[08:56:12] (CR) Alexandros Kosiaris: [C: 2] Fix regression introduced in 9b8ead0 [puppet] - https://gerrit.wikimedia.org/r/167758 (owner: Alexandros Kosiaris)
[08:58:32] (PS1) Alexandros Kosiaris: Ignore .gitreview when building source [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/167759
[08:58:35] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail
[08:58:35] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:58:45] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: puppet fail
[08:58:48] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: puppet fail
[08:58:54] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: puppet fail
[08:58:54] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: puppet fail
[08:59:13] let's wait another 10 mins before reenabling icinga-wm
[09:01:00] (CR) Filippo Giunchedi: [C: 1] gerrit: move to module [puppet] - https://gerrit.wikimedia.org/r/167215 (owner: Giuseppe Lavagetto)
[09:01:17] godog: thanks! :-)
[09:01:26] no, _joe_: thanks!
[09:01:29] actually both
[09:01:52] haha yeah _joe_ did most of the work :)
[09:02:17] _joe_: btw I think after having the puppet compiler output in https://gerrit.wikimedia.org/r/#/c/167183/ we're set
[09:06:30] <_joe_> godog: yeah I'm kinda completely immersed in hhvm packaging right now
[09:06:38] <_joe_> might do that later/whenever
[09:07:04] (PS1) Alexandros Kosiaris: Ignore .gitreview when building source [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/167763
[09:07:57] RECOVERY - puppet last run on polonium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[09:08:04] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[09:08:06] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures
[09:08:06] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:08:14] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[09:08:15] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:08:15] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:08:34] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:08:34] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:08:45] RECOVERY - puppet last run on plutonium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:08:45] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:09:00] !log enabled icinga-wm again
[09:09:05] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:09:05] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[09:09:08] Logged the message, Master
[09:09:18] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[09:09:18] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:09:24] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[09:09:25] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:09:34] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:09:35] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:09:36] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:09:36] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[09:09:54] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:10:01] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:10:04] RECOVERY - puppet last run on mw1098 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:10:04] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[09:10:14] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[09:10:15] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[09:10:15] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:10:17] (PS1) Alexandros Kosiaris: Ignore .gitreview when building source [debs/contenttranslation/apertium-lex-tools] - https://gerrit.wikimedia.org/r/167764
[09:10:24] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[09:10:24] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures
[09:10:25] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:10:34] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[09:10:54] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:10:55] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:11:14] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:11:25] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:11:46] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:11:47] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:11:54] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:11:55] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:11:55] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:12:04] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:12:06] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[09:12:14] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:12:15] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 63 seconds ago with 0 failures
[09:12:36] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:12:42] note to self: next time wait 20 mins ...
[09:12:45] RECOVERY - puppet last run on mw1033 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:12:45] (CR) Filippo Giunchedi: Introduce LLDP facts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/167644 (owner: Alexandros Kosiaris)
[09:13:05] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:13:15] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[09:13:15] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:13:15] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:13:34] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:14:05] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:14:05] RECOVERY - puppet last run on cp1037 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[09:14:05] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:14:14] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:14:14] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[09:14:15] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:14:34] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:14:34] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:14:35] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures
[09:14:36] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:14:45] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[09:14:52] (CR) Filippo Giunchedi: "yep I was interested in how the temp containers are used by mw" [puppet] - https://gerrit.wikimedia.org/r/167310 (owner: Aaron Schulz)
[09:15:05] RECOVERY - puppet last run on mw1066 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[09:15:06] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[09:15:24] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:15:24] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:15:25] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:15:25] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:15:45] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:15:52] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:15:54] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:15:54] RECOVERY - puppet last run on mw1071 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:16:05] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:16:05] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:16:05] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:16:14] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 64 seconds ago with 0 failures
[09:16:14] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[09:16:15] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[09:16:15] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:16:24] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:16:25] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[09:16:25] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:16:34] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:16:35] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[09:16:35] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:17:05] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[09:17:05] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:17:14] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:17:14] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:17:20] re
[09:17:24] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[09:17:31] (PS2) Filippo Giunchedi: Don't enable ganglia config on beta [puppet] - https://gerrit.wikimedia.org/r/167744 (owner: Chad)
[09:17:34] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:17:35] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:17:38] (CR) Filippo Giunchedi: [C: 2 V: 2] Don't enable ganglia config on beta [puppet] - https://gerrit.wikimedia.org/r/167744 (owner: Chad)
[09:17:45] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:17:46] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[09:17:47] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[09:18:14] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:18:14] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[09:18:15] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[09:18:15] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[09:18:34] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:18:35] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[09:19:04] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:21:16] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67424 bytes in 1.598 second response time
[09:27:04] (CR) Alexandros Kosiaris: Introduce LLDP facts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/167644 (owner: Alexandros Kosiaris)
[09:27:43] (CR) Filippo Giunchedi: Another es-tool function: restart a node the fast & easy way (3 comments) [puppet] - https://gerrit.wikimedia.org/r/164401 (owner: Chad)
[09:29:33] _joe_: https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Wikitech_hiera wrote up a small plan for hiera on wikitech. Shouldn't take too long to implement, I'm just not sure when to spend time on it
[09:30:20] (CR) Filippo Giunchedi: [C: 1] Improve error handling [puppet] - https://gerrit.wikimedia.org/r/167745 (owner: Chad)
[09:33:17] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:35:57] kart_: http://shinken.wmflabs.org/host/deployment-apertium02 that host is down
[09:37:12] YuviPanda: yes. deleted.
[09:37:18] kart_: ah, cool
[09:37:20] new is apertium01
[09:37:26] right
[09:37:32] shinkengen isn't running on a cron yet
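shinkengen (its hosts & hostgroups generator is merged a little later, at 09:52) generates Shinken monitoring config from the list of labs instances; the stale deployment-apertium02 entry above is what you get when the generator has not re-run after an instance is deleted. A toy sketch of the generation step (instance data hardcoded here; the real tool pulls it from the infrastructure):

```python
def shinken_hosts(instances):
    """Render Nagios/Shinken-style 'define host' blocks for (name, ip) pairs."""
    blocks = []
    for name, ip in instances:
        blocks.append(
            "define host {\n"
            "    use         generic-host\n"
            "    host_name   %s\n"
            "    address     %s\n"
            "}\n" % (name, ip)
        )
    return "\n".join(blocks)

# Illustrative data; the IP is hypothetical. Re-running this from cron
# is exactly the missing piece noted above.
print(shinken_hosts([("deployment-apertium01", "10.68.16.1")]))
```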
[09:40:56] (CR) Filippo Giunchedi: Adding tools for banning/unbanning an ES node (3 comments) [puppet] - https://gerrit.wikimedia.org/r/164617 (owner: Chad)
[09:41:27] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 341192 msg: ocg_render_job_queue 516 msg (=500 critical)
[09:41:47] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 341405 msg: ocg_render_job_queue 545 msg (=500 critical)
[09:43:36] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 341655 msg: ocg_render_job_queue 0 msg
[09:43:58] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 341719 msg: ocg_render_job_queue 0 msg
[09:48:51] hrm
[09:48:57] Oct 21 03:12:05 re1.cr2-eqiad mib2d[2065]: SNMP_TRAP_LINK_DOWN: ifIndex 661, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-4/2/0
[09:49:00] Oct 21 06:02:44 re1.cr2-eqiad mib2d[2065]: SNMP_TRAP_LINK_DOWN: ifIndex 661, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-4/2/0
[09:49:03] Telia mailed us about that
[09:49:10] but there was no alert as far as I can see
[09:49:16] icinga alert I mean
[09:50:47] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:51:26] mark: cr2-ulsfo xe-1/2/0 is down since 02:52 UTC
[09:51:53] mark: maybe we should debug before off/lineing again
[09:52:36] (CR) Yuvipanda: [C: 2 V: 2] Add hosts & hostgroups generator [software/shinkengen] - https://gerrit.wikimedia.org/r/167595 (owner: Yuvipanda)
[09:53:50] (CR) Hashar: "Wonderful the job is happy. You might want to mention in the commit summary message that you added dpkg-dev as a build dep :-D Otherwise" (1 comment) [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/167756 (owner: Alexandros Kosiaris)
[10:01:32] (CR) Filippo Giunchedi: Introduce LLDP facts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/167644 (owner: Alexandros Kosiaris)
[10:09:03] paravoid: so same issue after the card swap?
[10:20:26] mark: yes
[10:20:34] hm
[10:20:42] so that leaves the XFP after all, or the router itself
[10:20:48] or perhaps we should upgrade junos
[10:21:43] how can it be junos, it used to work fine for quite a while
[10:26:19] you don't know what may have changed on the other end
[10:26:23] but i agree, not very likely
[10:32:14] (CR) KartikMistry: "Please add dpkg-dev dependency and changelog entry." [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/167759 (owner: Alexandros Kosiaris)
[10:41:21] (PS1) Alexandros Kosiaris: Reorganize backup roles [puppet] - https://gerrit.wikimedia.org/r/167771
[10:43:01] hashar: there's an alert firing up named "BetaLabs: Puppet freshness check"
[10:43:05] CRITICAL: deployment-prep.deployment-logstash1.puppetagent.time_since_last_run.value (>100.00%) WARN: deployment-prep.deployment-elastic06.puppetagent.time_since_last_run.value (>100.00%) deployment-prep.deployment-elastic07.puppetagent.time_since_last_run.value (>100.00%)
[10:43:15] that's a weird prod check...
[10:44:09] (CR) Hashar: "Nit: the modules/sudo/files/sudoers.appserver file has a comment pointing to the old puppet path:" [puppet] - https://gerrit.wikimedia.org/r/167183 (owner: Giuseppe Lavagetto)
[10:45:05] paravoid: yeah they are running on labsmon host which is in production. The checks look for some graphite metrics
[10:45:18] I am supposed to receive mail notifications for them
[10:45:50] deployment-logstash1 has puppet disabled :D
[10:46:06] why are we running labs check from production?
[10:46:21] YuviPanda: !!! ^^ :D
[10:46:42] checks even
[10:46:53] elastic06 and 07 are fresh instances
[10:46:58] I don't even have access to deployment-prep, let alone intimate knowledge of how it works
[10:47:02] how am I supposed to fix those
[10:48:35] ugh
[10:49:51] <_joe_> paravoid: I can look into that maybe?
[10:50:01] <_joe_> I know beta quite well nowadays
[10:50:22] that's besides the point :)
[10:50:29] I can probably figure it out myself too
[10:50:31] <_joe_> I know :)
[10:50:48] but I don't think we should be polluting prod icinga with labs alerts, even if it's betalabs...
[10:51:23] I'm looking at puppet and seems even more complicated than that, there's a second graphite that this check polls
[10:52:11] <_joe_> I think I explicitly said that was a bad idea, IDK how that went on afterwards
[10:53:01] (CR) Giuseppe Lavagetto: "@filippo: yes, we should get rid of the whole sudo mess probably; I'll take a look adn amend this patch" [puppet] - https://gerrit.wikimedia.org/r/167183 (owner: Giuseppe Lavagetto)
[10:58:01] paravoid: where have you received those alarms?
[10:58:17] cause they are supposed to poke #wikimedia-qa and mail a few folks (greg me maybe one or two others)
[10:58:26] btw elastic06 and elastic07 are fixed now :]
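The BetaLabs freshness alert discussed above is driven by Graphite metrics such as deployment-prep.<host>.puppetagent.time_since_last_run; a checker only needs the standard Graphite render API. A minimal sketch (the Graphite URL and threshold are illustrative assumptions):

```python
import json
import urllib.parse
import urllib.request

GRAPHITE = "http://labmon1001.eqiad.wmnet"  # host assumed from the discussion

def last_value(target):
    # The render API returns [{"target": ..., "datapoints": [[value, ts], ...]}].
    qs = urllib.parse.urlencode({"target": target, "format": "json", "from": "-1h"})
    with urllib.request.urlopen("%s/render?%s" % (GRAPHITE, qs), timeout=10) as resp:
        series = json.load(resp)
    points = [v for v, _ in series[0]["datapoints"] if v is not None]
    return points[-1] if points else None

metric = "deployment-prep.deployment-logstash1.puppetagent.time_since_last_run.value"
value = last_value(metric)
if value is not None and value > 3600:  # threshold illustrative
    print("CRITICAL: %s = %s" % (metric, value))
```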
The checks look for some graphite metrics [10:45:18] I am supposed to receive mail notifications for them [10:45:50] deployment-logstash1 has puppet disabled :D [10:46:06] why are we running labs check from production? [10:46:21] YuviPanda: !!! ^^ :D [10:46:42] checks even [10:46:53] elastic06 and 07 are fresh instances [10:46:58] I don't even have access to deployment-prep, let alone intimate knowledge of how it works [10:47:02] how am I supposed to fix those [10:48:35] ugh [10:49:51] <_joe_> paravoid: I can look into that maybe? [10:50:01] <_joe_> I know beta quite well nowadays [10:50:22] that's besides the point :) [10:50:29] I can probably figure it out myself too [10:50:31] <_joe_> I know :) [10:50:48] but I don't think we should be polluting prod icinga with labs alerts, even if it's betalabs... [10:51:23] I'm looking at puppet and seems even more complicated than that, there's a second graphite that this check polls [10:52:11] <_joe_> I think I explicitly said that was a bad idea, IDK how that went on afterwards [10:53:01] (03CR) 10Giuseppe Lavagetto: "@filippo: yes, we should get rid of the whole sudo mess probably; I'll take a look and amend this patch" [puppet] - 10https://gerrit.wikimedia.org/r/167183 (owner: 10Giuseppe Lavagetto) [10:58:01] paravoid: where have you received those alarms? [10:58:17] cause they are supposed to poke #wikimedia-qa and mail a few folks (greg me maybe one or two others) [10:58:26] btw elastic06 and elastic07 are fixed now :] [11:00:58] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:01:24] robh: Are we having SFO issues again? [11:01:40] Request: GET http://meta.wikimedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=72, from 10.128.0.116 via cp4010 cp4010 ([10.128.0.110]:3128), Varnish XID 3464671913 [11:01:40] Forwarded for: 103.254.5.215, 10.128.0.116, 10.128.0.116 [11:01:40] Error: 503, Service Unavailable at Tue, 21 Oct 2014 11:00:34 GMT [11:02:08] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [11:03:09] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:00] <_joe_> sDrewth: thanks for reporting it [11:06:35] (03PS1) 10Springle: repool db1066, depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167776 [11:08:33] (03CR) 10Springle: [C: 032] repool db1066, depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167776 (owner: 10Springle) [11:08:41] (03Merged) 10jenkins-bot: repool db1066, depool db1065 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167776 (owner: 10Springle) [11:09:57] !log springle Synchronized wmf-config/db-eqiad.php: repool db1066, warm up (duration: 00m 06s) [11:10:03] Logged the message, Master [11:10:38] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:13:49] thx for your attention _joe_ [11:14:50] (03PS1) 10Filippo Giunchedi: backup ganglia data on the collector [puppet] - 10https://gerrit.wikimedia.org/r/167778 [11:18:28] moar!
:-/ [11:18:53] (03PS3) 10QChris: Configure gerrit's its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/167524 [11:19:53] (03CR) 10QChris: Configure gerrit's its-phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [11:20:43] (03CR) 10QChris: Configure gerrit's its-phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [11:20:59] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:21:20] (03CR) 10QChris: [C: 04-1] "Blocking until certificate has been filled in." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [11:25:33] off for lunch [11:27:08] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:27:16] Request: GET http://meta.wikimedia.org/wiki/MassMessage, from 10.128.0.108 via cp4010 cp4010 ([10.128.0.110]:3128), Varnish XID 3465358068 [11:27:22] intermittent issues [11:28:59] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:30:20] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [11:30:20] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:32:18] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:35:16] (03PS1) 10Filippo Giunchedi: depool ulsfo from DNS [dns] - 10https://gerrit.wikimedia.org/r/167779 [11:36:39] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:39:49] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [11:41:49] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [11:42:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] depool ulsfo from DNS [dns] - 10https://gerrit.wikimedia.org/r/167779 (owner: 10Filippo Giunchedi) [11:42:39] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [11:43:19] !log drained ulsfo via DNS, GTT link problems [11:43:19] PROBLEM - Host text-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1 [11:43:24] Logged the message, Master [11:43:27] PROBLEM - Host bits-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:a [11:43:34] PROBLEM - Host upload-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::2:b [11:43:39] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:863:ed1a::1:c [11:43:41] grrr [11:44:02] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:02] PROBLEM - Host cr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [11:44:02] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:02] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:02] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:02] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:02] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:03] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:03] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:04] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:04] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:05] PROBLEM - Host cp4015 is 
DOWN: PING CRITICAL - Packet loss = 100% [11:44:05] PROBLEM - Host text-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:44:09] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:10] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:10] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:10] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:10] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:10] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:10] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:11] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:11] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:12] PROBLEM - Host upload-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:44:17] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:17] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:17] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:17] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:44:23] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [11:44:24] PROBLEM - Host cr2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [11:44:24] PROBLEM - Host bits-lb.ulsfo.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:45:48] eek [11:45:52] everything down now [11:45:54] RECOVERY - Host lvs4004 is UP: PING WARNING - Packet loss = 93%, RTA = 73.37 ms [11:45:54] RECOVERY - Host cp4017 is UP: PING WARNING - Packet loss = 93%, RTA = 73.32 ms [11:45:54] RECOVERY - Host cp4007 is UP: PING WARNING - Packet loss = 93%, RTA = 73.36 ms [11:45:55] RECOVERY - Host cp4004 is UP: PING WARNING - Packet loss = 93%, RTA = 73.33 ms [11:45:56] RECOVERY - Host bast4001 is UP: PING WARNING - Packet loss = 93%, RTA = 73.45 ms [11:46:13] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [11:46:46] RECOVERY - Host cp4016 is UP: PING WARNING - Packet loss = 93%, RTA = 77.32 ms [11:46:46] RECOVERY - Host cp4012 is UP: PING WARNING - Packet loss = 93%, RTA = 72.05 ms [11:46:46] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 93%, RTA = 75.98 ms [11:46:46] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 93%, RTA = 75.99 ms [11:46:46] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 93%, RTA = 73.90 ms [11:46:46] RECOVERY - Host cp4001 is UP: PING WARNING - Packet loss = 93%, RTA = 73.91 ms [11:46:46] RECOVERY - Host lvs4002 is UP: PING WARNING - Packet loss = 93%, RTA = 77.42 ms [11:47:36] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [11:47:38] PROBLEM - Host backup4001 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:39] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 80%, RTA = 74.65 ms [11:48:39] RECOVERY - Host cp4006 is UP: PING WARNING - Packet loss = 80%, RTA = 74.34 ms [11:48:49] RECOVERY - Host cp4012 is UP: PING WARNING - Packet loss = 80%, RTA = 73.49 ms [11:48:50] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:50] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:50] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:51] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [11:49:24] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 93%, RTA = 
73.90 ms [11:49:24] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 93%, RTA = 73.86 ms [11:49:24] RECOVERY - Host cp4016 is UP: PING WARNING - Packet loss = 93%, RTA = 73.45 ms [11:49:24] RECOVERY - Host cp4013 is UP: PING WARNING - Packet loss = 93%, RTA = 72.12 ms [11:49:25] RECOVERY - Host lvs4003 is UP: PING WARNING - Packet loss = 93%, RTA = 74.24 ms [11:49:25] RECOVERY - Host cp4002 is UP: PING WARNING - Packet loss = 93%, RTA = 72.08 ms [11:49:25] RECOVERY - Host lvs4001 is UP: PING WARNING - Packet loss = 93%, RTA = 72.76 ms [11:49:25] RECOVERY - Host cp4005 is UP: PING WARNING - Packet loss = 93%, RTA = 72.05 ms [11:49:25] RECOVERY - Host cp4011 is UP: PING WARNING - Packet loss = 93%, RTA = 73.92 ms [11:49:26] RECOVERY - Host cp4018 is UP: PING WARNING - Packet loss = 86%, RTA = 73.85 ms [11:49:26] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 86%, RTA = 73.83 ms [11:49:27] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 86%, RTA = 73.81 ms [11:49:27] RECOVERY - Host cp4010 is UP: PING WARNING - Packet loss = 86%, RTA = 72.23 ms [11:49:28] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 86%, RTA = 73.58 ms [11:49:29] RECOVERY - Host text-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 73.08 ms [11:49:35] RECOVERY - Host cr1-ulsfo is UP: PING WARNING - Packet loss = 93%, RTA = 138.63 ms [11:49:59] RECOVERY - Host cp4003 is UP: PING WARNING - Packet loss = 80%, RTA = 72.11 ms [11:49:59] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 86%, RTA = 77.00 ms [11:51:14] PROBLEM - HTTPS_unified on cp4019 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:51:14] PROBLEM - Varnish traffic logger on cp4017 is CRITICAL: Timeout while attempting connection [11:51:14] PROBLEM - DPKG on cp4012 is CRITICAL: Timeout while attempting connection [11:51:14] PROBLEM - puppet last run on cp4017 is CRITICAL: Timeout while attempting connection [11:51:14] PROBLEM - Varnish HTCP daemon on cp4019 is CRITICAL: Timeout while attempting connection [11:51:14] PROBLEM - DPKG on cp4004 is CRITICAL: Timeout while attempting connection [11:51:15] PROBLEM - check configured eth on cp4004 is CRITICAL: Timeout while attempting connection [11:51:15] PROBLEM - Varnish HTTP mobile-frontend on cp4019 is CRITICAL: Connection timed out [11:51:16] PROBLEM - Varnish traffic logger on cp4012 is CRITICAL: Timeout while attempting connection [11:51:16] PROBLEM - puppet last run on cp4012 is CRITICAL: Timeout while attempting connection [11:51:17] PROBLEM - puppet last run on cp4001 is CRITICAL: Timeout while attempting connection [11:51:17] PROBLEM - HTTPS_unified on cp4012 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:51:18] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [11:51:18] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [11:51:19] PROBLEM - Varnish HTTP bits on cp4004 is CRITICAL: Connection timed out [11:51:20] PROBLEM - HTTPS_unified on cp4001 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:51:24] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [11:51:36] PROBLEM - check configured eth on bast4001 is CRITICAL: Timeout while attempting connection [11:51:46] PROBLEM - DPKG on bast4001 is CRITICAL: Timeout while attempting connection [11:51:55] PROBLEM - RAID on bast4001 is CRITICAL: Timeout while attempting connection [11:51:55] PROBLEM - puppet last run on bast4001 is CRITICAL: Timeout while attempting connection [11:51:56] 
PROBLEM - puppet last run on cp4008 is CRITICAL: Timeout while attempting connection [11:52:06] PROBLEM - HTTPS_unified on cp4020 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:52:20] PROBLEM - SSH on bast4001 is CRITICAL: Connection timed out [11:52:20] RECOVERY - Host text-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 61%, RTA = 72.26 ms [11:52:33] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.37 ms [11:52:34] RECOVERY - Host upload-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 73.53 ms [11:52:39] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 72.29 ms [11:52:39] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 72.29 ms [11:52:39] RECOVERY - Host bits-lb.ulsfo.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 74.00 ms [11:52:54] RECOVERY - Varnish HTTP bits on cp4004 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.147 second response time [11:52:56] RECOVERY - HTTPS_unified on cp4019 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:52:56] RECOVERY - HTTPS_unified on cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:53:04] PROBLEM - puppet last run on lvs4002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:53:05] PROBLEM - Varnish HTTP mobile-frontend on cp4011 is CRITICAL: Connection timed out [11:53:05] PROBLEM - DPKG on cp4015 is CRITICAL: Timeout while attempting connection [11:53:05] PROBLEM - SSH on cp4011 is CRITICAL: Connection timed out [11:53:05] PROBLEM - Varnish HTTP upload-backend on cp4015 is CRITICAL: Connection timed out [11:53:05] PROBLEM - Varnish HTTP text-frontend on cp4016 is CRITICAL: Connection timed out [11:53:05] PROBLEM - puppet last run on cp4011 is CRITICAL: Timeout while attempting connection [11:53:06] PROBLEM - check configured eth on cp4014 is CRITICAL: Timeout while attempting connection [11:53:06] PROBLEM - check configured eth on cp4020 is CRITICAL: Timeout while attempting connection [11:53:07] PROBLEM - RAID on lvs4002 is CRITICAL: Timeout while attempting connection [11:53:07] PROBLEM - DPKG on cp4011 is CRITICAL: Timeout while attempting connection [11:53:08] PROBLEM - check configured eth on cp4008 is CRITICAL: Timeout while attempting connection [11:53:08] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [11:53:15] PROBLEM - puppet last run on cp4020 is CRITICAL: Timeout while attempting connection [11:53:15] PROBLEM - DPKG on cp4009 is CRITICAL: Timeout while attempting connection [11:53:15] PROBLEM - DPKG on cp4016 is CRITICAL: Timeout while attempting connection [11:53:15] PROBLEM - puppet last run on cp4005 is CRITICAL: Timeout while attempting connection [11:53:15] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [11:53:16] PROBLEM - puppet last run on cp4015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
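For context on the alert flood above: filippo's "drained ulsfo via DNS" works because the public service names are GeoDNS-mapped, so depooling a site just stops resolvers from being handed that site's LVS addresses; the eqiad monitoring host still reaches the ulsfo machines directly and watches them flap. A stdlib-only sketch of a depool sanity check follows; the ulsfo prefix is an assumption extrapolated from the 198.35.26.x router addresses in this log, the service names are assumed, and the answer depends on where you resolve from and on TTLs:

    # Check whether GeoDNS still hands out ulsfo addresses for the
    # public service names after a depool. Prefix and names assumed.
    import ipaddress
    import socket

    ULSFO = ipaddress.ip_network("198.35.26.0/24")  # assumed ulsfo range

    def addrs_for(hostname):
        infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
        return {ipaddress.ip_address(info[4][0]) for info in infos}

    for name in ("text-lb.wikimedia.org", "upload-lb.wikimedia.org"):
        leftover = [a for a in addrs_for(name) if a in ULSFO]
        print(name, "still resolves to ulsfo:" if leftover else "clear", leftover or "")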
[11:53:16] PROBLEM - HTTPS_unified on cp4009 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:53:17] PROBLEM - HTTPS_unified on cp4008 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:53:26] PROBLEM - RAID on cp4020 is CRITICAL: Timeout while attempting connection [11:53:26] PROBLEM - DPKG on cp4008 is CRITICAL: Timeout while attempting connection [11:53:35] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [11:54:03] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 1416 seconds ago with 0 failures [11:54:04] PROBLEM - DPKG on cp4020 is CRITICAL: Timeout while attempting connection [11:54:04] PROBLEM - puppet last run on lvs4004 is CRITICAL: Timeout while attempting connection [11:54:04] PROBLEM - check if dhclient is running on cp4008 is CRITICAL: Timeout while attempting connection [11:54:04] PROBLEM - Varnishkafka log producer on cp4008 is CRITICAL: Timeout while attempting connection [11:54:04] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [11:54:04] RECOVERY - Host backup4001 is UP: PING WARNING - Packet loss = 66%, RTA = 73.41 ms [11:54:29] RECOVERY - SSH on bast4001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [11:54:29] PROBLEM - HTTPS_unified on cp4003 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [11:54:29] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 9.11388505882 [11:54:52] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 941 seconds ago with 0 failures [11:54:52] RECOVERY - HTTPS_unified on cp4008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:54:52] RECOVERY - HTTPS_unified on cp4001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:54:52] RECOVERY - HTTPS_unified on cp4009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:54:52] PROBLEM - check configured eth on cp4010 is CRITICAL: Timeout while attempting connection [11:54:52] PROBLEM - check configured eth on cp4006 is CRITICAL: Timeout while attempting connection [11:54:53] PROBLEM - DPKG on cp4010 is CRITICAL: Timeout while attempting connection [11:55:37] RECOVERY - HTTPS_unified on cp4020 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:55:37] RECOVERY - HTTPS_unified on cp4003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 456 days) [11:55:41] RECOVERY - Host upload-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 86%, RTA = 74.69 ms [11:56:09] PROBLEM - Host cp4011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:56:30] PROBLEM - check configured eth on cp4002 is CRITICAL: Timeout while attempting connection [11:56:30] PROBLEM - puppet last run on cp4002 is CRITICAL: Timeout while attempting connection [11:56:33] !log silenced *-lb.ulsfo.wikimedia.org [11:56:41] Logged the message, Master [11:56:49] PROBLEM - SSH on cp4002 is CRITICAL: Connection timed out [11:56:49] PROBLEM - SSH on cp4004 is 
CRITICAL: Connection timed out [11:56:49] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67654 bytes in 0.460 second response time [11:56:59] RECOVERY - DPKG on bast4001 is OK: All packages OK [11:56:59] PROBLEM - SSH on lvs4001 is CRITICAL: Connection timed out [11:56:59] RECOVERY - RAID on bast4001 is OK: OK: no RAID installed [11:57:00] RECOVERY - check if dhclient is running on cp4008 is OK: PROCS OK: 0 processes with command name dhclient [11:57:00] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 1295 seconds ago with 0 failures [11:57:00] RECOVERY - DPKG on cp4020 is OK: All packages OK [11:57:00] RECOVERY - Varnishkafka log producer on cp4008 is OK: PROCS OK: 1 process with command name varnishkafka [11:57:01] RECOVERY - Host cp4011 is UP: PING WARNING - Packet loss = 66%, RTA = 75.32 ms [11:57:11] PROBLEM - SSH on cp4014 is CRITICAL: Connection timed out [11:57:20] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host lvs4003 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:25] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:26] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:26] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:27] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [11:57:31] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:57:47] PROBLEM - Host cr1-ulsfo is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:57:47] PROBLEM - Check rp_filter disabled on lvs4001 is CRITICAL: Timeout while attempting connection [11:58:18] PROBLEM - RAID on cp4018 is CRITICAL: Timeout while attempting connection [11:58:18] PROBLEM - check if dhclient is running on lvs4001 is CRITICAL: Timeout while attempting connection [11:58:47] RECOVERY - Host cp4003 is UP: PING WARNING - Packet loss = 80%, RTA = 75.47 ms [11:58:57] RECOVERY - Check rp_filter disabled on lvs4001 is OK: OK: kernel parameters are set to expected value. 
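The "!log silenced *-lb.ulsfo.wikimedia.org" above refers to muting the LVS service notifications while the link flaps. In Icinga 1.x that can be done from the web UI or by writing a standard external command; a hedged sketch follows, where the command-file path is an assumption (it is whatever command_file is set to in icinga.cfg):

    # Schedule fixed host downtime via Icinga's external command file.
    # SCHEDULE_HOST_DOWNTIME is a stock Nagios/Icinga 1.x command; the
    # command-file path below is an assumption for this sketch.
    import time

    CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed path

    def schedule_host_downtime(host, minutes, author, comment):
        now = int(time.time())
        end = now + minutes * 60
        # fields: host;start;end;fixed;trigger_id;duration;author;comment
        cmd = (f"[{now}] SCHEDULE_HOST_DOWNTIME;{host};{now};{end};"
               f"1;0;{minutes * 60};{author};{comment}\n")
        with open(CMD_FILE, "w") as f:
            f.write(cmd)

    schedule_host_downtime("text-lb.ulsfo.wikimedia.org", 60,
                           "godog", "ulsfo drained, GTT link problems")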
[11:59:06] RECOVERY - Host cp4001 is UP: PING WARNING - Packet loss = 86%, RTA = 75.11 ms [11:59:07] RECOVERY - Host cp4002 is UP: PING WARNING - Packet loss = 86%, RTA = 80.84 ms [11:59:07] RECOVERY - Host cp4004 is UP: PING WARNING - Packet loss = 86%, RTA = 75.07 ms [11:59:10] RECOVERY - Host lvs4003 is UP: PING WARNING - Packet loss = 86%, RTA = 75.13 ms [11:59:10] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 86%, RTA = 75.33 ms [11:59:12] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 86%, RTA = 76.47 ms [11:59:15] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 86%, RTA = 75.53 ms [11:59:15] RECOVERY - Host cp4010 is UP: PING WARNING - Packet loss = 80%, RTA = 76.27 ms [11:59:15] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 76.17 ms [11:59:26] RECOVERY - Host cp4017 is UP: PING WARNING - Packet loss = 80%, RTA = 76.07 ms [11:59:59] RECOVERY - SSH on cp4014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:00:06] RECOVERY - Host bast4001 is UP: PING WARNING - Packet loss = 93%, RTA = 75.55 ms [12:00:26] PROBLEM - HTTPS_unified on cp4019 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [12:00:29] PROBLEM - puppet last run on cp4018 is CRITICAL: Timeout while attempting connection [12:00:29] PROBLEM - HTTPS_unified on cp4018 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 [12:00:46] PROBLEM - Varnish HTTP text-backend on cp4009 is CRITICAL: Connection timed out [12:00:46] PROBLEM - Varnish HTTP upload-frontend on cp4006 is CRITICAL: Connection timed out [12:00:46] PROBLEM - RAID on cp4009 is CRITICAL: Timeout while attempting connection [12:00:46] PROBLEM - SSH on cp4019 is CRITICAL: Connection timed out [12:00:46] PROBLEM - SSH on cp4018 is CRITICAL: Connection timed out [12:00:47] PROBLEM - Varnish HTTP mobile-backend on cp4011 is CRITICAL: Connection timed out [12:00:47] PROBLEM - Varnish HTTP upload-backend on cp4006 is CRITICAL: Connection timed out [12:00:48] PROBLEM - puppet last run on cp4019 is CRITICAL: Timeout while attempting connection [12:00:48] PROBLEM - RAID on cp4011 is CRITICAL: Timeout while attempting connection [12:00:49] PROBLEM - SSH on cp4009 is CRITICAL: Connection timed out [12:00:49] PROBLEM - Varnish HTCP daemon on cp4009 is CRITICAL: Timeout while attempting connection [12:00:50] PROBLEM - Varnishkafka log producer on cp4011 is CRITICAL: Timeout while attempting connection [12:00:50] PROBLEM - SSH on cp4006 is CRITICAL: Connection timed out [12:00:51] PROBLEM - RAID on cp4006 is CRITICAL: Timeout while attempting connection [12:00:51] PROBLEM - Disk space on cp4019 is CRITICAL: Timeout while attempting connection [12:00:52] PROBLEM - check if salt-minion is running on lvs4001 is CRITICAL: Timeout while attempting connection [12:00:52] PROBLEM - check configured eth on lvs4001 is CRITICAL: Timeout while attempting connection [12:00:53] PROBLEM - DPKG on cp4018 is CRITICAL: Timeout while attempting connection [12:00:57] PROBLEM - RAID on cp4019 is CRITICAL: Timeout while attempting connection [12:00:57] PROBLEM - puppet last run on cp4006 is CRITICAL: Timeout while attempting connection [12:01:36] RECOVERY - RAID on cp4020 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:01:37] RECOVERY - Varnish HTTP text-backend on cp4009 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.153 second response time [12:01:37] RECOVERY - RAID on cp4009 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:01:37] RECOVERY - Varnish HTTP 
upload-frontend on cp4006 is OK: HTTP OK: HTTP/1.1 200 OK - 328 bytes in 1.152 second response time [12:01:37] RECOVERY - SSH on cp4019 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:01:37] RECOVERY - SSH on cp4004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:01:37] RECOVERY - SSH on cp4002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:01:38] RECOVERY - SSH on cp4018 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:01:47] RECOVERY - Host bits-lb.ulsfo.wikimedia.org is UP: PING WARNING - Packet loss = 80%, RTA = 76.66 ms [12:02:02] RECOVERY - Host cr2-ulsfo is UP: PING WARNING - Packet loss = 80%, RTA = 75.71 ms [12:02:03] PROBLEM - Host lvs4003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:02:03] PROBLEM - Host cp4014 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:02:03] PROBLEM - Host cp4015 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:02:12] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 23.0722363964 [12:02:25] PROBLEM - Disk space on cp4017 is CRITICAL: Timeout while attempting connection [12:02:25] PROBLEM - RAID on cp4010 is CRITICAL: Timeout while attempting connection [12:02:25] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [12:02:33] PROBLEM - Host lvs4004 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:33] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:33] PROBLEM - Host lvs4001 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:33] PROBLEM - Host lvs4002 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:34] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 37%, RTA = 75.69 ms [12:02:34] RECOVERY - Host cp4015 is UP: PING WARNING - Packet loss = 37%, RTA = 76.44 ms [12:02:56] RECOVERY - Host lvs4004 is UP: PING OK - Packet loss = 0%, RTA = 76.51 ms [12:03:03] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 63, down: 0, dormant: 0, excluded: 1, unused: 0 [12:03:03] RECOVERY - Host bast4001 is UP: PING OK - Packet loss = 0%, RTA = 76.50 ms [12:03:03] RECOVERY - Host lvs4002 is UP: PING OK - Packet loss = 0%, RTA = 76.63 ms [12:03:03] RECOVERY - Host lvs4003 is UP: PING OK - Packet loss = 0%, RTA = 75.30 ms [12:03:03] RECOVERY - Host lvs4001 is UP: PING OK - Packet loss = 0%, RTA = 76.39 ms [12:03:03] RECOVERY - Varnish HTTP mobile-frontend on cp4019 is OK: HTTP OK: HTTP/1.1 200 OK - 371 bytes in 0.152 second response time [12:05:15] PROBLEM - puppet last run on cp4003 is CRITICAL: Timeout while attempting connection [12:05:27] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [12:05:46] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [12:06:06] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 0.0 [12:06:08] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:06:08] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Puppet has 1 failures [12:06:08] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [12:06:09] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 119 seconds ago with 0 failures [12:06:09] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:07:05] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail 
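A note on the "puppet last run" checks that dominate this stretch: they are NRPE checks run on each host, and the agent's own state file is what makes messages like "last run 60 seconds ago with 0 failures" possible. A minimal sketch of the logic, assuming the stock Puppet 3.x summary location; the plugin actually deployed here may differ in detail:

    # Approximate logic of a "puppet last run" NRPE check: read the
    # agent's last-run summary, alert on staleness or failed resources.
    import sys
    import time
    import yaml

    SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
    MAX_AGE = 14400  # seconds; the "expected 14400" seen in alerts below

    with open(SUMMARY) as f:
        summary = yaml.safe_load(f)

    age = int(time.time() - summary["time"]["last_run"])
    failures = summary.get("events", {}).get("failure", 0)

    if age > MAX_AGE:
        print(f"CRITICAL: Puppet last ran {age} seconds ago, expected {MAX_AGE}")
        sys.exit(2)
    if failures:
        print(f"CRITICAL: Puppet has {failures} failures")
        sys.exit(2)
    print(f"OK: last run {age} seconds ago with 0 failures")

(The "Puppet is currently enabled" part of the real message comes from checking the agent's disable lockfile, omitted here.)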
[12:07:49] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [12:08:06] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: puppet fail [12:08:15] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:08:15] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures [12:09:05] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures [12:09:15] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [12:10:07] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [12:10:25] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:13:06] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail [12:13:15] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: puppet fail [12:13:15] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [12:15:25] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [12:15:26] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [12:15:26] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail [12:15:26] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:16:05] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:16:35] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [12:17:15] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:18:07] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:18:45] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:19:30] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:20:48] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [12:21:17] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:21:25] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 2.49366398305 [12:25:50] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:27:01] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.85878847458 [12:28:09] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [12:31:51] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 129 seconds ago with 0 failures [12:33:41] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:35:19] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [12:36:50] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Puppet has 1 failures [12:36:50] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently 
enabled, last run 23 seconds ago with 0 failures [12:38:30] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [12:38:39] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [12:40:29] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [12:41:29] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:46:00] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:47:19] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:49:19] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:51:23] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:55:19] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [12:55:20] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [12:55:49] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:56:51] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:58:11] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [12:59:02] PROBLEM - Host cr2-ulsfo is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:59:02] PROBLEM - Host cp4003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:59:03] PROBLEM - Host cp4008 is DOWN: CRITICAL - Plugin timed out after 15 seconds [12:59:09] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 75.73 ms [12:59:09] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 87.36 ms [12:59:41] RECOVERY - Host cr2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.09 ms [13:00:29] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:39] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [13:18:19] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:19:30] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures [13:20:19] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [13:24:22] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 1 failures [13:25:41] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [13:28:51] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [13:29:45] ACKNOWLEDGEMENT - RAID on ms-be2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) Filippo Giunchedi RT 8697 [13:30:24] ACKNOWLEDGEMENT - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi RT 8697 [13:32:51] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet last ran 407942 seconds ago, expected 14400 [13:33:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 [13:33:21] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:21] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% 
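The "Puppet last ran 407942 seconds ago, expected 14400" message for analytics1003 just above is the staleness arm of the same check, and the BetaLabs variant discussed earlier this morning does the equivalent by polling graphite for puppetagent.time_since_last_run. A sketch of that graphite-side check; the endpoint is an assumption, while the target pattern is the one quoted earlier in this log:

    # Poll graphite's render API for puppet freshness metrics and flag
    # stale hosts. Endpoint is assumed; the target pattern is from the
    # BetaLabs check discussed above.
    import json
    import urllib.request

    GRAPHITE = "http://graphite.example.org"  # assumed endpoint
    TARGET = "deployment-prep.*.puppetagent.time_since_last_run.value"
    THRESHOLD = 14400  # seconds

    url = f"{GRAPHITE}/render?target={TARGET}&from=-15min&format=json"
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    for s in series:
        # each datapoint is [value, timestamp]; take newest non-null value
        values = [v for v, _ in s["datapoints"] if v is not None]
        if values and values[-1] > THRESHOLD:
            print(f"CRITICAL: {s['target']} = {values[-1]:.0f}s since last run")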
[13:33:21] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [13:33:32] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:32] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:32] PROBLEM - Host cp4016 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:32] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:32] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:33] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:02] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 73.28 ms [13:34:02] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 74.11 ms [13:34:02] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 73.31 ms [13:34:02] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 73.42 ms [13:34:02] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 72.67 ms [13:34:02] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:34:11] RECOVERY - Host cp4016 is UP: PING OK - Packet loss = 0%, RTA = 72.36 ms [13:34:11] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 72.16 ms [13:34:12] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 72.35 ms [13:34:13] (03PS2) 10Alexandros Kosiaris: Reorganize backup roles [puppet] - 10https://gerrit.wikimedia.org/r/167771 [13:34:22] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [13:35:52] PROBLEM - puppet last run on elastic1015 is CRITICAL: CRITICAL: Puppet last ran 4044384 seconds ago, expected 14400 [13:37:51] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [13:38:52] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: puppet fail [13:39:13] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [13:39:22] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 2 failures [13:39:40] (03PS2) 10Ottomata: Require 2 ACKs from kafka brokers for mobile caches [puppet] - 10https://gerrit.wikimedia.org/r/167550 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [13:41:13] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:48:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "LGTM. Actually a +2 but giving it a -1 for blocking it while testing https://gerrit.wikimedia.org/r/#/c/167771/. 
I 'll +2 this and merge i" [puppet] - 10https://gerrit.wikimedia.org/r/167778 (owner: 10Filippo Giunchedi) [13:50:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:50:42] RECOVERY - puppet last run on elastic1015 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:54:12] (03PS3) 10Alexandros Kosiaris: Introduce LLDP facts [puppet] - 10https://gerrit.wikimedia.org/r/167644 [13:54:14] (03PS3) 10Alexandros Kosiaris: Introduce rack/rackrow facts based on LLDP facts [puppet] - 10https://gerrit.wikimedia.org/r/167645 [13:54:42] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [13:54:45] (03CR) 10Alexandros Kosiaris: Introduce LLDP facts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167644 (owner: 10Alexandros Kosiaris) [13:55:23] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [13:55:23] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 1 [13:55:42] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:56:23] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [13:56:23] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [13:56:40] (03PS9) 10Ottomata: Initial commit of Cassandra puppet module [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 [13:57:03] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [13:57:22] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:59:51] (03CR) 10Alexandros Kosiaris: "You are obviously correct Faidon, hence my comment at the start of this fact. Of the cases you mention, I am more concerned about rack/row" [puppet] - 10https://gerrit.wikimedia.org/r/167645 (owner: 10Alexandros Kosiaris) [14:10:29] (03PS1) 10Cscott: Update PediaPress endpoint; re-enable PediaPress in labs for testing. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/167806 (https://bugzilla.wikimedia.org/71675) [14:23:20] !log springle Synchronized wmf-config/db-eqiad.php: depool db1065 (duration: 00m 06s) [14:23:28] Logged the message, Master [14:24:35] (03PS2) 10Giuseppe Lavagetto: ocg: sanitize role::ocg::production, change alerts [puppet] - 10https://gerrit.wikimedia.org/r/167751 [14:28:02] !log set vm.dirty_writeback_centisecs = 200 (was 500) on analytics1021 [14:28:08] Logged the message, Master [14:31:02] (03CR) 10Giuseppe Lavagetto: [C: 032] ocg: sanitize role::ocg::production, change alerts [puppet] - 10https://gerrit.wikimedia.org/r/167751 (owner: 10Giuseppe Lavagetto) [14:35:12] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: puppet fail [14:35:36] <_joe_> ^^ mw [14:35:39] <_joe_> (me [14:36:36] (03PS10) 10Ottomata: Initial commit of Cassandra puppet module [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 [14:36:43] (03PS1) 10Giuseppe Lavagetto: Correct hieradata check [puppet] - 10https://gerrit.wikimedia.org/r/167809 [14:37:28] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Correct hieradata check [puppet] - 10https://gerrit.wikimedia.org/r/167809 (owner: 10Giuseppe Lavagetto) [14:41:33] I'll take SWAT today. [14:41:44] (03CR) 10Chad: "Anything else to fix here or can this be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/167745 (owner: 10Chad) [14:41:46] Are there any last-minute patch additions? There are currently no takers. [14:43:28] (03PS1) 10Giuseppe Lavagetto: ocg: use the correct variable [puppet] - 10https://gerrit.wikimedia.org/r/167810 [14:43:45] (03CR) 10Manybubbles: [C: 031] Improve error handling [puppet] - 10https://gerrit.wikimedia.org/r/167745 (owner: 10Chad) [14:43:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ocg: use the correct variable [puppet] - 10https://gerrit.wikimedia.org/r/167810 (owner: 10Giuseppe Lavagetto) [14:45:02] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [14:45:45] !log start catch up swiftrepl on ms-fe1003 for 'notcommons' containers [14:45:52] Logged the message, Master [14:46:41] (03PS6) 10Chad: Decom deployment-elastic0[1-3] from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 [14:47:03] (03PS2) 10Filippo Giunchedi: Improve error handling [puppet] - 10https://gerrit.wikimedia.org/r/167745 (owner: 10Chad) [14:47:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "LGTM, merging" [puppet] - 10https://gerrit.wikimedia.org/r/167745 (owner: 10Chad) [14:49:18] (03PS1) 10Giuseppe Lavagetto: ocg: yet another variable name change [puppet] - 10https://gerrit.wikimedia.org/r/167812 [14:49:33] (03CR) 10Giuseppe Lavagetto: [C: 032] ocg: yet another variable name change [puppet] - 10https://gerrit.wikimedia.org/r/167812 (owner: 10Giuseppe Lavagetto) [14:49:52] <_joe_> is zuul not working? [14:49:59] (03CR) 10Giuseppe Lavagetto: [V: 032] ocg: yet another variable name change [puppet] - 10https://gerrit.wikimedia.org/r/167812 (owner: 10Giuseppe Lavagetto) [14:50:00] oh really [14:50:13] _joe_: seems to work for me [14:50:25] <_joe_> hashar: ok only "too slow for my errors" [14:50:37] * anomie sees no patches for SWAT this morning [14:51:40] <^d> sssshhh you'll jinx it. [14:55:38] anomie: Yeah, I claimed it and then saw nobody was interested [14:56:26] <^d> We could make up something interesting. [14:57:35] <^d> godog: Any chance you could also merge https://gerrit.wikimedia.org/r/#/c/167010/ ? 
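For readers unfamiliar with the sysctl in the "!log set vm.dirty_writeback_centisecs = 200 (was 500)" entry above: it controls how often the kernel's flusher threads wake up to write dirty pages back to disk, in hundredths of a second, so the change moves analytics1021 from a 5-second to a 2-second writeback interval (smaller, more frequent bursts of IO). Applied via procfs it looks like this:

    # Equivalent of `sysctl -w vm.dirty_writeback_centisecs=200`.
    # Value is in centiseconds: 200 = wake flusher threads every 2s.
    # Needs root; the setting does not persist across reboots.
    PATH = "/proc/sys/vm/dirty_writeback_centisecs"

    def set_writeback_interval(centisecs):
        with open(PATH, "w") as f:
            f.write(str(centisecs))

    def get_writeback_interval():
        with open(PATH) as f:
            return int(f.read())

    set_writeback_interval(200)
    print("dirty_writeback_centisecs =", get_writeback_interval())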
[14:57:44] <^d> Trying to wrap up the migration -> trusty in beta today [15:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141021T1500). Please do the needful. [15:01:01] OK, no patches, I declare SWAT a brilliant success [15:01:17] Any latecomers feel free to ping me about things [15:02:09] <^d> marktraceur: Another flawless execution. How /do/ you do it? [15:02:21] ^d: Just in the right place at the right time, I guess. [15:02:31] * marktraceur starts his motorcycle and rides down the highway [15:04:02] ^d: sure [15:04:14] <^d> thx! [15:05:35] (03PS7) 10Filippo Giunchedi: Decom deployment-elastic0[1-3] from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 (owner: 10Chad) [15:05:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Decom deployment-elastic0[1-3] from beta [puppet] - 10https://gerrit.wikimedia.org/r/167010 (owner: 10Chad) [15:08:28] ^d: you're set :) I wasn't a reviewer for that one otherwise I'd usually look during European mornings [15:08:50] (03PS1) 10Alexandros Kosiaris: Backup /var/lib/carbon/whisper on graphite [puppet] - 10https://gerrit.wikimedia.org/r/167817 [15:08:52] <^d> Whoops forgot to add you :) [15:09:30] _joe_: if you're still working, want to give https://gerrit.wikimedia.org/r/#/c/167713/ a glance? (It's pretty brainless) [15:12:27] ^d: np! I think git review can add default reviewers too [15:12:53] <^d> i don't git review ;-) [15:15:22] <_joe_> andrewbogott: brainless patches tend to scare me [15:15:41] _joe_: great, then you're the perfect reviewer [15:16:54] <_joe_> andrewbogott: why did files/openstack/folsom/virtscripts/logstat.py get removed? [15:17:12] It didn't -- that's a weird figment of the gerrit diff. [15:17:23] It shows one file as being removed, and then another (identical file) being copied into the same place. [15:19:14] (03PS2) 10Ottomata: Add cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/167700 [15:20:46] (03CR) 10Ottomata: "Ok, this is ready for review!" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [15:23:06] (03PS1) 10Mforns: Add centralauth to puppet db_config.yaml [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/167821 [15:23:20] (03CR) 10Ottomata: "nodetool status" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [15:31:45] (03CR) 10Milimetric: "*bump* - any reason this shouldn't be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/167269 (owner: 10John F. Lewis) [15:33:46] (03PS1) 10Andrew Bogott: Reduce RAM overprovision ratio. [puppet] - 10https://gerrit.wikimedia.org/r/167822 [15:35:48] (03PS3) 10Andrew Bogott: Vanadium access for milimetric [puppet] - 10https://gerrit.wikimedia.org/r/167269 (owner: 10John F. Lewis) [15:37:41] (03CR) 10Andrew Bogott: [C: 032] Vanadium access for milimetric [puppet] - 10https://gerrit.wikimedia.org/r/167269 (owner: 10John F. Lewis) [15:38:46] (03CR) 10Andrew Bogott: [C: 032] Reduce RAM overprovision ratio.
[puppet] - 10https://gerrit.wikimedia.org/r/167822 (owner: 10Andrew Bogott) [15:38:49] thx :) [15:39:55] milimetric: it'll be ~30 minutes before it applies everywhere [15:56:53] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 71, down: 0, shutdown: 1 [16:00:22] (03PS1) 10Filippo Giunchedi: swiftrepl: sync object timestamp [software] - 10https://gerrit.wikimedia.org/r/167828 [16:12:33] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 42, down: 0, shutdown: 0 [16:16:58] hmph two BGP status recovery messages from icinga with no alert in between. that seems wrong. [16:17:04] +on cr2-ulsfo [16:17:14] oh one was cr2-eqiad, nevermind [16:18:23] jgage: I did set some silencing earlier for ulsfo but it should have been expired already, anyways that might be one of the reasons [16:18:49] ah ok [16:38:21] (03PS1) 10Chad: decom elastic04 from beta, last precise host [puppet] - 10https://gerrit.wikimedia.org/r/167832 [16:38:50] (03CR) 10Chad: [C: 04-1] "Not quite ready for this, building 08 right now. Should be ready in an hour or so." [puppet] - 10https://gerrit.wikimedia.org/r/167832 (owner: 10Chad) [16:44:22] ori: what's your bugzilla email? [16:44:40] ori: https://bugzilla.wikimedia.org/show_bug.cgi?id=72319 is for you [16:45:20] cscott: thanks! :) cc'd myself [16:45:23] ori@wikimedia.org fwiw [16:47:21] robh: https://gerrit.wikimedia.org/r/165867 is still waiting for +2. maybe you're still waiting for robla to approve in writing? [16:47:39] ori: huh, bugzilla didn't autocomplete that. wonder why. [16:48:30] hm, it completes for 'ori@', you just don't show up in the autocomplete for 'ori'. too many oris i guess. [16:54:29] <_joe_> !log stopping puppet on mw1114 in order to do some jemalloc debugging [16:54:37] Logged the message, Master [17:07:14] (03CR) 10Chad: [C: 031] "Ok, this is ready to go now." [puppet] - 10https://gerrit.wikimedia.org/r/167832 (owner: 10Chad) [17:10:39] <^d> godog: That's the last one today I promise ^ :) [17:13:48] ^d: haha okay I'll take a look [17:14:27] (03CR) 10Ori.livneh: "@paravoid: Well, Apt::Conf['no-recommends'] isn't introduced in this patch; this patch just ensures it's applied at the right time." [puppet] - 10https://gerrit.wikimedia.org/r/167020 (owner: 10Ori.livneh) [17:14:50] grumbl [17:15:53] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [17:16:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] decom elastic04 from beta, last precise host [puppet] - 10https://gerrit.wikimedia.org/r/167832 (owner: 10Chad) [17:16:24] ori: you're right, obviously... [17:16:44] paravoid: IMO we should apply it. it's about granularity and control. [17:16:49] paravoid: if you need it, declare it. [17:16:51] ok, in that case, I'm worried that this adds a dependency to a *lot* of resources and might increase the catalog size [17:16:55] so maybe we should do it in stage.pp? [17:16:56] <_joe_> paravoid: I was sure we weren't doing it as well [17:17:13] paravoid: sure, that works too. 
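jgage's observation at 16:16 above — two BGP RECOVERY notices with no CRITICAL between them in the channel — is easy to scan for mechanically; as godog notes, silencing or scheduled downtime is one way a PROBLEM never makes it to IRC. A small sketch that pairs PROBLEM and RECOVERY lines in a log in this very format and prints the orphan recoveries:

    # Find icinga RECOVERY notices with no preceding PROBLEM for the
    # same service, e.g. the cr2-ulsfo BGP case noticed above.
    import re
    import sys

    PAT = re.compile(r"\b(PROBLEM|RECOVERY) - (.+?) is (?:CRITICAL|WARNING|OK|UP|DOWN)")

    def orphan_recoveries(lines):
        open_problems, orphans = set(), []
        for line in lines:
            for kind, service in PAT.findall(line):
                if kind == "PROBLEM":
                    open_problems.add(service)
                elif service in open_problems:
                    open_problems.discard(service)
                else:
                    orphans.append(service)
        return orphans

    for svc in orphan_recoveries(sys.stdin):
        print("RECOVERY without preceding PROBLEM:", svc)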
[17:17:14] <_joe_> paravoid: apt already runs in the first stage [17:17:19] apt-get update does [17:17:28] <_joe_> oh [17:17:32] <_joe_> ok right [17:17:55] <_joe_> also, this is needed just once, so yes no reason to create a lot of dependencies [17:17:55] I'm not so sure we should do it yet [17:17:57] but let's try it [17:23:32] (03PS1) 10Ori.livneh: Add ::apt to stage => first [puppet] - 10https://gerrit.wikimedia.org/r/167835 [17:23:35] ^ paravoid [17:28:18] (03CR) 10Aaron Schulz: "It's used by upload stash for staging files during upload when they are private to the uploader and not published anywhere yet." [puppet] - 10https://gerrit.wikimedia.org/r/167310 (owner: 10Aaron Schulz) [17:43:14] (03PS4) 10QChris: Configure gerrit's its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/167524 [17:43:54] (03CR) 10QChris: Configure gerrit's its-phabricator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [17:49:31] :S There's that one plwiki user with an overly large watchlist that always fatals out [17:49:46] * hoo wonders whether we should contact them [17:49:50] maybe they want it cleared [17:51:28] * hoo would volunteer to do so [17:51:47] who? [17:51:47] because it's annoying to see that in fatal logs over and over [17:51:59] MatmaRex: I don't think I should disclose that [17:52:10] understandable [17:52:23] i'm a pl.wp editor/admin, i can help messaging/translating/whatever [17:53:11] That's not an issue according to the babel ;) [17:53:42] hah, okay [17:57:02] 125k+ items... that's a lot [17:57:08] but far from the highscore [17:58:35] (03PS1) 10Ori.livneh: Standardize declaration of nodejs package [puppet] - 10https://gerrit.wikimedia.org/r/167838 [18:00:04] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141021T1800). Please do the needful. [18:01:00] (03CR) 10Ori.livneh: [C: 032] Standardize declaration of nodejs package [puppet] - 10https://gerrit.wikimedia.org/r/167838 (owner: 10Ori.livneh) [18:02:17] (03PS3) 10Ori.livneh: mathoid: cleanup [puppet] - 10https://gerrit.wikimedia.org/r/167413 [18:03:37] (03PS1) 10Reedy: Non wikipedias to 1.25wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167841 [18:04:09] (03PS2) 10Dzahn: DHCP - remove Tampa public services subnet [puppet] - 10https://gerrit.wikimedia.org/r/167651 [18:05:05] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.25wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167841 (owner: 10Reedy) [18:05:12] (03Merged) 10jenkins-bot: Non wikipedias to 1.25wmf4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167841 (owner: 10Reedy) [18:07:42] (03CR) 10Dzahn: [C: 032] DHCP - remove Tampa public services subnet [puppet] - 10https://gerrit.wikimedia.org/r/167651 (owner: 10Dzahn) [18:07:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.25wmf4 [18:08:02] Logged the message, Master [18:10:42] (03PS2) 10Reedy: Update PediaPress endpoint; re-enable PediaPress in labs for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167806 (https://bugzilla.wikimedia.org/71675) (owner: 10Cscott) [18:11:17] (03CR) 10Reedy: [C: 032] Update PediaPress endpoint; re-enable PediaPress in labs for testing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167806 (https://bugzilla.wikimedia.org/71675) (owner: 10Cscott) [18:11:25] (03Merged) 10jenkins-bot: Update PediaPress endpoint; re-enable PediaPress in labs for testing. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/167806 (https://bugzilla.wikimedia.org/71675) (owner: 10Cscott) [18:11:50] (03PS2) 10Reedy: Enable uploads on Hungarian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167439 (https://bugzilla.wikimedia.org/72231) (owner: 10Nemo bis) [18:11:54] (03CR) 10Reedy: [C: 032] Enable uploads on Hungarian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167439 (https://bugzilla.wikimedia.org/72231) (owner: 10Nemo bis) [18:12:02] (03Merged) 10jenkins-bot: Enable uploads on Hungarian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167439 (https://bugzilla.wikimedia.org/72231) (owner: 10Nemo bis) [18:12:25] thanks [18:13:06] (03PS2) 10Dzahn: Labs: Moar verbose in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166221 (owner: 10coren) [18:13:12] Reedy -- if you're SWATting, I could use your help getting https://gerrit.wikimedia.org/r/167818 into production ASAP to fix https://bugzilla.wikimedia.org/72003 [18:13:37] I'm not exactly swating [18:13:44] But there's plenty of time in this deploy window [18:14:04] cscott: Which deployment branch? Or both? [18:14:09] (03CR) 10Dzahn: [C: 032] Labs: Moar verbose in firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166221 (owner: 10coren) [18:14:21] both, the settings have been broken since last week i think. [18:14:24] (03PS2) 10Reedy: Create "templateeditor" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167065 (https://bugzilla.wikimedia.org/72146) (owner: 10Calak) [18:14:43] that is, the collections extension which arlo got swatted last week fixed most things, but left the settings dialog broken. [18:15:25] (03CR) 10Reedy: [C: 032] Create "templateeditor" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167065 (https://bugzilla.wikimedia.org/72146) (owner: 10Calak) [18:15:35] (03Merged) 10jenkins-bot: Create "templateeditor" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167065 (https://bugzilla.wikimedia.org/72146) (owner: 10Calak) [18:16:16] (03PS2) 10Reedy: Rename project and project talk namespaces on mrwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167447 (https://bugzilla.wikimedia.org/71774) (owner: 10Glaisher) [18:16:19] (03CR) 10Reedy: [C: 032] Rename project and project talk namespaces on mrwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167447 (https://bugzilla.wikimedia.org/71774) (owner: 10Glaisher) [18:16:28] (03Merged) 10jenkins-bot: Rename project and project talk namespaces on mrwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167447 (https://bugzilla.wikimedia.org/71774) (owner: 10Glaisher) [18:16:52] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [18:17:15] (03PS2) 10Reedy: Disable new page patrol on fishbowl/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167555 (https://bugzilla.wikimedia.org/72239) (owner: 10Glaisher) [18:17:19] (03CR) 10Reedy: [C: 032] Disable new page patrol on fishbowl/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167555 (https://bugzilla.wikimedia.org/72239) (owner: 10Glaisher) [18:17:27] (03Merged) 10jenkins-bot: Disable new page patrol on fishbowl/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167555 (https://bugzilla.wikimedia.org/72239) (owner: 10Glaisher) [18:18:36] Reedy: just fyi, I have to go afk for a bit [18:19:10] ook [18:20:22] PROBLEM - puppet last run on ocg1002 is 
CRITICAL: CRITICAL: puppet fail [18:20:27] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 18s) [18:20:34] Logged the message, Master [18:21:12] !log reedy Synchronized database lists: (no message) (duration: 00m 26s) [18:21:17] Logged the message, Master [18:22:03] (03PS1) 10coren: Tool Labs: major cleanup of classes and roles [puppet] - 10https://gerrit.wikimedia.org/r/167852 [18:22:11] YuviPanda: ^^ [18:23:42] (03CR) 10jenkins-bot: [V: 04-1] Tool Labs: major cleanup of classes and roles [puppet] - 10https://gerrit.wikimedia.org/r/167852 (owner: 10coren) [18:23:51] (once the synaxies are fix't) [18:24:49] !log reedy Synchronized php-1.25wmf4/extensions/Collection/: (no message) (duration: 00m 14s) [18:24:54] Logged the message, Master [18:25:49] !log reedy Synchronized php-1.25wmf3/extensions/Collection/: (no message) (duration: 00m 13s) [18:25:53] Logged the message, Master [18:26:16] (03PS2) 10coren: Tool Labs: major cleanup of classes and roles [puppet] - 10https://gerrit.wikimedia.org/r/167852 [18:27:28] cscott: both done... [18:27:31] * hoo just mailed that plwiki user [18:27:40] Reedy: thanks! [18:27:51] is everything deployed as well? [18:27:59] ie, i should be able to test this now, right? [18:28:10] yup [18:28:12] well [18:28:18] at worst, it's cached JS [18:28:27] so bits might need a bit of kicking [18:29:49] <_joe_> cscott: can I assume you'll be looking after ocg health and occasional reboots from now on? [18:30:17] _joe_: yup, although help is always appreciated [18:30:41] i'm going to deploy a ocg config change once the train is done that should make the reboots unnecessary [18:30:44] * cscott crosses fingers [18:31:06] Jeff_Green pinged me yesterday with "hey--don't know if you saw RT 8674 re. /tmp filling up on ocg boxes?" [18:31:15] do you know anything about that? I don't have RT access. [18:31:19] hey [18:31:31] cscott: you're missing out on the RT fun! [18:31:45] the executive summary: /tmp filling up on node servers [18:31:49] that's what they always say just before i get subscribed to a new source of inbox cruft ;) [18:31:53] not /mnt/tmpfs, actually /tmp [18:32:01] yeah, that's superweird. [18:32:07] cscott: it's not a scam, it's a real winning of gifts! [18:32:32] It's a trap [18:32:34] it's all dvipdfmx* files [18:32:35] you guys are trying to make me into a real #opsen [18:32:36] * hoo hides [18:32:49] dvipdfmx* whoa. hm. [18:33:13] cscott: I feel the same way about bugzilla and mingle [18:33:18] i wonder if those 100% cpu latex renderer processes are actually tying up LaTeX or dvipdf somehow. [18:33:19] just start to use phab ? 
[18:33:38] the embargo to create new projects is over [18:33:38] all hail phab, all problems will be solved by phab, don't fear the phab [18:33:59] cscott: yea :) [18:34:13] the RT ticket will be imported anyways :p [18:34:16] (03PS5) 10Rush: Configure gerrit's its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [18:34:17] why don't we just put our trouble tickets in git [18:34:18] * cscott is not sure if he was being ironic or not [18:34:50] Jeff_Green: i tried to handle them on gerrit itself, but got called out [18:35:11] hahaha [18:35:38] (03CR) 10Rush: [C: 032] Configure gerrit's its-phabricator [puppet] - 10https://gerrit.wikimedia.org/r/167524 (owner: 10QChris) [18:37:52] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:38:03] cscott: imagine project "search" or something, where people from any team can just be added. i mean isn't project per service better than project per team [18:38:25] eh.. or "bugtracker per team" :p [18:38:29] gerrit 503s? [18:38:47] i'm very confused [18:38:49] indeed [18:38:55] gerrit 503 [18:39:14] Someone should warn ops. Oh. Wait. [18:39:20] Cloning into 'extensions/WikiEditor'... [18:39:20] fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/extensions/WikiEditor.git/': The requested URL returned error: 502 [18:39:33] Coren: well, i see chase just merged something that mentions gerrit [18:39:34] something broke just now [18:39:38] ... and, it's back from me. [18:39:41] chasemp: gerrit broke? [18:39:41] for* [18:39:56] qchris had me push a change [18:40:01] puppet restarts Gerrit on config change [18:40:03] should have been very momentary? [18:40:11] so that is working as attended :D [18:40:13] yep, it's back [18:40:17] what hashar said :) [18:40:17] I was about to log into it to check what was up; I guess it got scared into shape! [18:40:19] !ran puppet on gerrit which restarted service [18:40:22] !log puppet on gerrit which restarted service [18:40:29] Logged the message, Master [18:40:34] my git submodule update is running again as well. [18:40:38] chasemp: Between 40s and 1m; which is reasonable. [18:41:09] the other way would be schedule the downtime, but for a minute it is not worth the paperwork :D [18:41:34] i think we should hire a skywriting plane to announce downtime [18:41:55] wtf [18:41:56] lol [18:41:58] (03CR) 10coren: [C: 031] "As far as I can tell, except for a few cosmetic changes, this should be a noop on the actual instances." [puppet] - 10https://gerrit.wikimedia.org/r/167852 (owner: 10coren) [18:42:01] Reedy: around? [18:42:09] * hashar looksup skywriting [18:42:19] oh plane [18:43:01] it has to be high enough above SFO that I can see it from here [18:43:21] cscott: http://i.imgur.com/rI6o1rH.jpg [18:43:36] it's real, some comedian hired him for that [18:43:53] hahah [18:44:22] money well spend [18:44:24] (03PS1) 10Vogone: Add import sources for orwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167858 [18:44:33] oh for ... sake... gerrit [18:44:48] hoo: ? [18:44:53] and why on earth did nobody notices that our watchlists are totally screwed [18:44:59] aude: Look at you Wikidata watchlist [18:45:06] mine looks like RC :P [18:45:10] also on meta [18:45:12] ooooo [18:45:33] I think I know the cause [18:45:36] * hoo will revert [18:45:53] * aude has forgotten about deployments since we haven't have anything new in a while [18:46:17] :) [18:46:20] my watchlist looks ok to me [18:46:31] aude: mh... 
are you on the enhanced one? [18:46:33] what exactly is wrong? [18:46:35] no [18:46:40] * aude checks [18:46:43] (diff | hist) . . Wikidata:Requests for deletions‎; 17:57 . . (+341)‎ . . ‎George.Edward.C (talk | contribs | block)‎ (→‎Q2882215: requested deletion (RD)) [18:46:43] (diff | hist) . . Wikidata:Requests for deletions‎; 17:56 . . (+341)‎ . . ‎George.Edward.C (talk | contribs | block)‎ (→‎Q2863635: requested deletion (RD)) [18:46:53] mine basically has all changes to all pages on my watchlist [18:46:59] so that's quite a bit [18:47:07] nooo [18:47:14] i have enhanced on mediawiki.org [18:47:17] what [18:47:18] looks ok [18:47:30] wikidata has Wikidata:Requests for deletions‎; [18:47:37] hoo: for some reason a checkbox in the prefs got ticked [18:47:41] but probably am watching since i edited it [18:47:42] just uncheck it [18:47:48] and everything will be normal again [18:47:54] (03CR) 10Dzahn: "RoanKattouw: this patch set unfortunately seems doomed" [puppet] - 10https://gerrit.wikimedia.org/r/15561 (owner: 10Catrope) [18:48:00] Vogone: Yep... now tell every other user with the problem that [18:48:00] shouldn't be listed multiple times though [18:48:03] better than a fix? [18:48:21] ok, you're right :p [18:49:24] * aude has Expand watchlist to show all changes, not just the most recent checked [18:49:24] Ok, Thiemo was faster [18:50:06] Away for food [18:50:13] hoo: i don't see the problem [18:50:23] If no one objects (or does it till then), I'll push out the revert [18:50:25] aude: :S [18:50:31] But it can't be only me [18:50:32] i think he submitted it in rage [18:50:49] well, it's broken in some way [18:50:50] (03CR) 10Catrope: "Yeah I should probably just give up on it, it's not like I have a lot of time to work on this any more..." [puppet] - 10https://gerrit.wikimedia.org/r/15561 (owner: 10Catrope) [18:50:52] screenshot? [18:50:55] not sure which, but we can figure that later [18:50:58] (03Abandoned) 10Catrope: Clean up the mess that is SSL certificate installation [puppet] - 10https://gerrit.wikimedia.org/r/15561 (owner: 10Catrope) [18:51:23] * aude looks on commons [18:52:00] i don't see anything strange there [18:52:08] Reedy: all my stuff looks good, thanks. [18:52:19] it does have Expand watchlist to show all changes, not just the most recent checked [18:52:26] i don't know that i checked it [18:52:32] * aude doubts it [18:52:55] but probably has been that way for a while [18:53:22] for heaven's sake [18:53:31] that could be the issue? [18:53:31] why can't I screenshot one window only in KDE [18:53:37] on dewiki, it's not checked [18:53:39] * hoo|away has food getting cold right now :S [18:53:40] (03CR) 10Aaron Schulz: "Is this an undocumented feature? http://docs.openstack.org/api/openstack-object-storage/1.0/content/PUT_createOrReplaceObject__v1__account" [software] - 10https://gerrit.wikimedia.org/r/167828 (owner: 10Filippo Giunchedi) [18:54:05] wikivoyage has it checked [18:54:08] ah, got it [18:54:19] (03PS1) 10Dzahn: DHCP - remove Tampa Squid/LVS subnet [puppet] - 10https://gerrit.wikimedia.org/r/167862 [18:54:20] * aude certain i've not gone around and checked it [18:54:53] no, KDE...
I don't want to share my new screenshot on Facebook [18:55:26] heh [18:56:14] (03PS1) 10Ori.livneh: delete deployment::packages [puppet] - 10https://gerrit.wikimedia.org/r/167864 [18:56:24] aude: https://people.wikimedia.org/~hoo/Watchlist.png [18:56:45] so you have the option checked [18:56:49] i think that is a bug [18:56:58] the* [18:57:03] going to have my food real quick [18:57:08] before it gets entirely cold [18:57:11] ok [18:57:15] be back in a moment ;) [18:57:17] i'm going home soon, but that's the bug [18:57:29] not critical like, right now must fix [18:57:31] aude: If you can come up with something nicer [18:57:34] but should look into it [18:57:36] go ahead ;) [18:57:42] Would like to have it fixed today, though [18:57:49] i can't see how enhanced changes did this but who knows [18:58:38] (03PS1) 10Dzahn: DHCP - delete unused linux-host-entry files [puppet] - 10https://gerrit.wikimedia.org/r/167865 [18:58:52] (03PS1) 10Cscott: Re-enable PediaPress POD in production. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167866 (https://bugzilla.wikimedia.org/71675) [19:00:26] aude: https://gerrit.wikimedia.org/r/#/c/124292/4 didn't set extendwatchlist = 0 [19:02:00] (03PS1) 10Dzahn: Tampa decom - rm 152.80.208.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/167868 [19:03:23] (03CR) 10Dzahn: [C: 04-2] "oh! so we already re-use a "152" subnet in codfw! didn't know - 208.80.152.240/28 sandbox1-a-codfw" [dns] - 10https://gerrit.wikimedia.org/r/167868 (owner: 10Dzahn) [19:04:02] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: puppet fail [19:04:18] * aude was raging last week about some other setting being default for me [19:04:23] now i forget what it was [19:04:53] (03CR) 10Dzahn: "look at all the "; FIXME FIXME FIXME" :) how are we going forward here?" [dns] - 10https://gerrit.wikimedia.org/r/167868 (owner: 10Dzahn) [19:04:55] ah, search namespaces [19:06:32] i don't have extendwatchlist in my user options [19:07:48] (03PS2) 10Dzahn: Tampa decom - rm 152.80.208.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/167868 [19:08:33] (03PS3) 10Dzahn: Tampa decom - rm 152.80.208.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/167868 [19:10:56] i wonder if Flow is the culprit [19:12:16] !log Added its-phabricator plugin (d425a5ded909ee73df53d5e6d91d28014d0be375) into gerrit [19:12:23] Logged the message, Master [19:15:34] aude: any news? [19:15:50] it's in $wgDefaultUserOptions [19:16:19] (03PS1) 10Dzahn: remove Tampa networks from network.pp [puppet] - 10https://gerrit.wikimedia.org/r/167872 [19:17:12] aude: So... deploy the backport? [19:17:21] i don't know if that's the problem [19:17:27] it was supposed to be a no-op for production [19:17:38] so waiting to actually do that for one more week (or two) won't hurt [19:18:06] what else changed? [19:18:23] git log didn't come up with anything obvious for me, apart from that patch [19:19:04] maybe an extension is doing something?
[19:20:01] (03PS1) 10Dzahn: puppet::self - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/167873 [19:20:26] aude: :S [19:20:32] (03PS2) 10Dzahn: puppet::self::master - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/167873 [19:20:35] let me see whether I can reproduce locally [19:20:37] * aude is enabling flow (don't know just a guess) [19:20:56] i've seen them poking at related code but don't know what is merged [19:21:52] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:22:33] manybubbles: are you working with the old search cluster? [19:22:36] (03PS1) 10Dzahn: puppetmaster - remove pmtpa from allow_from [puppet] - 10https://gerrit.wikimedia.org/r/167878 [19:23:29] (03CR) 10Dzahn: [C: 032] puppetmaster - remove pmtpa from allow_from [puppet] - 10https://gerrit.wikimedia.org/r/167878 (owner: 10Dzahn) [19:24:02] aude: What did Thiemo find that made him submit the revert? [19:24:05] No details there :S [19:24:16] ok, I can reproduce locally [19:24:44] i think he generally opposes the change [19:24:52] and the revert fixes it for me [19:24:56] dunno why, but it works [19:25:00] if it's the patch i'm thinking of [19:25:06] (I copied the 3 lines of WMF config as well) [19:25:12] let me try [19:25:33] $wgDefaultUserOptions['watchdefault'] = 0; [19:25:33] $wgDefaultUserOptions['enotifwatchlistpages'] = 0; [19:25:34] $wgDefaultUserOptions['usenewrc'] = 0; [19:25:43] yep [19:25:44] (03PS1) 10Dzahn: puppetmaster - allow_from codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/167880 [19:25:56] the other option is [19:26:06] $wgDefaultUserOptions['extendwatchlist'] = 0; [19:26:14] although we ought to understand why [19:26:46] does it work if you see that manually and don't apply the revert? [19:26:51] * set [19:27:11] (03CR) 10Dzahn: "adding codfw instead - Change-Id: Ib26e7f9f83d4dd9" [puppet] - 10https://gerrit.wikimedia.org/r/167878 (owner: 10Dzahn) [19:27:31] (03CR) 10Dzahn: [C: 032] puppetmaster - allow_from codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/167880 (owner: 10Dzahn) [19:29:07] (03CR) 10Dzahn: [C: 032] "there is no more Tampa labs subnet to be used" [puppet] - 10https://gerrit.wikimedia.org/r/167873 (owner: 10Dzahn) [19:29:20] can't reproduce [19:29:31] What? Or at all? [19:29:47] not locally [19:29:49] * hoo scratches head [19:29:58] could be i did something wrong [19:30:02] the revert is already merged on master [19:30:13] so using master won't "work" to reproduce [19:30:19] oh [19:30:22] one sec [19:30:51] Hi all. I'm moving some work back from github to gerrit. Pushed all the gh commits back into gerrit OK, but now if I want to git review a new patch, it wants to submit everything from github as well. Is this something to fix locally, or cause for a gerrit cache clear, or something else entirely? [19:31:30] hoo: reproduced [19:31:36] 'extendwatchlist' => 1, [19:31:42] ejegg: Check your git remote -v [19:31:49] (03CR) 10Andrew Bogott: "I'd prefer to leave the switch logic in here so it's more obvious what to change when we add codfw" [puppet] - 10https://gerrit.wikimedia.org/r/167873 (owner: 10Dzahn) [19:31:54] aude: ok [19:31:55] https://gerrit.wikimedia.org/r/#/c/166719/1/includes/DefaultSettings.php [19:31:58] aaaaaaa [19:32:07] they changed the extendwatchlist *and* usenewrc [19:32:12] why? [19:32:23] Nemo_bis: ^ [19:32:31] did someone just update the AbuseFilter code which is running on ptwiki?
may be [19:32:45] Helder: Nope, no [19:32:45] hoo: 'gerrit' remote's pointing to gerrit, and .gitreview is using [gerrit] [19:32:52] no deploys there today ;) [19:32:53] that is what is intended? but for wmf we need extendwatchlist = 0 ? [19:33:05] ejegg: does it have an origin pointing to github maybe? [19:33:05] Reedy: is the mediawiki train done? [19:33:06] 'origin' is still github though [19:33:09] If so, delete it [19:33:11] doh "Make enhanced recent changes and extended watchlist default" [19:33:11] It seems the value of the variable added_lines is not a subset of new_wikitext anymore, and this is causing quite a few false positives [19:33:13] cool [19:33:14] * aude fails to read [19:33:23] (03PS2) 10Dzahn: DHCP - delete unused linux-host-entry files [puppet] - 10https://gerrit.wikimedia.org/r/167865 [19:33:37] i think it was quite intended but don't know if extended was discussed or decided on [19:33:44] Helder: Open a bug, please [19:34:10] (03CR) 10Dzahn: [C: 031] "one of them is empty, one is "sanger" :p" [puppet] - 10https://gerrit.wikimedia.org/r/167865 (owner: 10Dzahn) [19:34:26] Reedy, greg-g: i'd like to do a quick OCG deploy [19:35:11] aude: In that case, I'd say revert for now, decide later [19:35:18] making a patch [19:35:20] communication, ftw [19:35:21] for the config [19:35:24] mh [19:35:25] (03PS1) 10Aude: Set extendwatchlist = 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 [19:35:33] why not just deploy the revert? [19:36:17] (03PS2) 10Aude: Set extendwatchlist = 0 for wmf wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 [19:36:24] i don't want to touch that [19:36:29] hehe, I see [19:36:33] up to someone to decide [19:36:46] greg-g: hey :) I would like to quickly push two configuration things [19:36:47] for master vs. wmf etc [19:37:06] aude: yes it was discussed; and I have no idea why this conversation is on this channel [19:37:12] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=72%): [19:37:12] as much as i'm not ready for enhanced changes (with wikibase) [19:37:21] i think it's the right choice for master [19:37:24] given the consensus [19:37:31] and we'll get it resolved in wikibase [19:37:38] Nemo_bis: if this hits the WPs my watchlist will be pages long [19:37:40] that's not nic [19:37:42] * nice [19:37:44] Nemo_bis: thought it was a bug [19:37:49] (03PS1) 10Dzahn: nfs.pp - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/167885 [19:37:51] ++ [19:37:52] ah, you meant on Wikimedia [19:38:01] I don't care about that, Isarra did :) [19:38:06] a bit surprised, although i failed to notice the "extended watchlist" part of the commit message [19:38:11] * aude guilty [19:38:43] hoo: bug 72329 [19:38:43] we can probably change the default back to enable it but probably should communicate it some more, inho [19:38:46] imho* [19:38:53] hoo: thanks, that did it! [19:39:29] Helder: CC me, I'll hopefully look when I have more time [19:39:45] a little busy right now [19:40:11] (03PS3) 10Aude: Set extendwatchlist = 0 for wmf wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 [19:40:20] :S [19:40:36] alright, waaaay too late to be at the office [19:40:58] (03PS1) 10Dzahn: role/deployment - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/167888 [19:41:32] aude: Well, depends... :D [19:42:01] * aude hungry :) [19:42:28] yeah... too bad there's no food place near the office at this time [19:42:47] well, Mehringdamm...
but that's a bit [19:44:01] Mustafa's Gemueskebap? [19:44:07] Gemuese [19:44:12] mutante: Yeah... but the lines are too long there [19:44:16] * hoo never ate there [19:44:18] i understand, yea [19:44:31] but it was really good, and a nice website [19:44:55] it's nice but better for breakfast when the lines are shorter :P [19:44:56] Maybe I should go by when I'm in Berlin in November... less tourists then [19:45:01] (03PS1) 10Dzahn: role/labstools: -pmtpa, +codfw, eqiad -> default [puppet] - 10https://gerrit.wikimedia.org/r/167889 [19:45:14] back later probably [19:45:20] see you :) [19:46:18] (03CR) 10Hoo man: [C: 032] "Newly created wiki: Uncontroversial and needed for initial template imports." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167858 (owner: 10Vogone) [19:46:26] (03Merged) 10jenkins-bot: Add import sources for orwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167858 (owner: 10Vogone) [19:46:51] (03PS2) 10Dzahn: role/labstools: -pmtpa, +codfw, eqiad -> default [puppet] - 10https://gerrit.wikimedia.org/r/167889 [19:47:16] !log hoo Synchronized wmf-config/InitialiseSettings.php: Add import sources for orwikisource (duration: 00m 08s) [19:47:21] Vogone: ^ [19:47:23] Logged the message, Master [19:48:09] ty [19:49:24] Nemo_bis: So you didn't mean to roll out anything yet? [19:49:35] Can you comment that on either the revert or the configuration change, then? [19:49:40] (we can abandon the other) [19:49:47] whatever you favour [19:50:52] (03CR) 10Nemo bis: [C: 031] "Fine. (I don't have energy to communicate this too, to Wikimedia projects.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (owner: 10Aude) [19:51:11] Yeah I was afk for a moment [19:51:50] ok [19:51:58] Feel free to also revert Thiemo's revert [19:52:06] (03PS1) 10Dzahn: torrus - remove subtree for eqiad-pmtpa link [puppet] - 10https://gerrit.wikimedia.org/r/167892 [19:52:30] I thought he found issues with it, but probably just didn't like it [19:52:34] (he's not reachable atm) [19:54:33] RECOVERY - Disk space on ocg1002 is OK: DISK OK [19:57:13] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail [19:58:33] (03CR) 10Jforrester: [C: 031] "Put in the SWAT?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (owner: 10Aude) [19:59:50] does anyone here know anything about rdb1002.eqiad.wmnet, the redis server in beta/labs? [20:00:01] it seems to be down, or blackholed. [20:03:48] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: puppet fail [20:05:18] (03CR) 10Multichill: [C: 031] "I agree, also see https://commons.wikimedia.org/wiki/Commons:Village_pump#Watchlist_broken.3F" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (owner: 10Aude) [20:05:49] hm, i need puppet help [20:06:19] ocg on beta is down because apparently puppet just ran on deployment-pdf01 and overwrote the config with settings from production [20:06:21] cscott: rdb1002.eqiad.wmnet seems fine to me, what is the problem ? [20:06:44] rdb1002 is not reachable from deployment-pdf01 in labs, but that's expected i guess.
[20:06:51] ah yes, it is [20:06:57] (03PS4) 10Hoo man: Set extendwatchlist = 0 for wmf wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (https://bugzilla.wikimedia.org/72330) (owner: 10Aude) [20:06:59] :S [20:06:59] <_joe_> cscott: oh snap, I guess it's because of my changes [20:07:05] the problem is that deployment-pdf01 is trying to use rdb1002 in the first place, it should be using deployment-redis01.eqiad.wmflabs [20:07:06] Maybe we should pre-pone the deploy? [20:07:17] <_joe_> I thought you used role::ocg::test [20:07:22] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 30 MB (5% inode=99%): [20:07:23] <_joe_> which I didn't tough [20:07:26] <_joe_> (touch [20:07:36] _joe_: i don't know what we use ;) [20:07:45] _joe_: this was a puppet change? [20:07:57] _joe_: note that puppet is now failing on ocg1001 as well according to icinga [20:08:01] <_joe_> cscott: yes, I moved the ocg module to hiera [20:08:09] <_joe_> cscott: no that's unrelated [20:08:19] <_joe_> my change was done like 8 hours ago :) [20:08:22] well that's good I guess? [20:08:44] maybe puppet didn't run until recently? the timestamp on deployment-pdf01 is Oct 21 15:23 [20:09:00] and i tested rendering on beta a few hours ago [20:09:00] <_joe_> no, I do expect it to break in beta [20:09:04] <_joe_> but we can fix it [20:09:13] <_joe_> only, I'm doing other things right now [20:09:19] <_joe_> and it's already 10 pm [20:09:38] hm, ok. so long as you're sure that it's not going to affect production. [20:09:43] (03PS1) 10Dzahn: move racktables from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/167903 [20:09:52] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 9 below the confidence bounds [20:09:54] i was using beta to test a change before i deployed it to prod, but i guess i can be brave. [20:10:22] i wonder why beta didn't break earlier [20:10:26] oh! i know. [20:10:32] <_joe_> cscott: because rebase [20:10:37] because i just now restarted ocg on beta [20:10:59] the config file was rewritten hours ago, but the service wasn't restarted so it was still using the old correct config [20:11:01] <_joe_> oh ok [20:11:13] <_joe_> cscott: disable puppet and correct it by hand? [20:11:13] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 19 MB (3% inode=99%): [20:11:16] <_joe_> for now I mean [20:11:24] <_joe_> cscott: puppet agent --disable [20:11:29] i can do the latter, but i'm not sure how to do the 'disable puppet' part. 
[20:11:31] <_joe_> tomorrow I'll fix it, promised [20:11:41] <_joe_> cscott: as root ^^ [20:12:05] <_joe_> cscott: really sorry, I should always check beta :/ [20:12:07] ok, done on pdf01 (which is the only one that matters i think) [20:12:20] <_joe_> yes [20:12:28] (03PS3) 10Alexandros Kosiaris: Reorganize backup roles [puppet] - 10https://gerrit.wikimedia.org/r/167771 [20:12:35] <_joe_> cscott: long story short: we need to add a hiera file on beta [20:13:16] (03PS2) 10Dzahn: move racktables from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/167903 [20:13:20] (03PS5) 10Rush: redirector for bugzilla -> phabricator [puppet] - 10https://gerrit.wikimedia.org/r/166283 (owner: 1020after4) [20:13:44] (03CR) 10Rush: [C: 032 V: 032] "give it a whirl" [puppet] - 10https://gerrit.wikimedia.org/r/166283 (owner: 1020after4) [20:13:57] (03CR) 10jenkins-bot: [V: 04-1] move racktables from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/167903 (owner: 10Dzahn) [20:14:14] !log Deployed deployment/jobrunner to d426235e10edc682b532e7b4f2b02bb9414661ba [20:14:19] Logged the message, Master [20:14:47] (03CR) 10Andrew Bogott: "Looks right to me, overall. A few questions inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167852 (owner: 10coren) [20:14:52] (03CR) 10Alexandros Kosiaris: [C: 032] Reorganize backup roles [puppet] - 10https://gerrit.wikimedia.org/r/167771 (owner: 10Alexandros Kosiaris) [20:15:20] _joe_: where can i look up the redis password for deployment-redis01.eqiad.wmflabs ? [20:15:48] <_joe_> cscott: labs/private repo [20:16:02] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:16:04] the old config file that was on pdf01 is from jul 21, and i don't think it's 100% correct. password in particular is blank. [20:16:29] (03CR) 10Rillke: [C: 031] "I agree, it could be communicated first. OTH, it's not a serve issue, usually." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (https://bugzilla.wikimedia.org/72330) (owner: 10Aude) [20:16:32] it should also be a puppet-controlled config variable for ocg. but is that what you munged? [20:17:23] cscott: it would be puppet-controlled, labs/private would be the equivalent to production/private, just with public fake passwords [20:17:55] the same puppet code should get the actually secret password in prod and the labs one in labs [20:18:07] DRY ftw [20:20:28] i suspect the information i am looking for is available from the 'configure' link at https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000396.eqiad.wmflabs [20:20:42] but it is pretending not to be there [20:20:56] this is some weird permissions thing i thought we'd fixed before [20:21:24] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:22:40] <_joe_> !log installing new hhvm packages on the depooled server mw1189, for debugging [20:22:42] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet last ran 73106 seconds ago, expected 14400 [20:22:47] Logged the message, Master [20:23:00] <_joe_> cscott: no, I told you, labs/private repo [20:23:02] cscott: you are listed as project admin though, and that is the only thing that you should need to configure instances ..hmm [20:23:44] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:23:45] <_joe_> sorry, going to bed [20:23:49] <_joe_> see you tomorrow [20:24:07] robh: Hello! 
I'm here to make sure that JonKatz (who just joined) gets access to stat1003. :-) [20:24:24] "Private repo for the Labs project. This is not to be shared with anyone. Your checkout should stay private to yourself." this doesn't make sense to me because the whole thing is public [20:24:37] hahaha :D [20:24:38] robh: This is RT ticket 8647. [20:24:58] mutante: https://git.wikimedia.org/blob/labs%2Fprivate/cde292503932d8e5ffab32145c8d01d70a6b62b6/README [20:25:31] cscott: yes, that, for example go to https://gerrit.wikimedia.org/r/#/c/166569/1/modules/passwords/manifests/init.pp , expand the whole file.. look for "redis" , you got the password [20:26:32] mutante: thanks. i was just going through the process of git cloning so i could properly grep. [20:27:00] oddly (?) that's the password that was already in the old copy of the config file. so there's something else going on. [20:28:00] (03PS9) 10Chad: Adding tools for banning/unbanning an ES node [puppet] - 10https://gerrit.wikimedia.org/r/164617 [20:28:02] (03PS8) 10Chad: Another es-tool function: restart a node the fast & easy way [puppet] - 10https://gerrit.wikimedia.org/r/164401 [20:28:13] cscott: sure re ocg [20:28:16] hoo: what's up? [20:28:34] greg-g: Fuzz about https://gerrit.wikimedia.org/r/167884 [20:28:40] there's more to it [20:28:50] well, we killed all the other changes [20:29:15] * greg-g reads [20:29:16] cscott: YuviPanda might know [20:29:17] greg-g: https://people.wikimedia.org/~hoo/Watchlist.png that's how my meta watchlist looks atm [20:29:23] spot the problem :S [20:29:31] Deskana: has Jon posted their ssh key to officewiki? (out of interest) [20:29:39] JohnLewis: Yep. [20:29:59] hoo: the HHVM link, clearly. It's spam! https://www.mediawiki.org/wiki/Talk:HHVM [20:30:01] hoo: yeah, gotcha, do you know what change made this happen? [20:30:16] sure [20:30:38] https://gerrit.wikimedia.org/r/124302 [20:30:43] was reverted and re-reverted [20:30:49] so it's back in that state just like that [20:30:57] Deskana: Yeah - just gotta wait now. If I hadn't made mutante move the ticket to AR - I would have patched something up for robh to look at [20:31:22] eeek, 93 comments [20:31:59] greg-g: there is no need whatsoever to read them [20:32:05] heh [20:32:16] greg-g: I think we all agree to push that change [20:32:17] As the commit message states, the change in core was not meant to affect WMF, period. [20:32:22] question is now or during swat? [20:32:31] * hoo could do it now [20:32:36] mutante, _joe_: figured it out. thanks for the help. [20:32:39] hoo: this is a product-y question, I'm deferring to James_F [20:32:48] cscott: yw, great [20:32:55] greg-g: the time? [20:33:01] he +1ed the change [20:33:03] James_F: this is re the make all individual changes show up on watchlists, not just an entry per page [20:33:07] https://gerrit.wikimedia.org/r/#/c/124302/ [20:33:09] it was an old hostname for deployment-graphite, which was failing dns lookup and crashing [20:33:29] * Nemo_bis wonders about the +V in here [20:33:36] (whenever this is fixed, please reply at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical) ) [20:33:52] (last section) [20:34:03] wait... this escalated to en, yet?
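The pattern cscott, mutante, and _joe_ work through above (one manifest, environment-specific values) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual operations/puppet code: the class, parameter, and file names are made up for the example; only the shape of the hiera lookup (a beta hiera file overriding the production default, as _joe_ describes) and the private vs. labs/private passwords split (as mutante describes) is taken from the discussion.

```puppet
# Minimal sketch, assuming illustrative names throughout. The redis host
# comes from hiera, so beta can ship its own hiera file overriding the
# production default; the password comes from a passwords class that the
# private repo (real secret) or labs/private (public placeholder) defines.
class ocg(
    # a hypothetical beta hiera file would set e.g.
    #   ocg::redis_host: deployment-redis01.eqiad.wmflabs
    $redis_host = hiera('ocg::redis_host', 'rdb1002.eqiad.wmnet'),
) {
    include passwords::redis
    $redis_password = $passwords::redis::main_password

    # Both values land in the rendered service config; the manifest is
    # identical in production and labs ("DRY ftw", per mutante) because
    # only the private repo checked out on the puppetmaster differs.
    file { '/etc/ocg/mw-ocg-service.js':
        ensure  => file,
        content => template('ocg/mw-ocg-service.js.erb'),
    }
}
```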
:S [20:34:19] (03CR) 10Dzahn: "abandoning because it should already be in Change-Id: Ib75c18f75a8b8048" [puppet] - 10https://gerrit.wikimedia.org/r/167889 (owner: 10Dzahn) [20:34:31] (03Abandoned) 10Dzahn: role/labstools: -pmtpa, +codfw, eqiad -> default [puppet] - 10https://gerrit.wikimedia.org/r/167889 (owner: 10Dzahn) [20:34:33] * hoo would favor pushing the config. now-ish [20:35:47] did we resolve things yet? [20:35:56] hoo: given I'd be undoing something James explicitly +2'd, I'd feel better waiting until he responds [20:36:06] greg-g: He +1ed the config. patch [20:36:12] yeah [20:36:26] i saw all of the changes... the one to core and the config one [20:36:38] it failed to register with me that we were also changing the extendwatchlist setting [20:36:43] despite it being in the commit message [20:36:50] i am fine with the change in principle [20:36:52] aude: State is: Master is re-reverted now. And we decided to push your config. patch [20:36:59] ohhhhhhhh, ugh [20:37:01] but might confuse folks [20:37:02] RECOVERY - Disk space on lanthanum is OK: DISK OK [20:37:04] but not yet then [20:37:06] * when [20:37:07] until communicated some more [20:37:10] imho [20:37:11] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [20:37:18] !log lanthanum /var/lib/jenkins-slave/tmpfs went full again. cleared up a bunch of files [20:37:20] aude: Yes... people are complaining everywhere [20:37:26] Logged the message, Master [20:37:28] Even I thought it was a bug [20:37:35] right, ok, sorry, was just getting back from taking care of my son for a bit over lunch... yes... hoo deploy at will [20:37:57] * greg-g was confusing the patches [20:38:13] (03PS5) 10Hoo man: Set extendwatchlist = 0 for wmf wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (https://bugzilla.wikimedia.org/72330) (owner: 10Aude) [20:38:46] (03CR) 10Hoo man: [C: 032] "Per the +1s and Greg" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (https://bugzilla.wikimedia.org/72330) (owner: 10Aude) [20:39:26] (03Merged) 10jenkins-bot: Set extendwatchlist = 0 for wmf wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/167884 (https://bugzilla.wikimedia.org/72330) (owner: 10Aude) [20:40:12] !log hoo Synchronized wmf-config/CommonSettings.php: Set extendwatchlist = 0 (duration: 00m 08s) [20:40:13] (03PS1) 10Dzahn: fix IP address for radium [dns] - 10https://gerrit.wikimedia.org/r/167909 [20:40:15] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: puppet fail [20:40:17] Logged the message, Master [20:40:28] (03PS4) 10Andrew Bogott: We have APC, let's use it in $wgMainCacheType [puppet] - 10https://gerrit.wikimedia.org/r/119102 (owner: 10Nemo bis) [20:40:40] thanks ho [20:40:42] hoo [20:41:01] !log updated OCG to version 523c8123cd826c75240837c42aff6301032d8ff1 [20:41:05] Logged the message, Master [20:41:05] greg-g: I also earlier Wikimailed a user whose watchlist kept fataling out and spamming the fatal logs [20:41:10] Maybe he wants it cleared out [20:41:12] (03CR) 10Dzahn: [C: 032] fix IP address for radium [dns] - 10https://gerrit.wikimedia.org/r/167909 (owner: 10Dzahn) [20:41:20] * they want [20:41:35] hoo: if so, feel free to do that for them [20:41:39] (03CR) 10Andrew Bogott: [C: 032] We have APC, let's use it in $wgMainCacheType [puppet] - 10https://gerrit.wikimedia.org/r/119102 (owner: 10Nemo bis) [20:42:04] greg-g: Sure... I just sent the mail to let them know...
saw that in the fatal logs before, quite spammy :P [20:44:20] * greg-g nods [20:45:57] ok, also posted on enwiki and commons that stuff is fixed [20:46:13] ty [20:47:42] is it just me or the footer looks strange to me https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Meta-Wiki_watchlist..._a_bug.3F [20:47:47] inside the body content [20:47:56] is that intended? [20:48:10] errr #content [20:48:18] huh? Not sure what you mean? [20:48:51] This page was last modified on October 21, 2014 .... etc is all in the white area (#content) [20:49:00] normally it is below / outside [20:49:03] (03CR) 10Hashar: "Make sure APC is configured with shared memory to be used as a cache :D" [puppet] - 10https://gerrit.wikimedia.org/r/119102 (owner: 10Nemo bis) [20:49:06] it is below for me? [20:49:10] Are you on vector? [20:49:13] vector [20:49:25] maybe it's just a template or something missing a [20:49:30] * hoo purges [20:49:42] aude: Tidy should fix such [20:49:48] nope, still looks fine for me [20:49:49] other pages look fine [20:49:49] (03CR) 10Andrew Bogott: "Does this patch need to be reverted? I'm mostly taking Reedy's word for it that this is useful." [puppet] - 10https://gerrit.wikimedia.org/r/119102 (owner: 10Nemo bis) [20:50:18] http://snag.gy/YhWgA.jpg [20:51:05] no [20:51:16] how is it showing you hidden categories? [20:51:19] Is that a gadget? [20:51:31] morebots, you ok? [20:51:31] I am a logbot running on tools-exec-14. [20:51:31] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [20:51:31] To log a message, type !log . [20:51:43] preference [20:51:45] ah [20:51:49] maybe that is broken [20:52:12] looks ok on https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals) [20:52:19] * hoo tries [20:52:28] (03CR) 10Nemo bis: "I've not reassessed the status of it since I submitted this patch, IIRC back then I was told it would be ok" [puppet] - 10https://gerrit.wikimedia.org/r/119102 (owner: 10Nemo bis) [20:52:28] anyway, i assume *not* a bug and just some template or such [20:52:32] that's a nice setting to have anyway [20:52:41] aude: Then tidy has a bug :S [20:52:53] uhhhhhh [20:53:00] just one? [20:53:03] purge and now the proposals vp is broken [20:53:48] maybe a vp template [20:53:54] aude: :S Works for me with vector and hidden categories [20:54:03] vp? [20:54:03] oooo [20:54:08] village pump [20:54:12] oh [20:54:37] (03CR) 10coren: "Responses inline, changeset coming." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/167852 (owner: 10coren) [20:55:10] (03PS3) 10coren: Tool Labs: major cleanup of classes and roles [puppet] - 10https://gerrit.wikimedia.org/r/167852 [20:55:20] maybe caching [20:56:23] gwicke: what were the high t= values you saw in runJobs? Note that those are ms not sec. [20:58:42] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:58:55] AaronSchulz: I think I pasted one on the bug [20:59:20] t=14415 [20:59:51] if that's ms then it would be within the timeout [21:00:04] spagewmf, ebernhardson: Respected human, time to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141021T2100). Please do the needful. 
[21:00:36] just grepping with -P I don't see anything high enough [21:01:28] (03PS1) 10Chad: Remove "Copyright 2014" from Phabricator footer [puppet] - 10https://gerrit.wikimedia.org/r/167928 [21:01:29] AaronSchulz: the question would then be why the job isn't removed from the queue [21:02:45] is there a condition under which a 'good' job will legitimately remain in the queue after execution? [21:05:34] (03CR) 10Ori.livneh: [C: 032] delete deployment::packages [puppet] - 10https://gerrit.wikimedia.org/r/167864 (owner: 10Ori.livneh) [21:05:42] if it returned false by mistake...no normal way [21:08:04] AaronSchulz: would that still show up as 'good' ? [21:08:50] no [21:09:30] YuviPanda: is the split of dsh into ::dsh and ::dsh::config really warranted? Surely it's not the end of the world for a machine to get the dsh package when it doesn't strictly need it [21:11:39] * Coren still thinks jouncebot should be more tongue in cheek. [21:12:18] , Hear and Obey! Time to deploy X. Compliance is not optional. [21:12:19] :-) [21:12:20] AaronSchulz: maybe we should just clear the queue by disabling retries temporarily? [21:13:13] AaronSchulz: if the job doesn't actually time out, then we don't need to dump the page names [21:15:12] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:15:42] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/165635/ [21:18:02] (03PS1) 10Ori.livneh: Add 'motd' module from MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/167953 [21:18:06] Coren: ^ [21:20:16] (03CR) 10Andrew Bogott: Tool Labs: major cleanup of classes and roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167852 (owner: 10coren) [21:25:22] PROBLEM - Apache HTTP on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.034 second response time [21:25:31] PROBLEM - HHVM rendering on mw1189 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [21:27:41] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 67906 bytes in 6.112 second response time [21:28:22] <_joe_> that's me ^^ [21:28:28] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.257 second response time [21:29:17] !log ebernhardson Synchronized php-1.25wmf4/extensions/LiquidThreads/api/ApiQueryLQTThreads.php: Bump LQT in 1.25wmf4 (duration: 00m 04s) [21:29:23] Logged the message, Master [21:44:31] AaronSchulz, apergos, ori, YuviPanda, I'm about to delete the labs project 'Sartoris' which hasn't been used since March. Any last words? [21:44:37] (03PS2) 10Rush: Remove "Copyright 2014" from Phabricator footer [puppet] - 10https://gerrit.wikimedia.org/r/167928 (owner: 10Chad) [21:44:42] no [21:44:46] no [21:44:47] (03CR) 10Rush: [C: 032 V: 032] Remove "Copyright 2014" from Phabricator footer [puppet] - 10https://gerrit.wikimedia.org/r/167928 (owner: 10Chad) [21:45:02] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:47:39] * andrewbogott pulls the plug [21:47:46] AaronSchulz, ori: not even "burn in hell!"? :P [21:51:09] Jeff_Green: looks like the 'fundraising' project is also long-defunct. Any reason to preserve it? [21:57:57] andrewbogott: in labs?
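As an aside on the ::dsh / ::dsh::config split ori asks about above: the pattern being debated looks roughly like the sketch below. File paths and sources are assumptions for illustration, not the real module; the point of the split is that a node can take the group files without pulling in the package (or vice versa), which is exactly the granularity ori questions the value of.

```puppet
# Illustrative sketch of a package/config split. In a real module these
# two classes would live in separate manifest files (init.pp, config.pp).
class dsh {
    # one class owns only the package...
    package { 'dsh':
        ensure => present,
    }
}

class dsh::config {
    # ...the other owns only the group definitions under /etc/dsh
    file { '/etc/dsh/group':
        ensure  => directory,
        recurse => true,
        source  => 'puppet:///modules/dsh/group',
    }
}
```

ori's counterpoint, in design terms: folding both into one class costs at most an unneeded package on some hosts, which may be cheaper than the extra class plumbing.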
[21:58:05] Jeff_Green: yes [21:59:11] andrewbogott: probably [21:59:15] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:27] i certainly don't use it, don't remember if I was the one who created it [21:59:34] Jeff_Green: 'probably' what? Its instances have been shut down since March [21:59:56] two instances, 'precise-packager' and 'mingle' [22:00:12] mingle wasn't me [22:00:32] precise-packager-- I think I was actually working on pediapress madness there [22:00:41] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:00:42] who are the members? [22:00:57] sorry, I don't have my phone handy for the 2-factor auth [22:01:28] Jgreen Pgehres Mwalker Katie Horn Frdataproxy Frlabsjenkins Ssmith Awight [22:01:43] huh [22:02:03] you can certainly torch precise-packager [22:02:10] I can't speak for mingle [22:02:59] ok -- well, I've given notice many times of impending doom, so... [22:05:11] dunno. I work in vbox because I wanted firewalls, subnets, etc to test the stuff I need to test [22:05:32] so there's nothing in labs I care about [22:06:24] (03PS3) 10Dzahn: move racktables from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/167903 [22:06:32] the fr folks use mingle elsewhere, I don't know what that instance was about [22:06:32] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.068 second response time [22:06:51] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 67905 bytes in 0.145 second response time [22:06:59] * Jeff_Green heading out. have a good night [22:13:15] !log deleted unused labs projects: versionview, feeds, datadog, fundraising-awight, simplewiki, mediawiki-custom-de, fundraising, sartoris, wikibits, incubator, wikiversity-sandbox, data4all [22:13:24] Logged the message, Master [22:23:56] (03PS1) 10Rush: storage settings for phab db [puppet] - 10https://gerrit.wikimedia.org/r/167967 [22:25:27] (03PS4) 10Dzahn: move racktables from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/167903 [22:26:11] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [22:27:28] (03CR) 10Dzahn: "out of curiosity, so we are sure we want attachments in the db and not in the local file system? i clicked on the docs link you provided a" [puppet] - 10https://gerrit.wikimedia.org/r/167967 (owner: 10Rush) [22:28:58] (03CR) 10Rush: "definitely want them on filesystem but I have no ability to configure and test it at the moment and decision was made to mirror bugzilla a" [puppet] - 10https://gerrit.wikimedia.org/r/167967 (owner: 10Rush) [22:29:28] (03CR) 10Rush: "definitely want them on filesystem but I have no ability to configure and test it at the moment and decision was made to mirror bugzilla a" [puppet] - 10https://gerrit.wikimedia.org/r/167967 (owner: 10Rush) [22:29:48] (03CR) 10Dzahn: "gotcha, was more curious about it. 
thanks for explaining" [puppet] - 10https://gerrit.wikimedia.org/r/167967 (owner: 10Rush) [22:31:18] (03CR) 10Dzahn: [C: 031] storage settings for phab db [puppet] - 10https://gerrit.wikimedia.org/r/167967 (owner: 10Rush) [22:31:24] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:32] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:21] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.051 second response time [22:36:41] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 67905 bytes in 0.179 second response time [22:39:31] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: puppet fail [22:39:34] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:52] sorry, I'll make icinga shut up about mw1189 [22:42:20] !log manually finished global rename for BonumTV --> Karypal which failed due to page move timeout [22:42:25] Logged the message, Master [22:43:17] (03CR) 10Dzahn: [C: 04-1] "compiler says not yet http://puppet-compiler.wmflabs.org/434/change/167903/html/magnesium.wikimedia.org.html" [puppet] - 10https://gerrit.wikimedia.org/r/167903 (owner: 10Dzahn) [22:47:57] !log radium - installed OS, signing puppet cert requests, initial run ... [22:48:02] Logged the message, Master [22:59:13] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem, James_F: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141021T2300). [23:00:16] I'll do it [23:01:45] (03PS4) 10Gage: Enable GELF for MRAppManager part 2 [puppet] - 10https://gerrit.wikimedia.org/r/167044 [23:02:05] (03PS1) 10Dzahn: add IPv6 records for radium [dns] - 10https://gerrit.wikimedia.org/r/167980 [23:02:21] (03CR) 10Rush: [C: 032] storage settings for phab db [puppet] - 10https://gerrit.wikimedia.org/r/167967 (owner: 10Rush) [23:04:34] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /srv 18637 MB (3% inode=99%): [23:04:56] * legoktm is here [23:05:52] (03CR) 10Dzahn: [C: 032] add IPv6 records for radium [dns] - 10https://gerrit.wikimedia.org/r/167980 (owner: 10Dzahn) [23:07:26] !log maxsem Synchronized php-1.25wmf4/extensions/CentralAuth/: SWAT (duration: 00m 04s) [23:07:31] Logged the message, Master [23:07:35] legoktm, ^^^ [23:08:07] !log maxsem Synchronized php-1.25wmf4/extensions/MobileFrontend/: SWAT (duration: 00m 04s) [23:08:11] Logged the message, Master [23:08:38] MaxSem: thanks! [23:09:07] !log maxsem Synchronized php-1.25wmf3/extensions/MobileFrontend/: SWAT (duration: 00m 04s) [23:09:11] Logged the message, Master [23:09:14] ori, ^^^ [23:09:27] MaxSem: thanks [23:17:23] (03CR) 10Dzahn: "radium.wikimedia.org. 3600 IN AAAA 2620:0:861:1:208:80:154:39" [dns] - 10https://gerrit.wikimedia.org/r/167980 (owner: 10Dzahn) [23:21:46] !log maxsem Synchronized php-1.25wmf4/includes/PrefixSearch.php: https://gerrit.wikimedia.org/r/#/c/167982/ (duration: 00m 03s) [23:21:53] Logged the message, Master [23:30:34] robh: Nothing seems to be merging for MobileFrontend in gerrit: https://gerrit.wikimedia.org/r/#/c/167309/ https://gerrit.wikimedia.org/r/#/c/167983/ [23:30:54] any suggestions? 
hashar and Krinkle aren't around :( [23:31:57] hmm, well actually the first one finally merged [23:32:11] 20 minutes after the +2 [23:33:25] there's a big queue, probably, it's 4pm in the afternoon, 20 minutes isn't unheard of [23:33:59] both are merged now [23:34:08] see the graphs at: https://integration.wikimedia.org/zuul/ [23:34:54] "fast enough" is somewhat subjective, but when users have the impression that your application is stuck, it's a good indicator that something isn't fast enough [23:35:28] ori: fair, but 20 minutes isn't unreasonable if you compare our build times to other organizations [23:35:54] like who? [23:36:01] like fb and google [23:36:13] and mozilla [23:36:14] the relevant benchmark is the number of tests executed [23:36:28] google and mozilla are actually compiling [23:36:31] our code is interpreted [23:36:55] ori: and..... [23:36:55] at Badoo, they run tests in under 1 minute and tests generate a wall of shame of devs whose single tests run for longer than a second [23:37:16] greg-g: so you'd expect their build times to be longer, because they have something to build [23:37:37] greg-g: what do I look for on those charts? Is there anything that will give me an idea of when things are slow or backed up? [23:38:02] so, summary of my understanding: we recently increased build times due to the vendor work from bd808 and hashar, that's expected and A Good Thing. [23:38:14] why is that a good thing? [23:38:21] kaldari|2: that page will show you what's in progress (nothing right now, you missed the rush ;) ) [23:38:33] 99.99% of patches don't touch the dependency manifests [23:38:39] why would slowing down everyone be a good thing? [23:38:48] (03CR) 10coren: Tool Labs: major cleanup of classes and roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/167852 (owner: 10coren) [23:38:50] sure, take me the wrong way and argue that point. [23:39:01] sorry, i'm not trying to be difficult [23:39:05] how did you mean it? [23:39:18] the good thing is we're doing more tests to make sure a core change doesn't break an extension that is deployed [23:39:35] Our test infrastructure is horribly underpowered for the work we ask of it [23:39:42] now, can things be faster? of course, we're working on that [23:39:46] time to get the phab bot over here :) [23:39:53] but, the recent slowdown was expected, unfortunately [23:39:55] Like equivalent to a 2 year old laptop [23:40:04] heh, yeah, 8 gig ram, because.... :/ [23:40:31] greg-g: they're unacceptably slow, and 'expected' isn't really a good excuse for it, for the same reason 'that was expected to fatal' isn't [23:40:41] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: puppet fail [23:41:03] can i recommend one easy and simple improvement? [23:41:25] buy more hardware? [23:41:34] re-writing tests to not hit the database would be a good start [23:41:41] Need moar CPUS! [23:41:52] plus 1 to both bd808 and legoktm [23:41:54] bd808: it's an insult to the hardware to blame it for this [23:42:12] ori: did you know that our jenkins master has only 8 gigs of ram? :) [23:42:18] that's plenty [23:42:25] that's an insult, so saying "more ram" is reasonable.... [23:42:44] really, our tests shouldn't take this long to run [23:42:44] greg-g: can i suggest some cheaper and easier solutions to try in the interim?
[23:43:15] but I would blame the tests rather than the test runner [23:43:37] ori: always [23:43:39] it used to not be as bad because test groups were run in parallel, but now it's just one huge job [23:43:52] greg-g: first, remove *all* non-voting jobs [23:44:09] greg-g: nobody *ever* looks at them [23:44:17] with caveats for new jobs that people are getting used to and will turn voting, sure [23:44:31] ie: rubocop that's coming [23:44:41] start it out as non-voting, see how it acts, turn to voting [23:44:46] it's insane to me that we'd be adding more linters when the test infra is so overloaded [23:44:48] why the hell do we need rubocop? [23:44:57] non-voting should run post-merge [23:45:10] non-voting shouldn't run because they're not even a little bit useful [23:45:14] they confuse users [23:45:16] or that :P [23:45:25] heh [23:45:26] They don't confuse me, I just ignore them. :-) [23:46:01] they should run until we fixed everything so they can actually vote :) [23:47:06] https://phabricator.wikimedia.org/T794 [23:47:09] but if you try that people will say it's not important enough [23:47:09] I remember how a bug resulting in fatal was merged on one extension where tests weren't voting: "if Jenkins says tests succeeded, I can merge" [23:47:26] bike shed there ^^ :) [23:47:49] and if you say it, people will say it's bike shedding.. cya [23:48:00] mutante: sorry, debate [23:48:05] didn't mean to minimize [23:48:29] I see the point in linters, I was just being a proxy for ori -> phab so it's not lost right now, and I want others (eg antoine and timo) to see it [23:48:57] common reason why those can't be voting: [23:48:59] ./manifests/role/deployment.pp:198 ERROR two-space soft tabs not used [23:49:22] but if i make retab changes.... [23:49:29] then that's a debate as well [23:49:30] so, ok, "move all non-voting jobs to post-merge, except where not" [23:49:41] I'm fine with exceptions to rules [23:50:29] ori: (honestly) did you have any other suggestions? I'm/hashar's all ears. [23:50:46] (there was a "first" but no "second" ;) ) [23:51:40] greg-g: it used to not be as bad because test groups were run in parallel, but now it's just one huge job [23:51:45] that's a show-stopper [23:51:48] i'm not sure why that's ok [23:52:02] i reported it early on: https://bugzilla.wikimedia.org/show_bug.cgi?id=71029 [23:52:33] oh, I meant something else. [23:52:54] oh right, that's another bug [23:52:54] mw-core used to have a phpunit-api job, a phpunit-parser job, a phpunit-misc job, etc. [23:53:05] now it's just one huge phpunit job [23:53:58] ori: do you disagree with the conclusion to that bug you did link to? [23:54:13] greg-g: yes, we shouldn't do it; the price is too high [23:54:40] what would a low enough price be? (I'm going to ask antoine to do timing tests) [23:54:43] i don't know of a single bug that it has caught; the possibility of APIs changing while patches are in the process of getting merged is pretty unlikely [23:55:12] a timing test isn't right because the issue is the cap it places on parallelism [23:55:24] I think running extension jobs after a core patch is merged would be a much better idea [23:55:28] bbiab [23:55:41] legoktm: good point [23:55:41] yes, that's not a bad thought [23:56:12] the next suggestion is to make certain tests only run if a particular file is changed. i think we're doing this already in certain cases, but i'm not sure. we shouldn't run the vendor tests unless composer.json is touched.
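ori's last suggestion above maps onto Zuul's layout configuration (YAML in the Zuul version of this era), which supports restricting a job by the files a change touches. A rough sketch under stated assumptions; the job name is hypothetical, not taken from the real integration config:

```yaml
# Hypothetical layout.yaml fragment: the job only runs when the
# dependency manifest itself is touched, so ordinary patches skip it.
jobs:
  - name: mediawiki-vendor-integration
    files:
      - '^composer\.json$'
```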
[23:56:37] * greg-g is taking notes in phab on the previous one, go ahead, I'll catch up [23:56:58] (03CR) 10coren: [C: 032] "Cool beans, and Tools isn't the only one that might use this profitably." [puppet] - 10https://gerrit.wikimedia.org/r/167953 (owner: 10Ori.livneh) [23:56:59] the super-short-term solution for the mediawiki/vendor test is to just disable it completely right now because it's just ridiculously slow [23:58:35] https://phabricator.wikimedia.org/T795 - previous one [23:58:54] next: reuse results. this was pointed out ages ago; i'm not sure why it wasn't acted on. if i submit a change, jenkins runs the tests; if i +2, it runs them again. there is no reason for it to do that unless something was merged in the interim. [23:59:11] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures