[00:04:11] hearing no comments from anybody, and being unable to log in even on the mgmt console, I'm going to powercycle 1007 [00:04:57] !log powercycled elastic1007, inaccessible via ssh or mgmt console [00:05:14] Logged the message, Master [00:05:44] ERROR: Timeout while waiting for server to perform requested power action. [00:05:45] ugh [00:05:59] ^d, manybubbles ^^ [00:06:12] * apergos counts to 30 and will try again [00:06:15] <^d> Dammit. [00:06:21] <^d> We've been chatting about this. [00:06:29] <^d> elastic1007 keeps rebooting :\ [00:06:52] <^d> I wonder if we should have it leave the cluster for the time being. [00:07:16] you may not have a choice [00:07:17] ERROR: Timeout while waiting for server to perform requested power action. [00:07:18] again [00:07:39] !log ERROR: Timeout while waiting for server to perform requested power action. (from attempt to powercycle elastic1007) [00:07:54] Logged the message, Master [00:07:56] <^d> Let's depool from LVS so we stop trying to send it traffic. [00:09:11] racadm serveraction powerstatus [00:09:17] Server power status: OFF [00:09:19] bleah [00:10:28] trying powerup now [00:10:54] same error [00:12:35] nothing useful from gettracelog [00:12:47] ^d, are you doing the pybal bit or do you need me to? [00:12:52] s/need/want/ whatever [00:12:55] <^d> I can't. [00:13:03] ok lemme do that right away [00:14:43] ^d: I'm here now [00:14:46] was having dinner with family [00:14:48] what happened? [00:14:58] <^d> elastic1007 flipped out again. [00:14:59] I should log that [00:15:04] <^d> We just depooled from lvs. [00:15:11] <^d> apergos: I'd say yes :) [00:15:14] !log depooled elastic1007 in pybal [00:15:31] Logged the message, Master [00:15:52] k [00:16:19] * apergos looks at racadm getsysinfo in hopes that something there will be useful [00:16:31] cluster health is still red. [00:16:33] <^d> apergos: So the box is powered off at the moment? [00:16:47] might be because we were doing an index rebuild. it should be ok. 
doing more research [00:16:49] it's off and it ain't coming on [00:16:56] <^d> Mmk. [00:17:20] (CR) Legoktm: [C: 1] Add modify restricted to sysop at Wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/101605 (owner: John F. Lewis) [00:19:00] I could try a soft reset of the drac, hoping that's somehow implicated [00:19:08] not likely but here we go [00:24:02] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:02] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:02] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:02] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:02] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:20] there we go [00:24:31] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:35] no dice, same bad responses [00:24:41] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:51] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:51] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:24:51] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [00:25:02] <^d> Ok, so all the others are happy now at least. [00:25:33] ^d: yeah, I was still resharding when elastic1007 went nuts [00:25:49] <^d> *nod* [00:25:51] so one resharding operation got stuck (which is pretty OK) [00:25:57] manybubbles / ^d: can we perhaps make the check exit with 2 only on the master? [00:26:00] and another one failed to delete the old indexes [00:26:11] non-zero actually [00:26:13] no one available in the dc today, chris is travelling I think along with others through the 20th [00:26:19] <^d> Yeah [00:26:28] <^d> apergos: It won't come up again accidentally, right? It'll stay down? [00:26:43] it's powered off according to the drac [00:26:44] paravoid: we're not really happy about the state of the check either. I wish it didn't do one per server [00:26:50] <^d> apergos: Then let's just !log it and leave it for now. We can live until someone can physically check it. [00:27:14] anyway, the one that failed to delete as well as the hung reshard caused the red status [00:27:36] !log elastic1007 failed to come back up after several attempts, a soft drac reset and some more attempts. leaving it in power off state [00:27:48] from Elasticsearch's perspective, those shards were screwed. luckily, none of them were serving traffic yet/any more. [00:27:50] Logged the message, Master [00:28:00] well, that's why we have clusters and redundancy [00:28:06] yeah [00:28:17] so we never actually stopped serving any traffic [00:28:25] really? 
that's awesome [00:28:40] Cirrus retries when elasticsearch fails [00:28:42] rather, it should. [00:28:52] it would be nice to know if that actually worked [00:28:55] it assumes that those nodes go down from time to time for stuff like rolling restarts [00:28:59] yeah [00:29:07] I wonder if I have a log on that any more [00:30:06] if you do, where would it be? [00:30:11] (for my own reference) [00:30:20] https://bugzilla.wikimedia.org/show_bug.cgi?id=58557 [00:30:39] most obvious (to us, unfortunately) place is in Elastica's debug logs. [00:30:50] I'm not actually sure what the logging setup does with those [00:33:46] apergos: I looked in the logs and didn't see any mention of CirrusSearchSearcher, which should have been there if there were search failures [00:33:56] ^d: check my logic, please [00:34:13] gonna go help put my kids to bed now that everything has calmed down for a bit [00:34:17] <^d> Yeah, we should reattempt. [00:34:33] <^d> s/should/do/ [00:34:43] ok [00:35:13] <^d> I think we've got it set to 3 total attempts (1 initial + 2 retries). [00:35:17] well, that might be my last gasp for the night, I feel a sleep attack coming on [00:35:27] <^d> All the others were fine, so even if someone hit 1007 they'd hit a fine one next. [00:35:27] good night apergos [00:35:37] <^d> apergos: g'night, thanks for your help. [00:35:57] night, have a quieter rest of the day! [01:01:31] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [01:21:51] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 3: unassigned_shards: 0 [01:21:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 3: unassigned_shards: 0 [01:21:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 3: unassigned_shards: 0 [01:21:51] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 3: unassigned_shards: 0 [01:22:02] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:02] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:02] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:02] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:02] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:31] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:41] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:22:51] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:22:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:22:51] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:22:51] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:02] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:02] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:02] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:02] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:02] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:31] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:23:41] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:33:49] (PS1) Manybubbles: Update elasticsearch settings [operations/puppet] - https://gerrit.wikimedia.org/r/102048 [01:38:41] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:38:51] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:38:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:38:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:38:52] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:38:53] working on it [01:39:02] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:39:02] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:39:02] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:39:02] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:39:02] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 2: unassigned_shards: 0 [01:39:07] (PS1) Ryan Lane: Allow salt/puppet access from pmtpa and eqiad labs [operations/puppet] - https://gerrit.wikimedia.org/r/102052 [01:39:41] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:39:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:39:51] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:39:52] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:39:52] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:40:01] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:40:02] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:40:02] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:40:02] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:40:02] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1318: active_shards: 3938: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [01:40:30] (CR) Ryan Lane: [C: 2] Allow salt/puppet access from pmtpa and eqiad labs [operations/puppet] - https://gerrit.wikimedia.org/r/102052 (owner: Ryan Lane) [01:52:56] (PS2) Manybubbles: Update elasticsearch settings [operations/puppet] - https://gerrit.wikimedia.org/r/102048 [01:57:50] !log Reloading Zuul to deploy I80dafe3457c65 [01:58:08] Logged the message, Master [02:08:04] elastic1007 is still stuck in the elasticsearch cluster state as active. no shards are on it. I'm going to try bouncing a different elasticsearch server to see if it'll recover properly [02:09:28] well, the cluster state went red, so it'll log more failures.... [02:09:35] I'm not really sure what to do about this [02:10:17] well, it's back yellow so no warnings, but elastic1007 is still stuck.... 
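A minimal sketch of how the health-check alert text seen throughout this log could be parsed and turned into a Nagios-style exit code, which is the mechanism behind the "exit with 2" suggestion above. The function names `parse_health` and `exit_code` are hypothetical, and the status mapping is an assumption: yellow is treated as OK only because the log says yellow produced no warnings, and an unparseable line is treated as CRITICAL only because the log shows a connection failure reported that way. This is not the production check.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed mapping from cluster status to exit code (see lead-in).
STATUS_TO_EXIT = {"green": OK, "yellow": OK, "red": CRITICAL}


def parse_health(alert_text):
    """Return the 'key: value' fields that follow 'is running.' as a dict.

    Returns an empty dict when the marker is missing, for example for a
    'Could not connect to server' message.
    """
    _, marker, tail = alert_text.partition("is running.")
    if not marker:
        return {}
    tokens = tail.strip().split(": ")
    fields = dict(zip(tokens[::2], tokens[1::2]))
    # Convert purely numeric values (node and shard counts) to integers.
    return {k: int(v) if v.isdigit() else v for k, v in fields.items()}


def exit_code(fields):
    """Map parsed fields to a plugin exit code under the assumed mapping."""
    if not fields:
        return CRITICAL
    return STATUS_TO_EXIT.get(fields.get("status"), UNKNOWN)


if __name__ == "__main__":
    sample = (
        "CRITICAL - elasticsearch (production-search-eqiad) is running. "
        "status: red: timed_out: false: number_of_nodes: 12: "
        "number_of_data_nodes: 12: active_primary_shards: 1318: "
        "active_shards: 3938: relocating_shards: 6: "
        "initializing_shards: 2: unassigned_shards: 0"
    )
    parsed = parse_health(sample)
    print(parsed["status"], parsed["initializing_shards"], exit_code(parsed))
```

Restricting a non-zero exit to the master, as suggested earlier in the log, would need an extra input saying whether the local node is the elected master; that part is left out here because the log does not show how the check would learn it.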
[02:14:35] (PS1) Ryan Lane: Restart nscd and nslcd after reconfiguration [operations/puppet] - https://gerrit.wikimedia.org/r/102058 [02:16:11] !log LocalisationUpdate completed (1.23wmf6) at Tue Dec 17 02:16:11 UTC 2013 [02:16:27] Logged the message, Master [02:25:39] !log manually setting elastic1008 to be master eligible so we have three master eligible machines in production until the puppet code that would do that properly is merged [02:25:57] Logged the message, Master [02:26:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.140 [02:26:52] (CR) Ryan Lane: [C: 2] Restart nscd and nslcd after reconfiguration [operations/puppet] - https://gerrit.wikimedia.org/r/102058 (owner: Ryan Lane) [02:30:08] !log LocalisationUpdate completed (1.23wmf7) at Tue Dec 17 02:30:08 UTC 2013 [02:30:23] Logged the message, Master [02:49:29] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Dec 17 02:49:29 UTC 2013 [02:49:44] Logged the message, Master [02:57:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1317: active_shards: 3937: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [02:58:20] !log Reloading Zuul to deploy I4e18cb2dc2a7f4 [02:58:34] Logged the message, Master [03:01:02] (CR) Chad: [C: 1] Update elasticsearch settings [operations/puppet] - https://gerrit.wikimedia.org/r/102048 (owner: Manybubbles) [03:22:52] (PS3) Dzahn: create shell account for msyed [operations/puppet] - https://gerrit.wikimedia.org/r/102010 [03:23:28] (CR) jenkins-bot: [V: -1] create shell account for msyed [operations/puppet] - https://gerrit.wikimedia.org/r/102010 (owner: Dzahn) [03:25:58] (PS4) Dzahn: create shell account for msyed [operations/puppet] - https://gerrit.wikimedia.org/r/102010 [03:29:06] (PS1) Dzahn: capitalize resource reference [operations/puppet] - https://gerrit.wikimedia.org/r/102062 [03:29:07] (CR) jenkins-bot: [V: -1] capitalize resource reference [operations/puppet] - https://gerrit.wikimedia.org/r/102062 (owner: Dzahn) [03:29:59] (PS2) Dzahn: capitalize resource reference [operations/puppet] - https://gerrit.wikimedia.org/r/102062 [03:36:25] (PS1) Dzahn: fix another deprecation notice from jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/102063 [03:38:28] (CR) Dzahn: [C: 2] capitalize resource reference [operations/puppet] - https://gerrit.wikimedia.org/r/102062 (owner: Dzahn) [03:39:13] who forgot to merge? [03:39:15] +/etc/init.d/nslcd restart [03:39:15] +/etc/init.d/nscd restart [03:40:52] Ryan_Lane: :) [03:41:23] can this go? 
b/modules/labs_vmbuilder/files/firstboot.sh [03:42:32] well, saw the commit message, done [03:43:51] (CR) Dzahn: [C: 1] create shell account for msyed [operations/puppet] - https://gerrit.wikimedia.org/r/102010 (owner: Dzahn) [03:43:55] (PS5) Dzahn: create shell account for msyed [operations/puppet] - https://gerrit.wikimedia.org/r/102010 [03:44:46] (CR) Dzahn: [C: 2] fix another deprecation notice from jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/102063 (owner: Dzahn) [03:51:17] (CR) Dzahn: [C: -1] "did i miss something or why are you changing templates/mediawiki.gr when the rest is all about .pt" [operations/dns] - https://gerrit.wikimedia.org/r/101873 (owner: ArielGlenn) [03:52:21] (CR) Dzahn: "s/changing/adding" [operations/dns] - https://gerrit.wikimedia.org/r/101873 (owner: ArielGlenn) [03:58:25] (CR) Dzahn: [C: 1 V: 1] create shell account for msyed [operations/puppet] - https://gerrit.wikimedia.org/r/102010 (owner: Dzahn) [04:02:31] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [04:03:56] (CR) Dzahn: [C: 1 V: 1] "if no more general concerns here i'm about to self-merge with this result, we can always fix it and actual users are not switched over to " [operations/puppet] - https://gerrit.wikimedia.org/r/101174 (owner: Dzahn) [04:14:40] (PS6) Dzahn: role and module structure for ishmael [operations/puppet] - https://gerrit.wikimedia.org/r/96403 [04:15:15] (CR) jenkins-bot: [V: -1] role and module structure for ishmael [operations/puppet] - https://gerrit.wikimedia.org/r/96403 (owner: Dzahn) [04:16:21] (PS7) Dzahn: role and module structure for ishmael [operations/puppet] - https://gerrit.wikimedia.org/r/96403 [04:16:23] (CR) Tim Starling: "GPL is not a permissive license. It is probably the most restrictive OSS license." 
[operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793 (owner: Ori.livneh) [04:19:50] (CR) Dzahn: [C: 1] "ori-l: yes, your review made sense to me, thanks, followed advice and changed to have all the config values in role class, none in module" [operations/puppet] - https://gerrit.wikimedia.org/r/96403 (owner: Dzahn) [04:24:37] (CR) Ori.livneh: [C: 2] "Very nice." [operations/puppet] - https://gerrit.wikimedia.org/r/96403 (owner: Dzahn) [04:26:09] (CR) Dzahn: "adding Ariel, per MarkMonitor contact" [operations/apache-config] - https://gerrit.wikimedia.org/r/88705 (owner: Dzahn) [04:26:42] (CR) Dzahn: "any news? should we just do it or wait?" [operations/dns] - https://gerrit.wikimedia.org/r/86659 (owner: Dzahn) [04:28:39] (CR) Dzahn: "feel free to re-add me once there are major changes here, but removing myself for now because i don't think i have much input and need to " [operations/apache-config] - https://gerrit.wikimedia.org/r/65443 (owner: Dzahn) [04:29:40] lol, that may not work since i'm the owner [04:29:53] can i give a gerrit change to nobody ?:) [04:30:26] abandoning it would remove valuable discussion, but i dont think i'm going to be the one to solve it [04:31:23] abandoning it doesn't delete it [04:31:50] in RT i'd do "given to nobody in particular" [04:32:11] true [04:33:12] (CR) Ori.livneh: "OK, let's just leave it. Call it the AITSTP ("as if there's something to protect") license." [operations/software/mwprof] - https://gerrit.wikimedia.org/r/101793 (owner: Ori.livneh) [04:39:30] (Abandoned) Dzahn: add our networks as variables to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/88755 (owner: Dzahn) [04:41:34] (CR) Dzahn: "Gage, and others, added you as reviewers after abandoning it. 
That's just a way to say "FYI" and that we have a todo here" [operations/puppet] - https://gerrit.wikimedia.org/r/88755 (owner: Dzahn) [04:44:59] (Abandoned) Dzahn: disable wikistats update cron jobs. [operations/puppet] - https://gerrit.wikimedia.org/r/95317 (owner: Dzahn) [04:50:03] (CR) Dzahn: "come on, it's not a big deal and easily changed, let's just merge, add a second link or abandon:) cleanup of pending review before holiday" [operations/puppet] - https://gerrit.wikimedia.org/r/97190 (owner: Faidon Liambotis) [04:51:01] (PS2) Tim Starling: EasyTimeline support for private wikis via img_auth [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/101105 (owner: Aaron Schulz) [04:51:10] (CR) Tim Starling: [C: 2] EasyTimeline support for private wikis via img_auth [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/101105 (owner: Aaron Schulz) [04:51:11] (CR) jenkins-bot: [V: -1] EasyTimeline support for private wikis via img_auth [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/101105 (owner: Aaron Schulz) [04:52:31] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:59] !log tstarling synchronized wmf-config/InitialiseSettings.php [04:53:01] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [04:53:16] Logged the message, Master [05:00:17] (CR) Faidon Liambotis: [C: 2] Update elasticsearch settings [operations/puppet] - https://gerrit.wikimedia.org/r/102048 (owner: Manybubbles) [05:25:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:27:02] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [05:29:37] (CR) MZMcBride: "Suggestion: why not set up a protected wiki page on Meta-Wiki and just say that you'll sync it every year or so by hand? 
This is vaguely w" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97190 (owner: 10Faidon Liambotis) [05:30:06] mutante: I totally love that you're going through the backlog, BTW. [05:30:22] Just saying for that particular problem... a wiki page might be nice. [05:30:24] (03PS1) 10Ori.livneh: mwprof module: update package deps & Upstart job def [operations/puppet] - 10https://gerrit.wikimedia.org/r/102070 [05:30:38] I think the 404 page used to come from a wiki page, maybe. [05:31:23] maybe [05:31:30] Perhaps. [05:32:36] http://korma.wmflabs.org/browser/scr-companies-summary.html [05:32:52] (03CR) 10Ori.livneh: [C: 032] mwprof module: update package deps & Upstart job def [operations/puppet] - 10https://gerrit.wikimedia.org/r/102070 (owner: 10Ori.livneh) [05:34:56] Gloria: like I said before, I don't think Gerrit stats are especially revealing [05:36:00] ori-l: No, but we can count reviewers per changeset now, though. So we can find areas in need of review. [05:36:07] https://www.mediawiki.org/wiki/Gerrit/Reports/Open_changesets_by_owner is interesting. [05:36:09] In theory. [05:36:17] legoktm: Want to write another report? :-) [05:36:23] There's now a usable API. [05:36:24] ha, no. [05:36:27] Heh. [05:36:50] ori-l: Flow, flow, flow your boat. [05:36:53] "interesting" is what you say when you don't know what to say [05:37:05] what is interesting about it? [05:37:05] I say "interesting" often. [05:37:07] When I find things... [05:37:13] ori-l: I was subtly pointing out who has the most open changes. [05:37:22] Too subtle for me. [05:37:28] It has a nice curve to it. [05:37:39] oh look, I know that guy [05:37:45] 85, 62, 51, 41, 35 ... [05:38:01] Interestingly, legoktm has most of his changesets outside of core. [05:38:10] Interesting. [05:38:39] legoktm: I couldn't defend against that wontfix because I don't really understand the job queue or its internals. [05:38:58] There's definitely a few bugs lurking there. 
[05:39:37] Gloria: it's a valid wontfix, the real bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=42862 [05:39:48] * Gloria clicks. [05:39:57] Fun. [05:52:21] (03CR) 10Dzahn: "MZMcBride, sounds all fine with me, just actually wanted to get this particular change set decided one way or another. We can always make " [operations/puppet] - 10https://gerrit.wikimedia.org/r/97190 (owner: 10Faidon Liambotis) [05:54:15] mutante: i don't understand 5928/88705 [05:56:50] jeremyb: by this time i dont know anymore myself. besides "We will be redirecting from those domains to the Turkish language [05:56:54] Wikipedia. I'm looping in Doni, Daniel, and Rob to help me make that [05:56:57] happen. [05:57:02] and then it was stalled ..legal [05:57:10] when was that? [05:57:20] Wed, 9 Oct 2013 00:53:14 -0700 [05:57:56] correction .. Date: Oct 8, 2013 7:24 PM from when we were contacted :) [05:58:11] I was going to rebase that change and do the grammar tweak. [05:58:19] But... I still feel weird about linking to Twitter. [05:58:27] sounds good, thx [05:58:42] i dont have an opinion on twitter, really [05:58:51] I'm now leaning toward abandonment again. [05:58:54] how about identi.ca , does it still breathe [05:58:59] It's dead. [05:59:09] i just wanted to say you should (also) mention the -tech part [05:59:09] I suggested linking to a Twitter search. [05:59:13] ugh... not getting involved... [05:59:14] That seemed like a reasonable compromise to me. [05:59:14] not just for non-tech people [05:59:22] also tech people who understand will see it [05:59:34] That was another option, yeah. [05:59:40] Not using @wikimedia. [05:59:43] so why dont you just put several links [05:59:46] and be done with it [05:59:50] Neither handles are really useful. [05:59:56] Because more noise doesn't increase signal. ;-) [06:00:04] and sprinkles and a cherry on top? [06:00:07] Do we link to status.wikimedia.org? 
[06:00:17] meh, if the site is already down, i'd rather have 2 links than none [06:01:02] imho, just add it all and another [06:01:38] Heh. [06:05:21] heck, you could also add watchmouse and icinga [06:06:20] and links to meta telling you about the IRC channels and whatnot [06:06:32] links to jobs.. if you think you can fix this ..:) [06:06:50] Maybe. This is why I think a wiki page would work well. [06:07:23] maybe, yea, not sure about the sync once a year part [06:07:40] Or sync with a "pull request." [06:07:49] better [06:07:50] You could just put a link to [[mw:Gerrit]] on Meta-Wiki. [06:08:20] But it's not really a code review question, so using Gerrit for the content itself is... awkward. [06:08:49] whatever works to get the right people on it [06:15:25] Gloria: code, content, what's the difference, make the wiki people gerrit users, and the gerrit users wiki users [06:15:57] Gerrit is hostile to discussion. [06:16:04] Though some would argue MediaWiki is as well. :-) [06:21:40] (03CR) 10MZMcBride: "I'd personally prefer to see this split into two or more changes: one doing touch-ups and/or adding multiple channel support and the other" [operations/puppet] - 10https://gerrit.wikimedia.org/r/101816 (owner: 10Ori.livneh) [06:25:47] identi.ca ain't dead, I was just using it! [06:26:50] greg-g: please add SAL feed to that change immediately :) thx [06:28:30] well, if we actually used it, i guess :/ [06:29:11] greg-g: when twitter worked, identica worked as well, just forgot account and twitter feed is broken but dunno about identica [06:29:40] but there is an account [06:30:06] Gloria: [06:30:56] I thought it was dead. [06:31:02] Didn't they kill it? 
[06:31:21] no [06:31:38] switched the backend software from Status.net to pump.io [06:31:41] i thought if in doubt Brion knows:) [06:31:45] to save mucho mucho money [06:32:21] (one is your standard lamp app, the other is node.js, which is nicer on his server costs, apparently) [06:32:47] So I guess just the bridge broke? [06:32:52] yeah, new API [06:33:01] switched from having a large, active user-base to being Evan and Greg [06:33:05] which was kind of dumb, not making it backwards compat for a little bit :/ [06:33:07] yea, but i dont know if the bridge broke for both or just twitter [06:33:08] to save mucho mucho money [06:33:20] ori-l: many ways to skin a cat [06:33:54] Gloria in excelsis Deo [06:34:31] :) [06:34:35] https://bugzilla.wikimedia.org/show_bug.cgi?id=50109 [06:34:45] I guess it's not dead. [06:34:48] Except it uses different software. [06:34:53] With a different API. [06:34:56] And has no more new users. [06:35:05] s/new// [06:35:08] identi.ca, that is. [06:35:39] as i said, ask Brion, he worked for them [06:35:43] Gloria: but new users can join any other pump.io site, which is federated, so it doesn't matter much, he's forcing the federated bit (identi.ca was the main central site in the federated world of Status Net, which wasn't really ideal) [06:35:53] also, you just repeated what I said in here :) [06:36:21] mutante: He worked for StatusNet, not identi.ca or pump.io, I think. [06:36:38] Gloria: oh,ok, well, close enough [06:36:52] He worked for StatusNet, which was the 3 person company behind identi.ca and ancillary consulting services [06:37:15] I don't think pump.io would want the traffic, TBH. [06:37:19] It'd probably not respond well. [06:37:22] Twitter would be fine. [06:37:25] * greg-g sighs [06:37:26] That's a point for Twitter. [06:37:28] Louder. [06:37:30] now you're just making things up [06:37:44] I'm speculating. [06:37:56] "oh, they made some much better/more performant software? 
they probably don't want the traffic" [06:37:56] there are protests against Twitter on the streets of S.F. fwiw:) [06:38:00] Wikimedia wikis get a lot of traffic. If it gets redirected, that's not insignificant. [06:38:12] I don't know anything about pump.io. [06:38:20] I think Domas' blog is more popular. [06:38:22] do you have to know everything? [06:38:23] We could link there. [06:38:45] Nope. But this is, again, why I'd prefer a wiki page. [06:38:57] I'd prefer a static page, personally, but yeah [06:39:01] So that this kind of thing can be discussed, at least for the error page. morebots' behavior is... Bugzilla's scope, I guess. [06:40:51] Gloria: but it can be discussed, on gerrit, i dont see why there needs to be this big divide, since every wikitech/labs user is already gerrit, and the chance is high if you are into labs you care about things like what is in error pages and vice versa [06:42:03] I think Gerrit is hostile to discussion and MediaWiki is our content collaboration platform. [06:42:06] * Gloria shrugs. [06:43:11] Gloria: didn't someone write a "Gerrit should die in a fire!" page on mw.org a year or two ago? [06:43:39] Probably more than one. :-) [06:43:45] !log Increased number of Jenkins executors for 'integration-slave01' from 2 to 3 [06:43:46] Brion did an evaluation page somewhere. [06:43:55] If you search for "phabricator" you can probably find it. 
[06:44:00] Logged the message, Master [06:46:27] (03PS1) 10Springle: update coredb topology after pmtpa changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/102077 [06:47:25] stop hating particular software, make people aware they can get involved in any bit of the server config if they want to by using gerrit, it's special, nobody does that [06:47:51] (03CR) 10Springle: [C: 032] update coredb topology after pmtpa changes [operations/puppet] - 10https://gerrit.wikimedia.org/r/102077 (owner: 10Springle) [07:03:31] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [07:12:17] (03CR) 10Ori.livneh: [C: 031] "I didn't test, but the code looks sane." [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [07:24:11] TimStarling: why don't we set max-age or expires on upload.wm.o images? [07:24:31] why do we not, rather [07:36:21] (03Abandoned) 10Ori.livneh: Parametrize listen_socket in Gdash and Graphite modules [operations/puppet] - 10https://gerrit.wikimedia.org/r/101050 (owner: 10Ori.livneh) [08:02:23] (03PS2) 10ArielGlenn: add pt domains [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 [08:04:15] (03CR) 10ArielGlenn: "No, *I* missed something. I had no intention of doing anything to gr domains. Wtf... Also, I apparently must be blind because I looked at " [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 (owner: 10ArielGlenn) [08:15:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Two other files also need changes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [08:22:04] (03CR) 10Alexandros Kosiaris: "No we don't ;-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/88755 (owner: 10Dzahn) [08:31:18] (03CR) 10Alexandros Kosiaris: "I am not clear on what the merits of the definition-in-class approach vs the straight definition approach are. 
As those two are now, I pre" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83768 (owner: 10Dzahn) [08:41:15] akosiaris: thanks for reviewing the icinga changes; when you said to consolidate the private ipv4 network vars I wasn't sure what you meant, do you mean to add a new stanza to networks.pp for that? or just to swap in the subnet names as you listed them? [08:42:13] the latter [08:42:34] the former should be done too but not in this changeset [08:42:50] (also network.pp has some issues where private subnets are included in 2620:0:860::/46 under external_networks but that's another deal) [08:43:07] ok, good, because I was hoping to fix up network.pp anyways, it needs some love [08:44:57] network.pp indeed needs some love [08:45:42] ah the other thing is that the rule saddr localhost ACCEPT doesn't appear to be in the default rules, the iptables rule would be "ACCEPT all -- any any localhost anywhere", I saw a rule for lo but not for that [08:46:15] but, I do not have a lot of confidence in my eyes recently so... did I miss it? [08:51:36] there is this: # Accept all loopback traffic [08:51:36] interface lo ACCEPT; [08:52:04] main-input-default-drop.conf [08:52:43] it does not work by saddr but by interface [08:53:52] yes, and there was a separate lo rule in the icinga rules too [08:54:07] but unless you expect to see 127.0.0.1 on anything other than lo, it should be the same [08:55:36] I would expect it too, but someone thought they should be separate rules, maybe there was nota real reason for it [08:56:39] also I would like to remove the mgmt subnets, I can't imagine why they had them in there, but again... maybe they knew something I don't? 
[08:57:20] that was my question as well :-) [08:58:00] well maybe for the first take I'll just translate [08:58:11] and after that seems ok I can start removing things [08:59:10] I can think that we might want icinga to ping the mgmt network in order to catch problems but I don't see a reason for incoming firewall rules for that [08:59:29] me iether [08:59:35] *either [09:25:12] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:28:01] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [09:38:12] TimStarling: why don't we set max-age or expires on upload.wm.o images? [09:38:16] because they change [09:38:55] same reason we don't set public caching headers on text, it changes too [09:39:51] PROBLEM - RAID on virt5 is CRITICAL: CRITICAL: Active: 14, Working: 14, Failed: 1, Spare: 0 [09:48:35] half awake antoine says hi [09:49:18] (03PS6) 10Mattflaschen: Create "Draft" namespace on the English Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [10:04:31] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [10:07:56] (03CR) 10Hashar: [C: 032] "Thanks!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101881 (owner: 10BryanDavis) [10:08:05] (03Merged) 10jenkins-bot: Revert "beta: let sysops add/remove gwtoolset group" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101881 (owner: 10BryanDavis) [10:10:15] (03CR) 10Mattflaschen: "I think I addressed everything. Please review, but if it looks right, just +1 for now." 
(033 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [10:10:29] (03CR) 10Mattflaschen: [C: 04-1] Create "Draft" namespace on the English Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [10:20:45] (03PS4) 10Mark Bergsma: Remove Squid manifests and files [operations/puppet] - 10https://gerrit.wikimedia.org/r/101864 [10:20:46] (03PS4) 10Mark Bergsma: Remove role::cache::squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/101860 [10:20:47] (03PS4) 10Mark Bergsma: Remove -squid host lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101863 [10:20:48] (03PS4) 10Mark Bergsma: Remove all node definitions for Squids [operations/puppet] - 10https://gerrit.wikimedia.org/r/101856 [10:20:49] (03PS4) 10Mark Bergsma: Move all existing Squids to the decommission lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101857 [10:20:50] (03PS4) 10Mark Bergsma: Update Icinga cache groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/101859 [10:24:01] !log Depooled esams Squids in PyBal [10:24:15] Logged the message, Master [10:27:17] !log Depooled eqiad Squids in PyBal [10:27:32] Logged the message, Master [10:39:20] (03PS1) 10Mark Bergsma: Move esams text IPv6 termination to Varnish directly [operations/puppet] - 10https://gerrit.wikimedia.org/r/102103 [10:40:41] (03CR) 10Mark Bergsma: [C: 032] Move esams text IPv6 termination to Varnish directly [operations/puppet] - 10https://gerrit.wikimedia.org/r/102103 (owner: 10Mark Bergsma) [10:45:06] (03Abandoned) 10Mark Bergsma: Resolve esams IPv6 IP conflict [operations/puppet] - 10https://gerrit.wikimedia.org/r/101883 (owner: 10Mark Bergsma) [10:49:30] (03PS1) 10Mark Bergsma: Remove old esams text LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102106 [10:50:45] (03CR) 10Mark Bergsma: [C: 032] Remove old esams text LVS service IPs [operations/puppet] - 
10https://gerrit.wikimedia.org/r/102106 (owner: 10Mark Bergsma) [10:55:12] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [10:55:41] hmm [10:58:22] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 97.77 ms [10:59:07] (03PS1) 10QChris: Log X-Analytics header of response instead of request [operations/puppet] - 10https://gerrit.wikimedia.org/r/102107 [11:04:41] (03PS1) 10Mark Bergsma: Remove old esams text LVS service IPs for https [operations/puppet] - 10https://gerrit.wikimedia.org/r/102108 [11:05:18] (03CR) 10Mark Bergsma: [C: 032] Remove old esams text LVS service IPs for https [operations/puppet] - 10https://gerrit.wikimedia.org/r/102108 (owner: 10Mark Bergsma) [11:09:31] got the one page right away and the other with a pile o delay [11:13:48] akosiaris: any clue how packages in debian collab-maint get uploaded ? The phantomjs maintainer did some work for a new release but it is not showing up on the PTS page. [11:14:06] akosiaris: i guess he hasn't asked to get the new package uploaded [11:21:31] hashar: i have no idea [11:21:59] akosiaris: filled a bug about it :-] thanks anyway [11:22:08] :-) [11:25:37] apergos: so been puzzling about Parsoid upstart :D Basically Gabriel sent another upstart job configuration which would apparently be deployed using git-deploy instead of puppet :-D [11:25:47] it is not like we reinvent the well [11:26:03] yes, we were talking about it yesterday [11:26:22] https://gerrit.wikimedia.org/r/#/c/101900/ [11:26:25] I think the one to be used or not with git deploy will get replaced by yours [11:26:31] because we need log rotation [11:26:55] we need e.g. 
rotation if it's bigger than some reasonable cutoff [11:26:55] I am not sure why Gabriel wrote an alternative so :-] [11:27:33] but he was saying they want to use debs to deploy so I'm a bit confused about the whole deal [11:28:05] I think rgst (ryan-git-sartoris-trbuchet) - deploy is better suited for our needs [11:28:25] since it keeps multiple versions, permits instant rollback, and allows small tweaks quickly [11:28:48] but they seem to be sold on the debs approach because third parties will want it and eat your own dog food [11:29:36] anyways I would forge ahead and let me know when you want me to have a look [11:31:08] rgst (ryan-git-sartoris-trbuchet) [11:31:11] * hashar giggles [11:34:26] if it had a single vowel so that it could be pronoun cable it would be a nice name for the project :-D [11:37:43] (03CR) 10Hashar: "expect fork is needed as I have tested on labs" (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [11:38:13] apergos: I have replied to gabriel on my change, you might want to read our exchanges https://gerrit.wikimedia.org/r/#/c/99656/ :-D [11:38:21] I am sure I will [11:38:39] well if you see me talking about rgst-deploy, that's what it is :-P [11:38:50] lunch [11:38:51] brb [11:38:54] feel free to buy a vowel and stuff it in there [11:38:58] enjoy [11:41:28] (03PS1) 10Mark Bergsma: Add text-lb.eqiad.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/102112 [11:42:01] (03CR) 10Mark Bergsma: [C: 032] Add text-lb.eqiad.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/102112 (owner: 10Mark Bergsma) [11:42:41] oh wow fedora 20 [11:42:44] (03PS1) 10Mark Bergsma: Update reverse DNS for text-lb.eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/102115 [11:42:54] I should update my beta release install [11:47:46] (03PS1) 10Mark Bergsma: Remove old eqiad/esams LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102116 [11:47:58] hashar: that reply 
makes perfect sense, I said as much yesterday when chatting with g wicke [11:48:56] (03CR) 10Mark Bergsma: [C: 032] Remove old eqiad/esams LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102116 (owner: 10Mark Bergsma) [11:50:38] apergos: you can actually use + in puppet to concatenate strings ??? [11:50:50] I commented that I have no idea [11:51:01] I expect not :-D [11:51:20] but I need to break those loooong lines up somehow [11:51:40] so do I. I think it only works in the += form when overriding variables from inherited classes in child classes [11:51:50] if the network.pp changes get in before that, I can rewrite it with shorter lines :-D [11:51:56] ugh [11:52:16] * apergos avoids inheritance in puppet [12:00:08] (03PS1) 10Mark Bergsma: Remove esams ipv6 LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102117 [12:08:37] (03PS1) 10Mark Bergsma: Remove old esams service IPs from protoproxies [operations/puppet] - 10https://gerrit.wikimedia.org/r/102119 [12:10:14] (03CR) 10Mark Bergsma: [C: 032] Remove esams ipv6 LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102117 (owner: 10Mark Bergsma) [12:10:36] (03CR) 10Mark Bergsma: [C: 032] Remove old esams service IPs from protoproxies [operations/puppet] - 10https://gerrit.wikimedia.org/r/102119 (owner: 10Mark Bergsma) [12:13:29] (03PS1) 10Mark Bergsma: Remove esams videos.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/102120 [12:14:31] (03CR) 10Mark Bergsma: [C: 032] Remove esams videos.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/102120 (owner: 10Mark Bergsma) [12:23:08] (03PS1) 10Mark Bergsma: Use eqiad ($mw_primary) as secondary upstream instead of pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/102122 [12:23:41] (03CR) 10jenkins-bot: [V: 04-1] Use eqiad ($mw_primary) as secondary upstream instead of pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/102122 (owner: 10Mark Bergsma) 
[12:31:09] (03PS2) 10Mark Bergsma: Use eqiad ($mw_primary) as secondary upstream instead of pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/102122 [12:32:43] (03PS3) 10Mark Bergsma: Use eqiad ($mw_primary) as secondary upstream instead of pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/102122 [12:33:08] (03PS4) 10Mark Bergsma: Use eqiad ($mw_primary) as secondary upstream instead of pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/102122 [12:34:43] (03CR) 10Mark Bergsma: [C: 032] Use eqiad ($mw_primary) as secondary upstream instead of pmtpa [operations/puppet] - 10https://gerrit.wikimedia.org/r/102122 (owner: 10Mark Bergsma) [12:45:39] (03PS5) 10Mark Bergsma: Remove Squid manifests and files [operations/puppet] - 10https://gerrit.wikimedia.org/r/101864 [12:45:40] (03PS5) 10Mark Bergsma: Remove role::cache::squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/101860 [12:45:41] (03PS5) 10Mark Bergsma: Remove -squid host lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101863 [12:45:42] (03PS5) 10Mark Bergsma: Move all existing Squids to the decommission lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101857 [12:45:43] (03PS5) 10Mark Bergsma: Update Icinga cache groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/101859 [12:48:01] (03CR) 10Mark Bergsma: [C: 032] Move all existing Squids to the decommission lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101857 (owner: 10Mark Bergsma) [12:49:17] (03CR) 10Mark Bergsma: [C: 032] Remove role::cache::squid [operations/puppet] - 10https://gerrit.wikimedia.org/r/101860 (owner: 10Mark Bergsma) [12:51:22] apergos: could you perhaps work on decommissioning/shutting down all squids? 
[12:52:00] hmm our usual decommissioning crew is on the road all week aren't they [12:52:05] yes [12:52:09] everything that matches https://gerrit.wikimedia.org/r/#/c/101856/ [12:52:45] pmtpa squids can all get wiped and removed [12:52:54] esams squids I (amssq*) I want to reinstall as varnish [12:53:07] eqiad we don't know yet, they can join the rest of the pool [12:53:09] the cp10* are to be reclaimed I expect? [12:53:13] yes [12:53:16] like the others [12:53:22] great [12:53:46] so the amssq will be renamed? or you will keep the names? [12:53:51] same names [12:53:55] amssq47+ is varnish now [12:54:03] amssq32-46 will join them soon [12:54:11] but first need a clean reinstall with precise [12:54:15] they're lucid now [12:55:05] I'll get the ones I do to the power off stage [12:55:28] thanks [13:01:11] (03CR) 10Mark Bergsma: [C: 032] Remove -squid host lists [operations/puppet] - 10https://gerrit.wikimedia.org/r/101863 (owner: 10Mark Bergsma) [13:02:48] (03CR) 10Mark Bergsma: [C: 032] Remove Squid manifests and files [operations/puppet] - 10https://gerrit.wikimedia.org/r/101864 (owner: 10Mark Bergsma) [13:04:46] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [13:08:26] I'm about to comment out all the pmtpa squids in pybal (they are set to false now) [13:08:29] mark ^^ [13:08:40] fine [13:09:47] and upload, duh [13:10:11] (03PS1) 10Mark Bergsma: Remove pmtpa text LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102128 [13:13:59] (03CR) 10Mark Bergsma: [C: 032] Remove pmtpa text LVS service IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102128 (owner: 10Mark Bergsma) [13:16:54] (03PS1) 10Mark Bergsma: Remove pmtpa as bits backend for esams/ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/102130 [13:17:58] (03CR) 10Mark Bergsma: [C: 032] Remove pmtpa as bits backend for esams/ulsfo [operations/puppet] - 10https://gerrit.wikimedia.org/r/102130 (owner: 
10Mark Bergsma) [13:25:50] (03PS1) 10Mark Bergsma: Remove bits-lb.pmtpa LVS service monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/102133 [13:28:12] (03CR) 10Mark Bergsma: [C: 032] Remove bits-lb.pmtpa LVS service monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/102133 (owner: 10Mark Bergsma) [13:31:18] !log performing a rolling restart of the elasticsearch cluster to pick up new settings [13:31:34] Logged the message, Master [13:39:41] (03PS1) 10ArielGlenn: remove pmtpa squids from dsh, dhcp, netboot.conf, decom rt #6520 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102135 [13:44:11] (03CR) 10ArielGlenn: [C: 032] remove pmtpa squids from dsh, dhcp, netboot.conf, decom rt #6520 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102135 (owner: 10ArielGlenn) [13:52:46] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [13:52:47] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 10: number_of_data_nodes: 10: active_primary_shards: 1316: active_shards: 3575: relocating_shards: 5: initializing_shards: 50: unassigned_shards: 312 [13:52:56] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 10: number_of_data_nodes: 10: active_primary_shards: 1316: active_shards: 3603: relocating_shards: 5: initializing_shards: 50: unassigned_shards: 284 [13:52:57] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 10: number_of_data_nodes: 10: active_primary_shards: 1316: active_shards: 3603: relocating_shards: 5: initializing_shards: 50: unassigned_shards: 284 [13:54:47] ignore [13:54:52] hhhhhh [13:57:12] (03PS1) 10Faidon Liambotis: Switch donate-lb esams to the text-lb IP [operations/dns] - 10https://gerrit.wikimedia.org/r/102139 [13:58:13] (03CR) 10Faidon Liambotis: [C: 032] Switch donate-lb esams to the text-lb IP [operations/dns] - 10https://gerrit.wikimedia.org/r/102139 (owner: 10Faidon Liambotis) [14:02:46] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1317: active_shards: 3937: relocating_shards: 15: initializing_shards: 0: unassigned_shards: 0 [14:02:46] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1317: active_shards: 3937: relocating_shards: 15: initializing_shards: 0: unassigned_shards: 0 [14:02:56] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1317: active_shards: 3937: relocating_shards: 14: initializing_shards: 0: unassigned_shards: 0 [14:02:56] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 11: number_of_data_nodes: 11: active_primary_shards: 1317: active_shards: 3937: relocating_shards: 14: initializing_shards: 0: unassigned_shards: 0 [14:03:32] going to do a snooth restart for the remaining ones so not to cause this. 
either way shouldn't bother users but smoother is less likely to bother them [14:10:56] !log the first pass of index building for all wikisources is complete, starting the second pass [14:11:10] Logged the message, Master [14:14:15] (03PS1) 10Hashar: contint: install libsikuli-script-java for browser tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/102141 [14:14:45] manybubbles: mind merging in a new package for contint please ? https://gerrit.wikimedia.org/r/#/c/102141/ :-) [14:14:49] rrrrr [14:14:51] manybubbles: ignore me [14:15:00] another m name? [14:15:13] na I have pick the last human that talked in this channel .. [14:15:19] not a clever choice [14:15:39] I need a pick_random_ops() macro [14:15:53] good call [14:16:13] i need a pick_random_hashar() [14:16:21] i think i need a hashar for at least two things now :> [14:16:30] not even urgent anyway, that is for labs so I installed the missing package :D [14:17:00] I'm restarting elasticsearch nodes and wrapping christmas presents! [14:17:04] (03CR) 10Hashar: "The new package is for labs instance running browser tests. Already did the install there." [operations/puppet] - 10https://gerrit.wikimedia.org/r/102141 (owner: 10Hashar) [14:17:09] It is more relaxing then yesterday [14:17:09] hashar: what's the priority of fixing or working around these stupid hash mismatches? :( [14:17:26] MatmaRex: no idea, I am not even sure a bug has been filled [14:17:44] manybubbles: good to know :-) [14:18:20] manybubbles: regarding the conversation about search enhancement from yesterday meeting, I think Chad and You should just provide the backend. [14:18:35] manybubbles: leaving features / designer figure out how to revamp the search page or think about adding new features to it [14:18:41] hashar: of course it's been, i filed one [14:18:51] and you actually commented on it [14:18:58] or do you mean a gerrit bug? then probably not [14:19:13] can't you, like, retry once or twice if it fails? 
:( [14:19:18] I am sure elasticSearch has the potential for them to add new exciting features such as category intersection out of the box without the backend having to invest too much time on it. But that is my 0.01 centime de franc [14:19:27] (03PS1) 10Mark Bergsma: Temporarily restore the old wikimedia-lb IP (for donate) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102142 [14:19:38] MatmaRex: the git plugin in Jenkins doesn't offer a way to retry on error [14:19:47] MatmaRex: if I knew java, I could potentially add such a feature [14:20:27] D: [14:20:30] MatmaRex: anyway, the hash mismatch occurs only a few time on Zuul side, so I don't think it is that much of a high priority even if annoying. There must be some weird issue on Gerrit side. I don't have access to the log there though [14:20:33] (03CR) 10Mark Bergsma: [C: 032] Temporarily restore the old wikimedia-lb IP (for donate) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102142 (owner: 10Mark Bergsma) [14:20:44] hashar: it occurs about once a day [14:20:58] or at least i see it about once a day, once every two days [14:21:07] and then i explain wtf is going on to the nth person [14:21:12] i think i'm up to 6 of them now [14:21:42] why does gerrit suck so bad :( [14:22:41] I have other suckers: [14:22:42] 00:34:13.597 Tests: 7324, Assertions: 51423, Errors: 10, Skipped: 14. [14:22:42] :-D [14:26:30] MatmaRex: so you would have to point qchris and ^d to https://bugzilla.wikimedia.org/show_bug.cgi?id=57483 [14:26:46] and get them to look at the Gerrit log to figure out what is happening, might have some clues there [14:27:49] (03CR) 10ArielGlenn: [C: 032] contint: install libsikuli-script-java for browser tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/102141 (owner: 10Hashar) [14:28:18] hashar, MatmaRex: Are we hitting the problem more often these days? [14:28:46] Wait ... once a day? 
:-( [14:29:10] I think I'll have to request some gerrit time to fix that :-/ [14:29:14] yeah zuul reports a few of them everyday [14:29:17] on git remote update origin [14:29:18] maybe i exaggerrated a little bit [14:29:25] on Nov 20th that was Host key verification failed [14:29:29] oh, or maybe i didn't. :) [14:29:33] then mostly had hash mismatch [14:29:33] (03PS1) 10Faidon Liambotis: Kill wlm.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/102143 [14:29:46] i see these at least once every few days on failing changes [14:29:52] paravoid: don't we need wiki loves monument every year ? [14:29:58] and i saw it once locally (for the first time ever, i think) [14:30:00] (03CR) 10Faidon Liambotis: [C: 032] Kill wlm.wikimedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/102143 (owner: 10Faidon Liambotis) [14:30:22] qchris: there's a bug, search for 'hash mismatch' [14:30:42] on zuul side it happened 1 or 2 times per day [14:30:49] (on a side note, what are you up to these days if not gerrit? :) ) [14:30:51] bug gbeing https://bugzilla.wikimedia.org/show_bug.cgi?id=57483 [14:30:53] hashar: the dns change points to the apache change which involves mobile people as well [14:31:05] Mhmm. Ok. I'll wait for demon and discuss with him. [14:31:34] Let's see whether or not he wants us to fix it now, or wait for the gerrit 2.8 upgrade and see if it disappears :-) [14:33:43] ah the bug was a dupe of https://bugzilla.wikimedia.org/show_bug.cgi?id=53895 :D [14:34:04] qchris: might be related to some issue in the ssh daemon embed in Gerrit I guess [14:34:08] (03PS1) 10Mark Bergsma: Decommission bits.pmtpa servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/102144 [14:34:11] Jssh or something? [14:34:31] Mina ssh ... 
that's an awful beast :-( [14:34:39] :-D [14:34:56] should migrate Zuul to use the REST API [14:35:07] :-P [14:37:01] (03CR) 10Mark Bergsma: [C: 032] Decommission bits.pmtpa servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/102144 (owner: 10Mark Bergsma) [14:41:57] (03PS1) 10Mark Bergsma: Remove pmtpa from geodns [operations/dns] - 10https://gerrit.wikimedia.org/r/102145 [14:42:00] :D [14:43:40] wooooo [14:44:03] :-) [14:47:58] 13 years of history going /dev/null [14:55:47] (03PS1) 10Matanya: alphabetic order [operations/dns] - 10https://gerrit.wikimedia.org/r/102147 [14:57:54] (03PS2) 10Matanya: alphabetic order [operations/dns] - 10https://gerrit.wikimedia.org/r/102147 [14:58:17] (03CR) 10Faidon Liambotis: [C: 032] Remove pmtpa from geodns [operations/dns] - 10https://gerrit.wikimedia.org/r/102145 (owner: 10Mark Bergsma) [15:00:00] !log Jenkins manually configured mediawiki-core-phpunit-misc job to be runnable concurrently [15:00:12] " << WHAT PART ABOUT THIS IS SO HARD TO UNDERSTAND?" LOL @ templates/wikimedia.org [15:00:17] Logged the message, Master [15:01:55] (03CR) 10Faidon Liambotis: [C: 032] Update reverse DNS for text-lb.eqiad [operations/dns] - 10https://gerrit.wikimedia.org/r/102115 (owner: 10Mark Bergsma) [15:02:04] hey [15:02:05] not that one [15:02:28] it was a dependency [15:02:39] i guess, but it's incorrect [15:02:45] oh well [15:02:52] I haven't deployed yet, I can revert :) [15:03:03] nah [15:03:14] I guess we can change everything to point at text-lb.eqiad too [15:03:21] except that text-lb. 
didn't exist yet until a few hours ago [15:03:30] reverse DNS for the lb [15:03:32] who cares :) [15:06:24] (03PS1) 10Mark Bergsma: Remove ssl1-4 node entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/102148 [15:06:45] paravoid: in your spare time (no such thing, i know) please review and if applicable merged my patch above [15:08:01] for the weekend queue [15:08:02] looks good, although videos should probably go, possibly others (but these can follow in a separate commit) [15:08:19] but the commit message needs to be a bit better I think [15:08:53] e.g. "Sort wikimedia.org servers alphabetically" [15:09:12] mark: no weekend queue anymore [15:09:17] :P [15:10:19] (03PS3) 10Matanya: Sort wikimedia.org servers alphabetically [operations/dns] - 10https://gerrit.wikimedia.org/r/102147 [15:10:33] here you go paravoid ^ [15:14:14] apergos: don't forget decommissioning.pp :) [15:14:26] I will do my best to forget it [15:14:41] it is useless in the way we decommission things now (but just for you, I'll add them at the end :-P) [15:14:52] ah [15:15:01] that's right [15:15:28] anyway [15:15:29] if it is, maybe we should get rid of it? [15:15:33] not yet [15:15:38] rob is on that [15:15:40] sq67-70 and ssl1-4 can be decommed now too [15:15:44] rob ain't on that [15:16:02] yes, he is [15:19:28] (03PS1) 10Mark Bergsma: Remove LVS service 'text' (Squid) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102150 [15:19:41] (03PS4) 10Faidon Liambotis: Sort wikimedia.org servers alphabetically [operations/dns] - 10https://gerrit.wikimedia.org/r/102147 (owner: 10Matanya) [15:19:48] (03CR) 10Faidon Liambotis: [C: 032] Sort wikimedia.org records alphabetically [operations/dns] - 10https://gerrit.wikimedia.org/r/102147 (owner: 10Matanya) [15:20:21] (03CR) 10Mark Bergsma: [C: 032] Remove ssl1-4 node entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/102148 (owner: 10Mark Bergsma) [15:30:02] dr0ptp4kt: around? 
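On the earlier question of whether the failing `git remote update origin` could simply be retried once or twice: hashar notes that the Jenkins git plugin offers no retry option, so any retry would have to live in a wrapper around the command. The following is only a minimal sketch of that idea, not how Zuul or Jenkins actually invoke git. The function name `run_with_retry`, its parameters, and the back-off policy are all assumptions made for illustration.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=1.0, runner=subprocess.run):
    """Run cmd and retry while it exits non-zero.

    Hypothetical helper: `attempts`, `delay` and `runner` are illustrative
    parameters, not part of any Zuul or Jenkins API. Returns the last
    CompletedProcess, whether it succeeded or not.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Linear back-off before the next try.
            time.sleep(delay * attempt)
    return result


# Example call (not executed here):
# run_with_retry(["git", "remote", "update", "origin"], attempts=3)
```

A wrapper like this only hides the symptom. The hash mismatch itself would still need someone with access to the Gerrit logs, as discussed above.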
[15:30:11] paravoid, yes [15:30:17] hey [15:30:18] have 5'? [15:30:29] looking into merging the zero patch [15:31:00] anyone dealing with hardware in WMF ? [15:31:06] average: what does that mean? [15:31:27] paravoid, yeah. how can i help? [15:31:28] if anyone is doing assembly of machines and things related to hardware [15:31:35] the hardware takes care of itself [15:31:42] paravoid: ^^ [15:31:46] sure there are people doing that average [15:31:59] average: can you ask what you really want to know instead of meta-questions? :) [15:32:41] dr0ptp4kt: so first of all there's 2-3 different changes all merged into that commit aiui [15:33:01] paravoid: this is http://meta.stackoverflow.com/questions/66377/what-is-the-xy-problem/ [15:34:13] (03CR) 10Mark Bergsma: [C: 032] Remove LVS service 'text' (Squid) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102150 (owner: 10Mark Bergsma) [15:34:35] (03PS1) 10ArielGlenn: remove mgmt and prod ips for pmtpa squids, decom rt #6520 [operations/dns] - 10https://gerrit.wikimedia.org/r/102157 [15:34:59] and with appreciation to ops: http://xkcd.com/705/ [15:35:18] paravoid, yeah. (1) is grabbing the proxies json file. (2) is doing the netmapper and doing fun things with it. [15:36:12] and i guess (3) could be small tweaks to regexes [15:36:29] removing \.wikipedia.org and putting it on the top [15:36:44] \.wikipedia\. even [15:37:09] (03PS5) 10Ottomata: Using custom ganglia module instead of Logster. 
[operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 [15:37:31] I don't see the point of tag_direct tbh :) [15:37:37] we're unsetting X-F-B at the bottom anyway [15:37:52] paravoid, looking for thumbs up on ^^ :D [15:38:08] (03PS1) 10Mark Bergsma: Remove reference to LVS 'text' IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102158 [15:39:18] (03CR) 10Dzahn: "dsh groups are still used by sync-apache" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [15:39:54] (03CR) 10Dzahn: "yes, dsh_groups the check seems utterly unused. shall drop" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [15:41:04] (03CR) 10Mark Bergsma: [C: 032 V: 032] Remove reference to LVS 'text' IPs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102158 (owner: 10Mark Bergsma) [15:44:33] paravoid, i see what you mean about tag_carrier. that's a vestige of the time when we were going to send the x-f-b header in. should we just get rid of the tag_carrier function and put the concrete set req.http.X-CS2 = ... calls in the two places whree tag_carrier is invoked instead? 
[15:44:51] already on it [15:44:55] I'm actually really impressed I can do a rolling restart of the elasticsearch cluster while serving traffic and doing bulk indexing and it all just works [15:46:31] (03CR) 10ArielGlenn: [C: 032] remove mgmt and prod ips for pmtpa squids, decom rt #6520 [operations/dns] - 10https://gerrit.wikimedia.org/r/102157 (owner: 10ArielGlenn) [15:49:47] (03PS19) 10Faidon Liambotis: Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [15:50:07] dr0ptp4kt: if you have a moment, do a final review of PS19 and I'll deploy immediately [15:50:18] paravoid, looking [15:51:00] (03PS1) 10Mark Bergsma: Rename LVS service 'text-varnish' to 'text' [operations/puppet] - 10https://gerrit.wikimedia.org/r/102159 [15:51:58] (03CR) 10Dr0ptp4kt: [C: 031] Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [15:52:05] ^paravoid [15:58:05] (03CR) 10Faidon Liambotis: [C: 032] Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [15:58:14] (03CR) 10Faidon Liambotis: [V: 032] Handle proxies for Wikipedia Zero [operations/puppet] - 10https://gerrit.wikimedia.org/r/88261 (owner: 10Dr0ptp4kt) [15:58:24] (03PS1) 10ArielGlenn: pmtpa squids to decommissioning.pp, decommed rt #6520 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102161 [15:58:25] did jenkins go to bed? [15:58:34] seems so [15:59:58] (03CR) 10Mark Bergsma: [C: 032 V: 032] Rename LVS service 'text-varnish' to 'text' [operations/puppet] - 10https://gerrit.wikimedia.org/r/102159 (owner: 10Mark Bergsma) [16:00:54] dr0ptp4kt: deployed; fully in effect in 30mins [16:00:56] * apergos eyes ^d [16:01:07] <^d> Oh man, what'd I do now?!? 
[16:01:13] heh [16:01:19] paravoid, thx, will set a reminder to exercise it a little [16:01:22] it's not what you do it's who you know [16:01:24] (in prod) [16:01:34] (03CR) 10Andrew Bogott: "https://gerrit.wikimedia.org/r/#/c/102155/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/100221 (owner: 10Andrew Bogott) [16:02:17] <^d> apergos: I know lots of peeps. What's up? [16:02:22] I think it's likely a hashar issue actually, but on gallium and lanthanum we have a downgrade of some packages which puppet doesn't like, it's set off by php5-mysql [16:02:37] (absent -> present) [16:03:02] libapache2-mod-php5 php-pear php5-cli php5-common php5-curl php5-dbg blah blah, a whole pile that would be downgraded [16:03:33] <^d> So, I know we've got that new php in apt...I'm guessing this is related? [16:03:34] apergos: ah yeah [16:03:36] expected [16:03:44] <^d> Yay, hashar saves the day :) [16:03:49] apergos: I think akosiaris manually installed an update of php packages on jenkins boxes [16:03:56] * hashar sends a coffee & donuts to ^d [16:04:17] beta got upgrade as well iirc [16:04:21] and testwiki [16:04:50] mark: paravoid: looking at jenkins [16:05:22] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [16:10:04] !log jenkins in a bad mood for some unknown reason :( [16:10:22] Logged the message, Master [16:10:48] (03PS1) 10Mark Bergsma: Rename role::cache::varnish::text/upload to role::cache::text/upload [operations/puppet] - 10https://gerrit.wikimedia.org/r/102164 [16:17:09] !log Jenkins web service threads are all hang busy waiting for a long request to achieve. 
Caused by myself :/ [16:17:26] Logged the message, Master [16:22:22] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Tue 17 Dec 2013 01:21:21 PM UTC [16:27:36] !log restarting Jenkins (stuck) [16:27:51] Logged the message, Master [16:30:54] RobH: I can't ping labnet1001 -- do I need to do something to get DNS working? (And, does it have an OS already?) [16:33:19] (03CR) 10jenkins-bot: [V: 04-1] pmtpa squids to decommissioning.pp, decommed rt #6520 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102161 (owner: 10ArielGlenn) [16:34:23] well that was a failed fail [16:34:31] * apergos tries a rebase :-P [16:34:37] (03PS2) 10ArielGlenn: pmtpa squids to decommissioning.pp, decommed rt #6520 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102161 [16:34:46] greg-g: what time would you like me to be around for the gwtoolset deploy onto commons? [16:34:49] I think Jenkins is rejecting everything right now... [16:35:37] dan-nl: we should hopefully get to it around 22:00 UTC, honestly, pending any issues with the regular mediawiki rollout. cc Reedy [16:36:44] * apergos waits...  [16:38:59] andrewbogott: uhhh, i have no idea dude [16:39:17] i'd have to check into it, and im about to go get breakfast, sorry ;] [16:39:30] been in datacenter rfp mode (in seattle today) [16:40:04] "dns needs update to put in entry for mgmt and production" <- you wrote [16:40:11] but, not urgent, have breakfast! [16:40:22] (03CR) 10jenkins-bot: [V: 04-1] Rename role::cache::varnish::text/upload to role::cache::text/upload [operations/puppet] - 10https://gerrit.wikimedia.org/r/102164 (owner: 10Mark Bergsma) [16:42:36] uhh [16:42:42] how did you get the OS installed with no dns? [16:43:41] breakfast delayed for 10 minutes [16:44:59] andrewbogott: its in the bios right now that i can see. [16:45:29] Did you install a new OS or was there already one there? 
(If the former, I am not sure how you did without DNS, if the latter then its old and needs replacement anyhow) [16:45:36] I haven't touched it [16:45:46] oh, then it has no os. [16:45:51] it needs full setup [16:46:31] OK. I haven't done that but am interested in learning. That's something you start from mgmt right? [16:47:10] Yep, so lets see.... first thing is to set the dns properly. [16:47:19] im goign to run out of time with you before i head out to eat shortly [16:47:30] lets see if we cannot bring in our rt triage person to assist =] [16:47:32] mutante: ^ [16:47:50] (it may be a bit earlier yet though ;) [16:47:55] early even [16:48:04] yeah, I'll ping him in an hour [16:48:13] well, if you dont care [16:48:17] i can do the dns change and set you to review [16:48:19] so you can see how its done [16:48:21] sound good? [16:48:29] (puts you one step further down road ;) [16:49:02] sure, thanks! [16:50:12] Hm, I see Deskana isn't here... [16:50:56] !log Jenkins is busy reloading, should be back around 5:20pm UTC. Don't kill it meanwhile it is busy reading a bunch of files :-( [16:51:03] got to commute out unfortunately [16:51:10] but jenkins should be fine in half an hour or so [16:51:12] Logged the message, Master [16:52:29] (03PS1) 10RobH: RT: 6477 labnet1001 dns [operations/dns] - 10https://gerrit.wikimedia.org/r/102170 [16:54:29] andrewbogott: i added you for that review above, so the mgmt is in there already, and that adds the production dns [16:54:37] so then all dns stuff will be done [16:55:02] twkozlowski: You called for me? :) [16:55:04] i've left unmerged, so you can do that stuff =] (So you will review, then login to ns0.wikimedia.org as root and run authdns-update) [16:55:17] no need to forward key, paravoid setup the new ones to work smarter than that [16:55:29] it also gives a nice diff before merging [16:55:35] DGarry: nothing urgent :) [16:55:48] DGarry: just wanted to say that OAuth if FUCKING AWESOME! 
[16:56:05] is * :) [16:56:13] <^d> RobH: Oh btw, manybubbles and I are done with arsenic. We rewrote stuff to use the job queue so we don't overload a single box anymore :) [16:56:23] RobH: thx [16:56:38] <^d> RobH: So its yours again :) [16:56:44] twkozlowski: I'm very pleased you like it! I'll pass your feedback on to the engineers. :) [16:57:57] ^d: can you put that in any kind of RT ticket and assign to me, im walking out door now =] [16:58:04] just used CropTool with OAuth for the first time, DGarry [16:58:26] the experience is first-class; pure pleasure [16:58:45] <^d> RobH: Will do [16:58:51] !log Jenkins is back up [16:59:04] Logged the message, Master [16:59:15] twkozlowski: My favourite thing about things like CropTool using OAuth is that the actions are correctly attributed to your account instead of a random bot. [16:59:52] (03PS2) 10Andrew Bogott: RT: 6477 labnet1001 dns [operations/dns] - 10https://gerrit.wikimedia.org/r/102170 (owner: 10RobH) [17:00:17] andrewbogott: what's up, i've been pinged as triage person [17:00:23] DGarry: Random bots have feelings too!! [17:00:43] mutante: sometime soon I'm going to need training about how to do an OS install on a new server. [17:00:48] Reedy: Then they can authorise OAuth and use it to rotate their own images. :P [17:00:52] mutante: at your convenience... [17:00:53] OAuth doesn't discriminate! [17:01:00] DGarry: that's true, I think I just focused on the way you authenticate the tools etc. [17:01:11] (03CR) 10Andrew Bogott: [C: 032 V: 032] RT: 6477 labnet1001 dns [operations/dns] - 10https://gerrit.wikimedia.org/r/102170 (owner: 10RobH) [17:01:15] andrewbogott: as long as it's Dell ..:) [17:01:57] DGarry: it's easy and swift, finally a human(e) way to do things :) [17:03:15] mutante: I think it is... 
[17:03:30] So to start I need to get on mgmt which… I think I've done before but am currently failing at [17:04:33] andrewbogott: new docs: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Installation my usual cheat sheet i used to use when i did them in the past, it's been a while: https://wikitech.wikimedia.org/w/index.php?title=Build_a_new_server&redirect=no [17:04:42] andrewbogott: just saw it's the redirect now [17:05:04] that once was Ben's page, i used to always go through that [17:05:10] * andrewbogott reads [17:05:19] so using the opportunity there's so many people in here... [17:05:24] https://gerrit.wikimedia.org/r/#/c/100825/ [17:05:31] how do I tell which MW version this will be included in? [17:05:34] https://wikitech.wikimedia.org/w/index.php?title=Build_a_new_server&action=history [17:06:35] andrewbogott: https://wikitech.wikimedia.org/w/index.php?title=Build_a_new_server&oldid=74562 [17:06:38] twkozlowski: https://wikitech.wikimedia.org/wiki/Deployments [17:07:13] twkozlowski: it should be in wmf7 I think. [17:07:21] andrewbogott: so, management.. it should just be servername.mgmt [17:07:26] mutante: there are a bunch of files in puppet/files/dhcpd, which ones do I update? [17:07:32] andrewbogott: unlesss you first need to create those as well [17:07:39] dunno, I assume so? [17:07:41] i think they usually "came with that" [17:07:47] oh, ok. [17:07:50] legoktm: Yes, but how do I tell it's wmf7? :) [17:07:50] from provisioning [17:07:53] just mgmt though [17:07:58] If I can connect to mgmt does that mean that dhcpd is already done? [17:08:06] andrewbogott: no [17:08:17] andrewbogott: first get the MAC address from mgmt [17:08:20] ok, then… how do I know if I need to do dhcpd or not? [17:08:20] then use it in DHCP [17:08:47] And this would be admin@ or mgmt@? [17:08:52] andrewbogott: you check in public puppet, files/dhcpd/ [17:09:03] twkozlowski: it was merged on the 12th, so that friday, it went into the wmf7 branch. 
For most things, they go into the branch that is deployed to testwikis every friday [17:09:17] Ah, root@ [17:09:26] andrewbogott: most likely you want the one with ttyS1-115200 [17:09:29] grep for examples [17:09:40] andrewbogott: yes, root [17:10:26] legoktm: s/friday/thursday/ [17:10:33] andrewbogott: i should have added, before it has a server name, it should be in DNS as .mgmt [17:10:38] vendor tag [17:10:39] There's a no-deploys-on-Fridays rule [17:10:49] RoanKattouw: ohright. twkozlowski: I meant thursday* [17:10:51] andrewbogott: and if it is, you want to give it the real name [17:11:06] paravoid, any objection to me merging this today? https://gerrit.wikimedia.org/r/#/c/101431/ its working: [17:11:09] legoktm: 19:35 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: phase1 wikis to 1.23wmf7 [17:11:14] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=&vl=&x=&n=&hreg%5B%5D=cp1048.eqiad.wmnet&mreg%5B%5D=kafka.rdkafka.topics.*.txmsgs.per_second&gtype=stack&glegend=show&aggregate=1 [17:11:30] huh, Jenkins is now telling me the dns change didn't merge, even though I can see that it did :( [17:11:35] legoktm: the patch was merged at 7:51 PM [17:12:10] oh uck, I think gerrit is showing me local timezones [17:12:32] andrewbogott: because you did both manual, Verified and Code Review [17:12:48] andrewbogott: let just jenkins do the verify [17:12:55] so https://wikitech.wikimedia.org/wiki/Server_Admin_Log uses UTC? [17:12:56] I figured jenkins wouldn't verify in that repo [17:13:01] it should [17:13:06] andrewbogott: things have become better:) [17:13:14] paravoid, would you be able to have the zero.json and proxies.json files refreshed on the varnishes? 
i added my ip to a config file to do some testing, but i'm not sure if it made its way in yet [17:13:24] then it would mean the patch was merged an hour before Reedy deployed wmf7 to testwikis [17:13:28] ottomata: no objections [17:13:37] but he would have made the branch earlier [17:13:49] * twkozlowski confused [17:13:55] well, anyway, what should I do with that patch now? Gerrit is confused and stuck :( [17:13:56] paravoid, thanks [17:14:28] twkozlowski: ok, it's not in wmf7 :/ https://github.com/wikimedia/mediawiki-core/tree/wmf/1.23wmf7/includes/revisiondelete [17:14:32] dr0ptp4kt: the cron runs every 5' [17:14:45] (compare with master: https://github.com/wikimedia/mediawiki-core/tree/master/includes/revisiondelete) [17:14:45] legoktm: oh no :-( [17:15:00] ottomata: looks good :) [17:15:08] twkozlowski: you can talk to greg-g about getting it backported, seems like a pretty big bug [17:15:37] (03CR) 10Dzahn: "you added a duplicate here" [operations/dns] - 10https://gerrit.wikimedia.org/r/102170 (owner: 10RobH) [17:15:37] legoktm: not sure about the exact scope; I can't get three files to unsuppress on Commons [17:15:47] (03PS1) 10Dan-nl: beta: copy-upload-domains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102173 [17:16:01] andrewbogott: you added a duplicate there in DNS [17:16:02] paravoid, yeah, hoping to (not?) rule out the cronjob [17:16:10] andrewbogott: see all lines that start with 171 [17:16:15] mutante: the problem isn't that there are duplicates -- the problem is that that patch is already merged. And gerrit is trying to merge it again [17:16:19] hence the patch is 100% duplicate? [17:16:35] hm, I think? [17:16:38] * andrewbogott looks [17:16:41] hmm [17:17:08] oh, hm... [17:17:21] looks like there were already multiple 171s [17:17:33] also, spaces vs. 
tabs, but i dont know what our style guide is for zone files [17:17:41] just looks messy if its mixed in gerrit [17:17:46] note that I did not write that patch [17:17:53] legoktm: but thanks anyway, I'll keep in mind to check Github next time [17:18:06] andrewbogott: i know, just commenting the patch itself [17:19:16] (03PS1) 10Andrew Bogott: Remove duplicate for labnet [operations/dns] - 10https://gerrit.wikimedia.org/r/102174 [17:19:31] I'm still puzzled by the fact that that patch is merged yet pending in gerrit :( [17:19:56] (03CR) 10Dzahn: [C: 031] Remove duplicate for labnet [operations/dns] - 10https://gerrit.wikimedia.org/r/102174 (owner: 10Andrew Bogott) [17:19:56] oh maybe it isn't pending anymore [17:20:34] andrewbogott: yea,hmm, i am not sure about that part when jenkins disagrees with a human after the fact [17:21:16] hashar would [17:21:36] ok anyway, back to the OS install… how do I get the MAC address from mgmt? [17:21:55] if it's a Dell, get MAC address from the mgmt console, run racadm getsysinfo; we use the first interface [17:22:00] in git, edit files/dhcpd/* (any new uses server linux-host-entries.ttyS1-115200) [17:22:05] after editing, merge and then run puppet on brewster (dhcp server) [17:22:38] legoktm: twkozlowski what to backport? [17:22:42] sorry, was in a meeting [17:22:48] Ah -- 'any new server users…' that's what I was looking for (among other things) [17:23:39] greg-g: https://gerrit.wikimedia.org/r/#/c/100825/ [17:23:42] (03PS9) 10Dan-nl: Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 [17:23:47] greg-g: https://gerrit.wikimedia.org/r/#/c/100825/, apparently it's preventing oversighters on commons from unsuppressing [17:24:04] perhaps also the English Wikipedia ones too! [17:24:11] If that's going to make this more important :) [17:24:25] (03CR) 10Dan-nl: "- added whitelisted domains to the wgCopyUploadsDomains array for commons." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [17:27:11] legoktm: twkozlowski so, the fix for a fatal is causing that behavior? might need to rope in AaronSchulz to review what his patch does and how to do it better/avoid that issue. [17:27:34] (03CR) 10Andrew Bogott: [C: 032] Remove duplicate for labnet [operations/dns] - 10https://gerrit.wikimedia.org/r/102174 (owner: 10Andrew Bogott) [17:27:50] greg-g: no, that's the fix for the fatals. [17:28:08] ah, so that needs to be backported to address the issue? [17:28:15] (03PS1) 10Andrew Bogott: Added dhcp entry for labnet1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102175 [17:28:26] yes [17:28:26] greg-g: yes, please [17:28:37] anyone online that can approve an operations/mediawiki-config for the beta cluster? [17:29:16] mutante: https://gerrit.wikimedia.org/r/102175 <- ? [17:29:16] (03CR) 10ArielGlenn: [C: 032] pmtpa squids to decommissioning.pp, decommed rt #6520 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102161 (owner: 10ArielGlenn) [17:29:18] gotcha, is the issue presenting itself in wmf6 (ie: is it expressing itself on the wikikpedias who are still on wmf6 and will be until thursday when they get wmf7) [17:29:18] https://gerrit.wikimedia.org/r/#/c/101061/ is for production [17:29:38] greg-g: it's presenting itself on Commons which is on wmf6 [17:29:44] ah, from the backtrace, looks like it's wmf6... 
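The DHCP step described above (read the first interface's MAC address from `racadm getsysinfo`, then add the host to a file under files/dhcpd/) can be sketched roughly as follows. The log does not show what the entries in linux-host-entries.ttyS1-115200 look like, so this assumes the standard ISC dhcpd `host` declaration. The helper names, the `eqiad.wmnet` domain for labnet1001, and the MAC address are placeholders for illustration, not values taken from the real change.

```python
import re


def normalize_mac(raw):
    """Return a MAC address as lower-case, colon-separated hex pairs."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", raw)
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {raw!r}")
    digits = digits.lower()
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def dhcp_host_entry(hostname, mac, domain):
    """Build an ISC dhcpd host declaration (assumed layout, see note above)."""
    return (
        f"host {hostname} {{\n"
        f"\thardware ethernet {normalize_mac(mac)};\n"
        f"\tfixed-address {hostname}.{domain};\n"
        f"}}\n"
    )


# Placeholder MAC and assumed domain, for illustration only.
print(dhcp_host_entry("labnet1001", "00-11-22-AA-BB-CC", "eqiad.wmnet"))
```

Normalising the MAC first avoids a mismatch between however the management console prints it and the colon-separated form dhcpd expects. The entry still has to be merged and puppet run on the DHCP server before a PXE boot will pick it up.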
[17:30:14] alright, so we need two backports, one to wmf6 and one to wmf7, since that patch isn't in wmf7 either [17:30:15] so unless you backport this into wmf7, we'll have to wait until wmf8, that is Dec 31 [17:30:49] (03CR) 10Dzahn: [C: 031] "looks good, you can verify MAC/vendor like here http://www.coffer.com/mac_find/?string=d4%3Abe%3Ad9" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102175 (owner: 10Andrew Bogott) [17:30:53] (03CR) 10Andrew Bogott: [C: 032] Added dhcp entry for labnet1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102175 (owner: 10Andrew Bogott) [17:31:02] greg-g: and the backport is not in wmf7 [17:31:02] er, the patch* [17:31:02] twkozlowski: legoktm can one/both of you prepare the backports? [17:31:12] to both 6&7 [17:31:18] i can do that [17:31:22] thanks [17:31:31] greg-g: that'd be helpful. I haven't heard from any Wikipedia oversighters about that, but I guess they would appreciate if they didn't have to wait until Thursday [17:31:39] right :) [17:31:49] thanks legoktm :) [17:32:20] dan-nl: ooo, it's GWToolset day \o/ [17:32:23] greg-g: https://gerrit.wikimedia.org/r/102176 and https://gerrit.wikimedia.org/r/102177 [17:32:37] :) [17:33:52] twkozlowski: finally :P [17:33:53] Reedy: online yet? :) [17:34:33] I've been online for years ;) [17:34:51] dan-nl: so https://bugzilla.wikimedia.org/show_bug.cgi?id=58224 is for you guys? [17:35:10] yes [17:35:37] Reedy: "I was online before it was cool" [17:36:17] mutante: you think lvm or raid1-lvm? I feel like the two pages you linked me to disagree on this point [17:36:18] Reedy: so, the list for this morning keeps growing :) 1) normal train 2) those two backports from legoktm and 3) gwtoolset (preferably in that order) [17:36:51] (03CR) 10Dzahn: [C: 031] "heh Ariel, that's clearly just because you are in .gr and type it all the time, no worries:) looks much better now. 
last note: no wikivoya" [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 (owner: 10ArielGlenn) [17:37:21] andrewbogott: if in doubt, trust the newer info on Server Lifecycle [17:37:53] (03PS6) 10Ottomata: Using custom ganglia module instead of Logster. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 [17:38:22] not deeps link from rev history i pointed you to, though i did that because i was not seeing that on the new page [17:38:31] (03CR) 10ArielGlenn: "No wikivoyage, at least it's not in my email with the list of pt domains, here's the copy-paste:" [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 (owner: 10ArielGlenn) [17:38:56] (03PS1) 10Andrew Bogott: Partition for labnet1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102178 [17:38:57] Shouldn't we fix the fatals first? I know they exist in both... [17:38:57] (03PS1) 10Andrew Bogott: Remove some cruft left over from ceph testing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102179 [17:39:17] (03CR) 10jenkins-bot: [V: 04-1] Remove some cruft left over from ceph testing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102179 (owner: 10Andrew Bogott) [17:39:48] (03CR) 10Dzahn: "gotcha, yea, then looks good, i had no idea about the TECH-PRO status for .pt but if we need to to not block it, do it" [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 (owner: 10ArielGlenn) [17:40:10] Reedy: sure, good point [17:40:22] twkozlowski: did you have any questions about that bug? [17:40:36] (03CR) 10Ottomata: [C: 032 V: 032] Using custom ganglia module instead of Logster. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/101431 (owner: 10Ottomata) [17:41:10] dan-nl: nope, I just wondered why it was so silent [17:41:21] but I see you added some domains now [17:42:08] yes, would you be able to merge https://gerrit.wikimedia.org/r/#/c/102173/. 
hashar usually does it, but he's not on irc atm [17:42:24] (03CR) 10Andrew Bogott: [C: 032] Partition for labnet1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102178 (owner: 10Andrew Bogott) [17:42:40] (03PS2) 10Andrew Bogott: Remove some cruft left over from ceph testing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102179 [17:42:51] twkozlowski: https://gerrit.wikimedia.org/r/#/c/101873/ [17:42:52] that list is our initial list … we expect other glams to join as we finalise their commitment to commons [17:42:53] dan-nl: Reedy does that [17:42:59] ah, ok [17:43:10] that one is for beta [17:43:14] mutante: what about that patch? :) [17:43:38] twkozlowski: telling you .pt domains are appearing first because of that TECHPRO status stuff, Ariel explained in the commit [17:43:40] Annotating some of those uris might be nice [17:43:50] twkozlowski: letting you know because you made legal buy them all, right:) [17:43:56] (03CR) 10Odder: "Well, I guess that the WMPL guys will notice it if you changed the name servers, but they might not be very happy about it." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [17:44:07] A few are obvious but others don't look so inviting [17:44:18] (03CR) 10Reedy: [C: 032] beta: copy-upload-domains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102173 (owner: 10Dan-nl) [17:44:21] mutante: oh, I did? :) [17:44:28] (03PS1) 10Ottomata: Updating varnishkafka module with ganglia python module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102181 [17:44:30] (03Merged) 10jenkins-bot: beta: copy-upload-domains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102173 (owner: 10Dan-nl) [17:44:33] (03CR) 10Andrew Bogott: [C: 032] Remove some cruft left over from ceph testing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102179 (owner: 10Andrew Bogott) [17:44:40] thanks! [17:44:54] mutante: oh yes, I see I did. 
[17:44:56] (03PS2) 10Ottomata: Updating varnishkafka module with ganglia python module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102181 [17:45:01] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka module with ganglia python module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102181 (owner: 10Ottomata) [17:45:32] (03CR) 10Dzahn: "Odder, this change isn't changing the nameserver responsible for domains, it's just removing things from our zone files that we are NOT re" [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [17:45:52] twkozlowski: yes, a ton of them:) [17:46:10] twkozlowski: i was so amazed when i saw that mail one day saying they got them practically all and at once [17:46:31] i had expected a longish trail of tickets and disusssion for each one instead that takes forever [17:46:34] same for the budget decision [17:46:43] mutante: Yes, that lady from legal told me to check the report for November [17:47:10] mutante: I need to do a puppet refresh someplace so the netbook settings are picked up? [17:47:24] andrewbogott: netbook? [17:47:27] andrewbogott: netinstall? [17:47:37] should be all brewster, yes [17:47:38] mutante: though now I check it, the .pt domains aren't mentioned at https://blog.wikimedia.org/2013/12/11/wikimedia-foundation-report-november-2013/#Domains_Obtained [17:47:44] um… netboot [17:48:07] andrewbogott: yes, dhcp and install server are same box, and still brewster last time i checked [17:48:20] is brewsted in tampa? [17:48:22] brewster? 
[17:48:26] yes [17:48:36] hm, ok [17:48:39] i might not be up2date about install server in eqiad [17:48:40] if there is one [17:48:51] (03PS1) 10Ottomata: Depending on ganglia-monitor package [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102184 [17:48:57] (03PS1) 10Ryan Lane: Enable a labs site override option for nova config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102185 [17:49:00] (03CR) 10Ottomata: [C: 032 V: 032] Depending on ganglia-monitor package [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102184 (owner: 10Ottomata) [17:49:04] twkozlowski: interesting, so the only ones NOT listed we add first?:) please gerrit comment [17:49:19] twkozlowski: also see reply on that other change to remove chapter domains [17:49:26] Yeah, I saw that [17:49:33] still not sure what to do [17:49:41] I mean, WMF *owns* that domain :) [17:50:12] (03PS1) 10Ottomata: Updating varnishkafka module with ganglia-monitor dependency [operations/puppet] - 10https://gerrit.wikimedia.org/r/102186 [17:50:21] (03PS2) 10Ottomata: Updating varnishkafka module with ganglia-monitor dependency [operations/puppet] - 10https://gerrit.wikimedia.org/r/102186 [17:50:27] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka module with ganglia-monitor dependency [operations/puppet] - 10https://gerrit.wikimedia.org/r/102186 (owner: 10Ottomata) [17:50:27] apergos: see above as well, fyi [17:50:29] hm, is brewster sick? Mutante, can you access? [17:50:54] eh? [17:52:21] apergos: see my talk with twkozlowski/odder [17:52:27] andrewbogott: hmm.. not from iron ... [17:53:00] Oh, but I can from bast1001 [17:53:09] andrewbogott: it's not sick [17:53:13] That's…. disturbing [17:53:15] andrewbogott: you can still get on it from fenari [17:53:35] but must be eqiad/pmtpa delibarate or not [17:53:47] I can still reach other pmtpa servers from iron [17:54:00] dunno but apparently we own these now [17:54:00] firewall changes via ferm? 
[17:54:16] andrewbogott: maybe related to akos' ferm additions? [17:54:22] and if there are ferm rules on a host, then iron is not a bastion so yeah, no ssh via it [17:54:27] use bast1001 [17:54:37] I thought the whole point of iron was that it /was/ a bastion :) [17:54:43] no [17:54:46] eh? [17:54:50] you are killing me [17:54:53] yes, that was definitely the pount [17:54:54] *point [17:54:55] about which bastion to use [17:54:59] ask paravoid [17:55:20] unless I am seriously hallucinating about it [17:56:22] we're not supposed to be doing key forwarding on these hosts anyways so bast1001 is just being proxied through' [17:56:43] (03CR) 10Odder: "mutante asked me to comment here, but I have no idea how I can add to the conversation." [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 (owner: 10ArielGlenn) [17:56:51] anyway… now I just have to figure out how to pxe boot [17:57:31] twkozlowski: apergos , fine so you both dont know, hehe, alright [17:58:06] I know we own the bloomin things, otherwise marmonitor wouldn't be registering them to us [17:58:09] and i know it's not you making it so chaotic process [17:58:12] neither of you [17:58:13] and I know they have this stupid tech-pro thingie [17:58:14] mutante: I think everything is in #5531? :) [17:58:24] and that is *it* :-D [17:58:54] well and I know or I think I know that if I push this out, then I can email mm to finish up [17:59:06] !log reedy synchronized php-1.23wmf6/includes/revisiondelete/RevisionDelete.php 'I0bd4a5fe9687c4261ca0f57e30f723e8bf2589ac' [17:59:20] Logged the message, Master [17:59:35] twkozlowski: yes, that's correct, so you can add to the conversation by saying "yea, those are the domains i asked for on 5531, do it":) [17:59:57] mutante: the problem is, I didn't know you actually bought them. 
[17:59:59] apergos: he's correct, those are listed on ticket, still +1 [18:00:21] I only know about those in the report, that legal lady told me you bought some, and didn't buy others [18:00:27] so this is the thing, unless there's a purchase order or an invoice or something in rt [18:00:33] how do any of us know we bought them? [18:00:46] we don't [18:00:54] and that's unsolvable right this instant [18:00:58] because we pay somebody to handle it for us [18:01:01] (but it would be nice to be solvable real dang soon) [18:01:02] well, I checked wikisource.pt and it shows WMF data if you whois it [18:01:10] and they are not proactive [18:01:17] about telling us [18:01:26] there should be newdomains, like newprojects@, or whatever [18:01:32] !log reedy synchronized php-1.23wmf7/includes/revisiondelete/RevisionDelete.php 'I0bd4a5fe9687c4261ca0f57e30f723e8bf2589ac' [18:01:36] so [18:01:39] the real upshot is [18:01:43] heck, half of the time we need to tell THEM [18:01:47] Logged the message, Master [18:01:47] to also change the NS to us [18:01:51] when we buy stuff [18:01:57] and find out by accident [18:02:02] are these symlinks all I need? and if so, can I +2 it and get the ball back to them? [18:02:09] yeah 'accident', meh [18:02:37] apergos: yes, well, they are all you need for a WMF error page instead of nothing found [18:02:45] depending on apache config [18:02:48] that will do for them [18:02:50] but that's not bad [18:02:55] because they are new and haven't been used [18:02:58] https://blog.wikimedia.org/2013/12/11/wikimedia-foundation-report-november-2013/#Domains_Obtained lists some domains [18:03:04] all they need is a working primary dns [18:03:06] but .pt ones aren't mentioned :-( [18:03:06] and it just tells people we have them, and they are at WMF already [18:03:25] all right. have finger on the +2 trigger [18:03:26] apergos: then that should really be all, yes [18:03:32] speak now or...
don't :-P [18:03:39] (03PS3) 10ArielGlenn: add pt domains [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 [18:03:41] I lie, have finger on rebase [18:04:11] hm, 'PXE-E51: No DHCP or proxyDHCP offers were received.' [18:04:13] (03CR) 10ArielGlenn: [C: 032] add pt domains [operations/dns] - 10https://gerrit.wikimedia.org/r/101873 (owner: 10ArielGlenn) [18:05:16] (03PS1) 10Ottomata: Sorting metric_descriptions before outputing pyconf file [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102187 [18:05:37] (03CR) 10Ottomata: [C: 032 V: 032] Sorting metric_descriptions before outputing pyconf file [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102187 (owner: 10Ottomata) [18:05:40] (03PS1) 10Jdlrobson: Disable VisualEditor experimental mode in mobile [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102188 [18:06:15] (03PS1) 10Aaron Schulz: Add 1 more default job runner process per server [operations/puppet] - 10https://gerrit.wikimedia.org/r/102189 [18:06:21] andrewbogott: tail -f the dhcpd log on brewster [18:08:01] mutante: any idea where that is? [18:08:31] Oh, it's just syslog [18:09:24] mutante: 'no free leases' [18:09:54] (03PS1) 10Ottomata: Updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102190 [18:10:02] sounds like a vlan issue [18:10:14] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102190 (owner: 10Ottomata) [18:10:41] apergos 'vlan issue' = 'LeslieCarr issue'? [18:10:58] or rob or chris [18:11:10] !log deleted php-1.22wmf17 from tin per reedy [18:11:10] depending on how much needs to be done [18:11:23] (or me or para void or maybe a kosiaris too) [18:11:25] andrewbogott: what apergos said:) [18:11:26] Logged the message, Master [18:11:45] what server are you installing? sorry, I came in at the end of this conversation though [18:11:54] labnet1001 [18:12:04] is it a reclaimed host? 
it was something else? [18:12:07] sure its labs vlan something [18:12:15] to separate from prod [18:12:25] It might be new, or newish [18:12:26] brand new host afaik [18:12:45] * apergos looks in dns [18:12:51] but dont know for real, racktables should tell and find it by vendor ID [18:12:52] ah [18:13:03] brand new host that needs to be in the labs vlan ? [18:13:08] the one you changed it from in DNS [18:13:17] LeslieCarr: yes [18:13:23] 10.64.20.13 [18:13:28] that's the ip so [18:13:56] (03PS1) 10Ottomata: Making sure generate pyconf runs before replace pyconf [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102191 [18:14:20] labs-hosts1-b-eqiad seems right (hope it's in the right place for that) [18:14:42] (03CR) 10Odder: "Yeah, sorry. I don't think WMPL is planning to change their nameservers from 42.pl to something else, so you should be free to merge this." [operations/dns] - 10https://gerrit.wikimedia.org/r/86659 (owner: 10Dzahn) [18:14:46] (03PS2) 10Ottomata: Making sure generate pyconf runs before replace pyconf [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102191 [18:15:07] LeslieCarr: andrewbogott needs this to be able to install from brewster https://gerrit.wikimedia.org/r/#/c/102175/1/modules/install-server/files/dhcpd/linux-host-entries.ttyS1-115200 [18:15:12] hrm, going to look it up in racktables [18:15:21] (03CR) 10Ottomata: [C: 032 V: 032] Making sure generate pyconf runs before replace pyconf [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/102191 (owner: 10Ottomata) [18:15:24] d4:be:d9:af:66:c2; [18:16:14] wait, D4 BE D9 AF 66 BA is what I'm seeing... 
[18:16:16] so it's eth0 looks correct [18:16:19] (03PS1) 10Ottomata: Updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102192 [18:16:44] but the eth1-3 aren't -- they need to be in a bundle and were just all put into the wrong vlan [18:16:48] yuck [18:16:58] let me double check the vlan layout of tampa [18:17:01] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/102170/ [18:17:22] andrewbogott: wrong interface? note how it said we use the first one [18:17:28] it's in row c [18:17:33] that was the first one... [18:18:31] it's WMF3643 for racktables [18:18:38] per https://gerrit.wikimedia.org/r/#/c/102170/2/templates/10.in-addr.arpa [18:18:39] well there's your problem [18:18:49] it's in row c [18:18:52] and the ip was put in row b [18:19:06] need to move it to row c's ip range and it should work fine [18:19:17] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka module [operations/puppet] - 10https://gerrit.wikimedia.org/r/102192 (owner: 10Ottomata) [18:19:33] !log reedy updated /a/common to {{Gerrit|I5ae36ae21}}: EasyTimeline support for private wikis via img_auth [18:19:38] (03PS1) 10Reedy: All non wikipedias to 1.23wmf7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102193 [18:19:48] Logged the message, Master [18:20:38] * andrewbogott doesn't know how to do that [18:20:57] so if you go to the dns git repo [18:21:02] go to the rdns file for 10/8 [18:21:24] line 1710 starts the labs-hosts1-c-eqiad section [18:21:41] (03PS1) 10ArielGlenn: remove amssq31-46 from dsh and other squid related manifests [operations/puppet] - 10https://gerrit.wikimedia.org/r/102194 [18:21:57] choose the next available ip in that range (looks like 10.64.37.10) and assign that to labnet1001 [18:22:06] then find the other labnet1001 ip and remove that entry [18:22:22] then go to the wmnet forward dns file and change labnet1001's ip in there [18:22:28] do you know how to push dns changes ? 
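The DNS steps described above (pick an address in the row's range, fix the reverse entry, then fix the forward entry) can be sketched in Python. This is only an illustration, not the operations/dns tooling: the /24 prefix lengths for the two rows and the forward name labnet1001.eqiad.wmnet are assumptions, since the log gives only the addresses 10.64.20.13 and 10.64.37.10, and the helper names are hypothetical.

```python
import ipaddress

# ASSUMPTION: the log names the two rows but never states their prefix
# lengths; /24 is used here purely for illustration.
ROW_SUBNETS = {
    "labs-hosts1-b-eqiad": ipaddress.ip_network("10.64.20.0/24"),
    "labs-hosts1-c-eqiad": ipaddress.ip_network("10.64.37.0/24"),
}


def row_for(ip):
    """Return the name of the (assumed) row subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in ROW_SUBNETS.items():
        if addr in net:
            return name
    return None


def ptr_record(ip, fqdn):
    """Build a fully qualified PTR record line for ip pointing at fqdn."""
    owner = ipaddress.ip_address(ip).reverse_pointer
    return f"{owner}. 1H IN PTR {fqdn}."


# The address first assigned sits in row b; the replacement sits in row c.
print(row_for("10.64.20.13"))
print(row_for("10.64.37.10"))
# Hypothetical forward name; the log only shows the .mgmt variant.
print(ptr_record("10.64.37.10", "labnet1001.eqiad.wmnet"))
```

A check like row_for() would have flagged the original mistake, where a host racked in row c was given an address from row b's range.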
[18:23:01] I know how to push, but… 10.in-addr.arpa looks totally undifferentiated to me, I'm scrolling madly looking for landmarks [18:23:27] There are two references to labnet1001 in that file [18:24:42] oh, the second reference is for the mgmt interface [18:24:57] the reference on line 1515 is the one to delete [18:25:04] 171 1H IN PTR labnet1001.mgmt.eqiad.wmnet. <- mgmt? [18:25:12] also... i just realized... do all of the labs machines need ot be on the same instances vlan ? [18:25:15] And how is it possible that multiple nodes share the same address there? [18:25:22] because currently it's two separate vlans [18:25:32] and we had been resisting bridging the vlans across multiple rows [18:25:36] I would think that we'd want all the labs boxes on the same vlan. [18:25:40] Ryan_Lane: correct? [18:25:57] yes [18:26:03] that's management [18:26:03] it's a flat network [18:26:29] we can't split the virtual networks across vlans as far as I know [18:26:30] (03CR) 10Reedy: [C: 04-1] "GWToolset needs adding to extension-list and at the same time can be removed from extension-list-labs" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [18:26:32] how does being management allow multiple boxes to share the same IP? [18:26:47] that's not multiple boxes, labnet is wmf3643 [18:26:52] (03CR) 10ArielGlenn: [C: 032] remove amssq31-46 from dsh and other squid related manifests [operations/puppet] - 10https://gerrit.wikimedia.org/r/102194 (owner: 10ArielGlenn) [18:26:58] ah, ok. 
[18:27:42] (03PS1) 10Reedy: Fix indenting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102195 [18:27:54] (03PS2) 10Reedy: Fix indenting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102195 [18:28:01] (03CR) 10Reedy: [C: 032] Fix indenting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102195 (owner: 10Reedy) [18:28:13] andrewbogott: that's what i meant about them coming named as vendor tag from provisioning [18:28:16] (03PS1) 10Andrew Bogott: Moved labnet to row c, where it is. [operations/dns] - 10https://gerrit.wikimedia.org/r/102196 [18:28:23] before you give it a name [18:28:37] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/102196/ [18:28:58] (03CR) 10Lcarr: [C: 04-1] "you also need to fix forward dns" [operations/dns] - 10https://gerrit.wikimedia.org/r/102196 (owner: 10Andrew Bogott) [18:29:17] (03Merged) 10jenkins-bot: Fix indenting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102195 (owner: 10Reedy) [18:29:58] (03PS2) 10Andrew Bogott: Moved labnet to row c, where it is. [operations/dns] - 10https://gerrit.wikimedia.org/r/102196 [18:30:45] (03CR) 10Lcarr: [C: 04-1] "it's 10.64.*37*.10" [operations/dns] - 10https://gerrit.wikimedia.org/r/102196 (owner: 10Andrew Bogott) [18:30:49] (03PS3) 10Andrew Bogott: Moved labnet to row c, where it is. [operations/dns] - 10https://gerrit.wikimedia.org/r/102196 [18:31:00] realized that a second after I reviewed [18:31:05] hehe [18:31:36] (03CR) 10Lcarr: [C: 032] Moved labnet to row c, where it is. [operations/dns] - 10https://gerrit.wikimedia.org/r/102196 (owner: 10Andrew Bogott) [18:31:43] didn't submit, but +2 :) [18:34:50] brewster still says 'no free leases' [18:35:06] did you update dns ? [18:35:13] using authdns-update ? 
[18:35:19] on rubidium, yes [18:35:31] LeslieCarr: if you get some time today: [18:35:31] https://rt.wikimedia.org/Ticket/Display.html?id=6488 [18:35:41] andrewbogott: but its not merged [18:35:43] this isn't super urgent, but would be nice to have as we start using kafka more ofificially [18:35:57] i hope to deploy varnishkafka to the rest of the mobiles by the end of this week, if not sooner [18:36:03] oh it is, i expected the bot to say it [18:36:05] gah [18:36:39] A while back there was some talk of MAC addresses… do I need to change that someplace still? [18:36:41] hrm, it has the correct ip ... [18:36:41] (03PS3) 10Reedy: Remove ancient ArticleFeedbackTool v4 cruft [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98074 (owner: 10Nemo bis) [18:36:46] (03CR) 10Reedy: [C: 032] Remove ancient ArticleFeedbackTool v4 cruft [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98074 (owner: 10Nemo bis) [18:36:50] well double check the mac address is correct ? [18:37:01] yea, check it's really the first interface [18:37:06] and if you have multiple [18:38:01] usually the no free leases means that the mac address is incorrect [18:38:04] hmm, or can it be BIOS settings doing PXE on the wrong [18:39:00] ahha [18:39:05] so it's requesting on d4:be:d9:af:66:ba [18:39:10] or i guess tcpdump on brewster.. andrewbogott, you are not alone, getting new servers up usually takes getting around a couple hurdles [18:39:14] and the dhcpd file says d4:be:d9:af:66:c2 [18:40:04] there you go, both Dell MACs and in the same range [18:40:10] mutante: racadm getsysinfo reports various mac addresses, I don't know what you mean by 'first'. 
[18:40:10] tells you its the box but other interface [18:40:12] NIC1 Ethernet = d4:be:d9:af:66:ba [18:40:25] But earlier in the output it says MAC Address = d4:be:d9:af:66:c2 [18:40:30] andrewbogott: well now you know which one you are trying to use [18:40:33] (03Merged) 10jenkins-bot: Remove ancient ArticleFeedbackTool v4 cruft [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98074 (owner: 10Nemo bis) [18:40:33] just take that one? [18:40:45] Reedy: shell requests day? :) [18:41:10] mutante: https://dpaste.de/FIFc [18:41:13] (03PS1) 10Ottomata: Updating varnishkafka ganglia view with per_second metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/102199 [18:41:13] andrewbogott: if the one that Leslie said its requesting does NOT look like it's the first , then i think go to BIOS [18:41:29] (03PS3) 10Reedy: Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98073 (owner: 10Nemo bis) [18:41:33] ah... 
the NIC1 ethernet is the one you want -- the MAC address is the mac address of the drac [18:41:34] look at that and tell me what you mean by 'first' [18:41:39] (confusing, i know) [18:41:49] (03PS2) 10Ottomata: Updating varnishkafka ganglia view with per_second metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/102199 [18:42:08] (03CR) 10Ottomata: [C: 032 V: 032] Updating varnishkafka ganglia view with per_second metrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/102199 (owner: 10Ottomata) [18:42:29] andrewbogott: what she said, you want the first of the EMBEDDED ones [18:42:37] stuff under RAC means mgmt [18:42:58] d4:be:d9:af:66:ba [18:43:10] should be in the DHCP file [18:43:42] (03PS1) 10Andrew Bogott: Change the MAC address for labnet1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102200 [18:44:02] !log reedy synchronized wmf-config/ [18:44:17] Logged the message, Master [18:45:26] (03CR) 10Dzahn: [C: 031] "correct, the first of the "embedded" NICs is right. 
as per racadm getsysinfo:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102200 (owner: 10Andrew Bogott) [18:45:51] (03PS3) 10Reedy: Clean up old CodeReview settings: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101068 (owner: 10Chad) [18:45:56] (03CR) 10Reedy: [C: 032] Clean up old CodeReview settings: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101068 (owner: 10Chad) [18:46:55] (03CR) 10Lcarr: [C: 032] Change the MAC address for labnet1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102200 (owner: 10Andrew Bogott) [18:47:30] (03Merged) 10jenkins-bot: Clean up old CodeReview settings: [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101068 (owner: 10Chad) [18:48:25] (03PS1) 10Jforrester: Enable VisualEditor by default on "phase 4" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102208 [18:49:02] (03PS4) 10Reedy: Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98073 (owner: 10Nemo bis) [18:49:08] (03CR) 10Reedy: [C: 032] Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98073 (owner: 10Nemo bis) [18:51:29] (03Merged) 10jenkins-bot: Remove ArticleFeedback leftovers from German Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98073 (owner: 10Nemo bis) [18:51:40] LeslieCarr, mutante, pxe boot seems to be doing something now. thanks [18:51:58] well, if 'TFTP open timeout' counts as doing something [18:52:14] yw [18:52:22] ottomata1: hey, so the iperf is getting filtered [18:52:42] what ports do we need destination to 239.192.1.51 ? 
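The 'no free leases' problem above came down to the dhcpd entry holding the DRAC's MAC address (d4:be:d9:af:66:c2) while the host was requesting from NIC1 (d4:be:d9:af:66:ba). A small Python sketch of that comparison follows; the function names are hypothetical and only the two addresses come from the log.

```python
import re


def normalize_mac(raw):
    """Return a MAC address as lowercase, colon-separated hex pairs.

    Accepts the spellings seen in the log, e.g. 'D4 BE D9 AF 66 BA'
    or 'd4:be:d9:af:66:c2;' (trailing punctuation is ignored).
    """
    digits = re.sub(r"[^0-9A-Fa-f]", "", raw).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {raw!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def same_mac(a, b):
    """True if both strings describe the same hardware address."""
    return normalize_mac(a) == normalize_mac(b)


dhcpd_entry = "d4:be:d9:af:66:c2;"   # what the dhcpd file contained
requesting = "D4 BE D9 AF 66 BA"     # what the host was requesting with

if not same_mac(dhcpd_entry, requesting):
    print("dhcpd entry does not match the requesting interface")
```

Normalizing first avoids false mismatches between the racktables spelling and the dhcpd spelling of the same address.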
[18:52:43] (03PS1) 10Nemo bis: Fetch only "Wikipedia" label from Netha Hussain's blog to Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/102210 [18:52:50] LeslieCarr, by the way, Brewster is doing a fair bit of 'DHCPDISCOVER from 00:23:8b:65:32:f2 via 10.23.0.3: unknown network segment' Is that anything that matters? (I know it's not related to my install…) [18:53:12] this is just for ganglia [18:53:14] so I guess 8649 [18:53:28] mutante: if you can merge the planet patch above you'd make the author quite happy :) [18:53:49] (03PS2) 10Reedy: Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [18:53:54] andrewbogott: your next step would be mapping which OS image to boot then [18:53:54] (03CR) 10Reedy: [C: 032] Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [18:54:04] andrewbogott: like around these: [18:54:06] You need to edit netboot.cfg from the puppet configuration it's a bash case statement. Make sure your hostname is matched by a regex in there. [18:54:25] sounds like you boot now but that regex doesnt get you [18:54:30] andrewbogott: eh we hould figure out which host is rebooting constantly but not too important at this moment [18:55:18] mutante, isn't that this? https://gerrit.wikimedia.org/r/#/c/102178/1/modules/install-server/files/autoinstall/netboot.cfg [18:55:49] (03PS1) 10Ottomata: rtt.avg is not per_second [operations/puppet] - 10https://gerrit.wikimedia.org/r/102212 [18:56:08] (03CR) 10Ottomata: [C: 032 V: 032] rtt.avg is not per_second [operations/puppet] - 10https://gerrit.wikimedia.org/r/102212 (owner: 10Ottomata) [18:56:12] (03CR) 10Dzahn: [C: 031] "Nemo_bis, sure no problem. 
tell the author it will be changed very soon and it's nice they are offering an on-topic URL like this" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102210 (owner: 10Nemo bis) [18:56:31] (03Merged) 10jenkins-bot: Add Item and Item_talk namespace aliases for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99405 (owner: 10Aude) [18:56:49] (03PS1) 10Ryan Lane: Add localhost permissions for labs testing. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102214 [18:57:33] andrewbogott: wait:) [18:58:03] (03CR) 10Dzahn: "yes, thats what you want, but you need to end that line with a bracket ). see example of stat1002" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102178 (owner: 10Andrew Bogott) [18:58:52] andrewbogott: missing a ), syntax error [18:59:04] omg that is the dumbest file syntax ever! [18:59:07] mismatched paren [18:59:11] :) [18:59:37] mutante, you mean (:) [18:59:40] (03PS1) 10Andrew Bogott: Added a needed ) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102216 [18:59:53] andrewbogott: hehee [19:00:08] (03CR) 10Dzahn: [C: 031] Added a needed ) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102216 (owner: 10Andrew Bogott) [19:01:11] (03CR) 10Ryan Lane: [C: 032] Add localhost permissions for labs testing. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102214 (owner: 10Ryan Lane) [19:02:12] (03PS1) 10Chad: Keep TitleKey from stealing Cirrus' prefix search [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102217 [19:02:16] <^d> manybubbles: ^ [19:03:30] ^d, manybubbles: CirrusSearch is causing exceptions in production: http://pastebin.com/PK1eKe53 [19:03:41] (03CR) 10Manybubbles: Keep TitleKey from stealing Cirrus' prefix search (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102217 (owner: 10Chad) [19:03:53] RoanKattouw: reading [19:04:18] (03PS2) 10Reedy: Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 (owner: 10Odder) [19:04:23] (03CR) 10Reedy: [C: 032] Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 (owner: 10Odder) [19:04:28] <^d> RoanKattouw: On it, will fix. [19:04:29] <^d> Easy. 
[19:04:35] (03Merged) 10jenkins-bot: Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 (owner: 10Odder) [19:04:48] (03CR) 10Andrew Bogott: [C: 032] Added a needed ) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102216 (owner: 10Andrew Bogott) [19:04:57] (03PS2) 10Andrew Bogott: Added a needed ) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102216 [19:06:01] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [19:06:19] (03PS1) 10BryanDavis: Update Scholarships configuration for deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/102219 [19:06:36] (03CR) 10Andrew Bogott: [C: 032] Added a needed ) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102216 (owner: 10Andrew Bogott) [19:07:09] (03PS2) 10Reedy: Cross-wiki backlink purging for commons file changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101106 (owner: 10Aaron Schulz) [19:07:26] <^d> manybubbles: https://gerrit.wikimedia.org/r/#/c/102220/ [19:07:32] (03CR) 10Reedy: [C: 032] Cross-wiki backlink purging for commons file changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101106 (owner: 10Aaron Schulz) [19:07:43] (03Merged) 10jenkins-bot: Cross-wiki backlink purging for commons file changes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101106 (owner: 10Aaron Schulz) [19:09:00] (03PS2) 10Reedy: All non wikipedias to 1.23wmf7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102193 [19:09:23] (03CR) 10BryanDavis: [C: 04-1] "Ready for review but doesn't need to be pushed yet. Deploy window is 2014-12-18T23:00Z." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102219 (owner: 10BryanDavis) [19:09:33] (03CR) 10Reedy: [C: 032] All non wikipedias to 1.23wmf7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102193 (owner: 10Reedy) [19:09:41] (03Merged) 10jenkins-bot: All non wikipedias to 1.23wmf7 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102193 (owner: 10Reedy) [19:10:50] RoanKattouw: we have a window tommorow to release more cirrus, would you mind if we waited until then to push the fix? [19:11:06] I'm not seeing the exception very frequently, but I certainly do see it [19:11:17] it was getting washed out by all the jobs timing out [19:11:23] Sounds fine [19:11:35] !log reedy synchronized wmf-config/ [19:11:37] cool. [19:11:45] I thought it might have been related to a 500 error someone was seeing in VE but it turns out that person was reporting something on their own wiki, not our cluster [19:11:52] Logged the message, Master [19:12:24] mutante, does autoinstall have its own log someplace? 
I'm still getting a TFTP timeout [19:12:24] RoanKattouw: merged ^d's fix for it [19:13:01] Cool [19:13:48] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All non wikipedias to 1.23wmf7 [19:14:03] Logged the message, Master [19:14:20] (03CR) 10Chad: Keep TitleKey from stealing Cirrus' prefix search (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102217 (owner: 10Chad) [19:16:33] (03CR) 10Ori.livneh: [C: 032] Add 1 more default job runner process per server [operations/puppet] - 10https://gerrit.wikimedia.org/r/102189 (owner: 10Aaron Schulz) [19:17:47] (03CR) 10Manybubbles: Keep TitleKey from stealing Cirrus' prefix search (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102217 (owner: 10Chad) [19:18:05] MatmaRex: it's Dan Garry, who's not in here right now [19:18:09] (nominally DGarry) [19:20:27] <^d> greg-g: He's in other channels :p [19:21:41] ^d: (after doing a /whois) but only one that MatmaRex is in (#mediawiki) [19:22:08] 40 Catchable fatal error: Argument 2 passed to {closure}() must be an array, null given, called in /usr/local/apache/common-local/php-1.23wmf7/includes/filerepo/LocalRepo.php on line 282 and defined in /usr/local/apache/common-local/php-1.23wmf7/includes/filerepo/LocalRepo.php on line 261 [19:22:24] eek [19:22:38] I saw a few of those during the deploy but they've mostly stopped [19:22:46] I figured they were a non-atomic deploy thing [19:23:01] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Tue 17 Dec 2013 01:21:21 PM UTC [19:23:25] All the code was there before wikis were switched [19:23:40] then I was wrong [19:23:56] still coming, but slowly [19:24:01] http://en.wiktionary.org/w/api.php?action=query&titles=File%3Aen-us-calamity.ogg&prop=imageinfo&iiprop=url&format=json [19:24:12] I'll dump a stack trace [19:24:41] I was wondering if AaronSchulz's globalusage commit did it [19:24:50] (change to config)
[19:26:00] /usr/local/apache/common-local/php-1.23wmf7/includes/filerepo/LocalRepo.php(282): [19:26:00] {closure}(Object(ForeignDBFile), NULL) [19:26:25] lunch, brb [19:26:49] !log reedy updated /a/common to {{Gerrit|Id4124ef28}}: All non wikipedias to 1.23wmf7 [19:27:06] Logged the message, Master [19:27:22] !log reedy synchronized wmf-config/ 'Revert Cross-wiki backlink purging for commons file changes' [19:27:37] Logged the message, Master [19:27:48] ori-l: that's still wrong [19:28:05] you're the *only* one who makes non-security commits on tin! [19:29:52] That wasn't a security commit [19:30:45] yes, hence 'non-' [19:31:18] I'm missing context at this point... [19:31:26] I haven't complained about it for a while [19:31:33] i know, i did it for you :P [19:31:37] the complaning [19:31:42] it's on autopilot now [19:32:40] !log reedy synchronized wmf-config/ 'Revert Revert Cross-wiki backlink purging for commons file changes' [19:32:51] wth is this findFiles vodoo? [19:32:54] voodoo? [19:32:58] Logged the message, Master [19:33:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=58587 [19:33:47] if ( $fileMatchesSearch( $file, $searchSet[$dbKey] ) ) { [19:33:55] $searchSet[$dbKey] is null [19:37:18] reverting that config change won't help btw [19:37:41] I know [19:37:48] I've already tried and re-reverted [19:38:03] The closure code just makes it harder to read/follow [19:38:59] jumping back and forth in the same function [19:39:26] oh boy. who is responsible? [19:39:34] Reedy [19:39:40] ori-l: well done [19:39:44] :) [19:39:50] Responsible for what? [19:39:53] can we poke bd808|LUNCH because his title contains the word "multimedia" [19:40:04] who wrote the code/change that brought this up [19:40:24] I didn't write it! [19:40:35] you write everything [19:40:51] It's AaronSchulzs Aaron|home [19:40:52] https://git.wikimedia.org/commit/mediawiki%2Fcore.git/687c5b557778885a58902abc0643db7017e2f082 [19:41:07] Reedy: is it just the api info query? 
[19:41:24] https://gerrit.wikimedia.org/r/#/c/97993/ [19:41:29] looks like it [19:41:45] Looks to be [19:44:54] (03CR) 10Gage: [C: 031] "Looks ok to me, +1 because I'm a noob" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [19:45:06] Non trivial revert is non trivial [19:48:54] (03CR) 10GWicke: beta: manage parsoid using upstart (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [19:50:56] (03CR) 10GWicke: [C: 031] "Anyway, looks good to me overall." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [19:52:35] oh good, comments on the upstart job [19:54:03] !log reedy synchronized php-1.23wmf7/includes/filerepo/LocalRepo.php 'Fix fatal I12513b40453573124e838d54a72a2f9a2d3de338' [19:54:20] Logged the message, Master [19:56:35] That looks to have stemmed it [19:57:51] mutante, back from lunch? [19:58:46] !log reedy synchronized php-1.23wmf7/extensions/GWToolset 'Staging' [19:59:01] Logged the message, Master [20:00:22] andrewbogott: whats the status? got an installer? [20:00:40] but partman breaks?:P [20:00:42] Oh, same as before, TFTP open timeout [20:00:51] ok [20:02:06] andrewbogott: ok, so DHCP works, and it gets an IP, right [20:02:22] and then you can talk to that IP from brewster [20:02:24] via ICMP [20:02:28] but TFTP doesnt work? [20:02:38] then it maybe firewalling [20:02:59] https://dpaste.de/tg94 [20:03:03] as you say [20:03:15] what is "frack puppet"? i see no host named frack. [20:03:26] frack stands for "fundraising rack" [20:03:30] thanks [20:03:37] it's an entirely separate piece of infrastructure for the most part [20:04:00] Jeff_Green is doing the ops side of it [20:04:01] is that for security reasons? [20:04:14] yeah [20:04:15] andrewbogott: so 10.64.37.10 to 10.64.37.1 no UDP 69 [20:04:17] cool [20:04:43] mutante: seems like. 
I'm unclear on what 10.64.37.1 is [20:05:31] !log deploying change 102285 to OpenStackManager on virt0 [20:05:45] Logged the message, Master [20:07:22] andrewbogott: 10.64.37.1 is vrrp-gw-1119.eqiad.wmnet [20:07:32] afraid at this point we would ping Leslie again [20:07:34] \o/ just killed 16 lines of configuration :) [20:07:34] which is… what? [20:07:53] gw for gateway i'm sure [20:08:03] and made the process for adding new images way simpler (and actually possible multi-region) [20:08:11] LeslieCarr: help with yet another networking problem? [20:08:15] for the labs-hosts1-c-eqiad network [20:08:32] andrewbogott: line 1738 in 19,in-addr.arpa in DNS repo [20:08:41] andrewbogott: 10.in-addr.arpa [20:11:28] andrewbogott: the issue will be that UDP 69 for TFTP doesnt go out of the VLAN over to brewster [20:11:47] (03PS10) 10Reedy: Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [20:12:35] (03PS11) 10Reedy: Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [20:12:48] (03CR) 10Reedy: [C: 032] Production configuration for GWToolset [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [20:16:07] apergos, want to help with this firewall thing? [20:18:03] watching bab5 episodes now (off the clock)... sorry [20:18:11] if it's still there tomorrow I'll poke [20:18:34] apergos: no worries [20:33:23] what's the operations-request RT queue named? [20:33:26] ops-request ? [20:33:51] ottomata: nice graphs [20:33:51] ops-requests [20:34:14] yeah finally, right? 
[20:34:21] been reading this article for the last bit [20:34:23] http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying [20:34:29] wtf Jenkins [20:34:30] wtf [20:34:31] think i will deploy to the rest of mobiles soon [20:34:34] :) [20:34:49] (03CR) 10Reedy: [V: 032] "Jenkins is AWOL" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101061 (owner: 10Dan-nl) [20:35:40] pmtpa and sdtpa are the same facility, right? [20:36:17] Different floors, same building [20:36:20] k [20:36:21] Different companies too [20:36:24] oh [20:36:29] * jgage is looking at racktables [20:36:46] IIRC equinix owns one of them now [20:40:17] LeslieCarr, back? [20:40:44] (03CR) 10Hashar: "> We want to use upstart in prod as well, so I'll probably end up stealing some or all of this code." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [20:42:15] 10 01I was just discussion with gwicke the deployment of the Math2.0. In the database there is a lot of change a new field for svg is insered and the content of the field mathml is changed furthermore the field outputhash is no longer needed. as a result it seems to be unreasonable to change the schema of the old math table. a solution could be to introduce a new table math2 that does the caching for extension [20:43:44] (03CR) 10Hashar: "Will poke Ariel to get that merged for beta cluster. Then I will unleash the code that restart parsoid automatically on beta." [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [20:44:14] physikerwelt_: your text is almost the same color as my background so I can't read it [20:44:23] did you mean to hide it? [20:44:32] (03PS1) 10RashiqAhmad: Created a favicon including 16x16, 32x32, and 48x48 versions of https://commons.wikimedia.org/wiki/File:Wikibooks-logo.svg. It substitutes wikibooks.ico at operations/mediawiki-config/docroot/bits/favicon. 
Bug: 58165 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 [20:44:55] greg-g: no I just used trillian [20:45:10] greg-g: for it looks black [20:45:25] !log reedy started scap: Rebuild l10n cache for GWToolset, remove AFT, no wikis moving onto GWToolset at this point (commented out) [20:45:26] probably good to not use colored text [20:45:30] finally [20:45:41] Logged the message, Master [20:45:52] !log finished the rolling restart of the elasticsearch cluster. it could have been done more quickly but there was no hurry. [20:46:08] Logged the message, Master [20:47:07] greg-g: there is no color option in my client [20:47:45] it shows as the same colour as everyone elses to me.. [20:51:32] beside the color... does anyone see a problem to have a second math table? [20:51:45] hey Reedy, does the merge of https://gerrit.wikimedia.org/r/#/c/101061/ put gwtoolset on commons? [20:53:19] * bd808 doesn't love the name  [20:54:12] I don't know the pros and cons of adding a table, but 'math2' is a decidedly horrible name [20:54:30] Reedy: i'll be back in an hour if you need me [20:54:47] <^d> Actually, the convention is to use "two." Case in point: "querycachetwo" [20:55:13] (03CR) 10Qgil: [C: 04-1] "Please follow the commit message guidelines in order to fix the commit subject, add a description, and a reference to the relatd bugzilla " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 (owner: 10RashiqAhmad) [20:55:32] ok thanks for the hint [20:55:45] <^d> I'm mostly joking about two. [20:55:52] <^d> It's a bad name :) [20:56:09] mathng ;) [20:56:26] that scales well by simply appending more 'ng's in the future [20:56:36] math_2k13 [20:58:25] (03PS1) 10Ottomata: Deploying varnishkafka to remaining mobile varnishes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102302 [20:59:52] errms, what happened to Gerrit? 
[21:00:17] I just got an e-mail saying Quim added me to https://gerrit.wikimedia.org/r/#/c/102297/ [21:00:53] can't see me in the list of reviewers, and the patch isn't visible in 'My reviews', either [21:02:03] paravoid: any objections to that? deploying varnishkafka to rest of mobiles? [21:02:54] (03CR) 10Faidon Liambotis: [C: 032] Deploying varnishkafka to remaining mobile varnishes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102302 (owner: 10Ottomata) [21:03:33] (03PS2) 10RashiqAhmad: Created a favicon including 16x16, 32x32, and 48x48 versions of https://commons.wikimedia.org/wiki/File:Wikibooks-logo.svg. It substitutes wikibooks.ico at operations/mediawiki-config/docroot/bits/favicon. Bug: 58165 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 [21:03:44] (03PS3) 10RashiqAhmad: Created a favicon including 16x16, 32x32, and 48x48 versions of https://commons.wikimedia.org/wiki/File:Wikibooks-logo.svg. It substitutes wikibooks.ico at operations/mediawiki-config/docroot/bits/favicon. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 [21:03:53] danke :) [21:03:59] (03CR) 10Ottomata: [C: 032 V: 032] Deploying varnishkafka to remaining mobile varnishes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/102302 (owner: 10Ottomata) [21:04:21] good luck :) [21:07:37] (03PS4) 10Odder: Created a favicon including 16x16, 32x32, and 48x48 versions of https://commons.wikimedia.org/wiki/File:Wikibooks-logo.svg. It substitutes wikibooks.ico at operations/mediawiki-config/docroot/bits/favicon. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 (owner: 10RashiqAhmad) [21:08:11] (03CR) 10Odder: [C: 031] "https://commons.wikimedia.org/wiki/File:Wikibooks-logo.svg is the logo on Commons; the icon itself looks OK to me, even better than the cu" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 (owner: 10RashiqAhmad) [21:09:11] (03PS5) 10RashiqAhmad: Create a new favicon for Wikibooks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102297 [21:11:04] (03CR) 10Qgil: [C: 031] "Isarra left a +1 above..." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99768 (owner: 10M4tx) [21:14:21] !log reedy finished scap: Rebuild l10n cache for GWToolset, remove AFT, no wikis moving onto GWToolset at this point (commented out) [21:14:36] Logged the message, Master [21:15:32] scap completed in 31m 28s. [21:16:07] !log reedy updated /a/common to {{Gerrit|I2d1c666e1}}: Production configuration for GWToolset [21:16:12] (03PS1) 10Reedy: Move GWToolset to 1.23wmf7 file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102304 [21:16:23] Logged the message, Master [21:16:30] (03CR) 10Reedy: [C: 032] Move GWToolset to 1.23wmf7 file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102304 (owner: 10Reedy) [21:16:32] (03PS6) 10GWicke: beta: manage parsoid using upstart [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [21:18:35] (03PS2) 10Reedy: Move GWToolset to 1.23wmf7 file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102304 [21:18:55] (03CR) 10Reedy: [C: 032 V: 032] Move GWToolset to 1.23wmf7 extension-list file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102304 (owner: 10Reedy) [21:19:04] (03CR) 10GWicke: "Minor tweak to the upstart config. Also upped ulimit -n slightly to 10k while we are at it." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [21:19:38] (03PS4) 10Reedy: Make officewiki's Report: namespace VE-enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101908 (owner: 10Jforrester) [21:19:46] (03CR) 10Reedy: [C: 032 V: 032] Make officewiki's Report: namespace VE-enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/101908 (owner: 10Jforrester) [21:20:16] (03CR) 10Hashar: "Awesome Gabriel! Thank you very much to have caught the trailing ampersand. Gotta test out it reload properly over ssh" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [21:20:32] Reedy: wow, that's a longer one than normal, right? [21:20:44] Not really [21:20:49] Or is it [21:20:51] I forget [21:20:59] ori-l: The time please!? [21:21:20] oh, um [21:22:08] I guess all the localisation files would've changed in both versions to remove AFT [21:22:14] So it's not unexpected [21:22:40] (03CR) 10Hashar: [C: 031 V: 031] "PS6 lets me restart parsoid from deployment-bastion over ssh (aka no leaked file descriptor). So that is a works for me!" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/99656 (owner: 10Hashar) [21:23:20] !log reedy synchronized wmf-config/ 'Enable GWToolset on commonswiki' [21:23:27] scary shit yo [21:23:30] (03PS1) 10QChris: Let Mingle card references point to Thoughtworks' instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/102306 [21:23:36] Logged the message, Master [21:25:07] * apergos peeks in [21:25:15] that could be exciting (gwtoolset) [21:25:23] yeah, here we go [21:25:25] hm, that's not very helpful: https://graphite.wikimedia.org/render?from=-1months&until=now&width=500&height=380&target=deploy.scap&uniq=0.2655685825739056&title=deploy.scap [21:26:33] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [21:27:13] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.93 ms [21:29:05] (03CR) 10Ori.livneh: [C: 031] Create "Draft" namespace on the English Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [21:29:07] wooot [21:29:08] http://f.cl.ly/items/1U2q281V1p2l2G3S0M0h/Screen%20Shot%202013-12-17%20at%204.28.37%20PM.png [21:30:45] !log csteipp synchronized php-1.23wmf7/includes 'bug58088' [21:30:51] (03CR) 10Qgil: [C: 04-1] "After looking literally closer, the 16X16 version is not as sharp as the previous one. You can see the difference in the first ow of pixel" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/99768 (owner: 10M4tx) [21:30:59] Logged the message, Master [21:31:40] jenkins is out of form today, I notice. [21:31:47] !log csteipp synchronized php-1.23wmf6/includes 'bug58088' [21:32:03] Logged the message, Master [21:32:14] greg-g: dan-nl-afk Would look like it's ok. It hasn't broken anything not directly related... 
So will need proper testing of the code [21:32:34] twkozlowski: That's one way of putting it [21:34:50] (03CR) 10Ori.livneh: [C: 032] Update Scholarships configuration for deploy [operations/puppet] - 10https://gerrit.wikimedia.org/r/102219 (owner: 10BryanDavis) [21:35:18] Reedy: that's as good as I can hope, I guess :) [21:36:30] !log csteipp synchronized php-1.23wmf6/extensions/CentralAuth 'bug57081' [21:36:32] I should eat my dinner since it was brought home for me [21:36:47] Logged the message, Master [21:37:33] !log csteipp synchronized php-1.23wmf7/extensions/CentralAuth 'bug57081' [21:37:48] Logged the message, Master [21:38:05] LeslieCarr: so wazzat mean? [21:38:09] 8649 should work? [21:38:15] but we don't know why it doesn't yet? [21:46:31] (03CR) 10Mattflaschen: [C: 031] "Ori reviewed this for merge. He noted that self-merges are pretty standard for mediawiki-config, so I'm going to do so when I deploy this" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [21:46:40] (03PS1) 10Ryan Lane: Only install mysql on openstack database node [operations/puppet] - 10https://gerrit.wikimedia.org/r/102309 [21:47:39] !log csteipp synchronized php-1.23wmf6/extensions/TimedMediaHandler 'bug56699' [21:47:56] Logged the message, Master [21:48:10] !log csteipp synchronized php-1.23wmf7/extensions/TimedMediaHandler 'bug56699' [21:48:26] Logged the message, Master [21:49:15] why is jenkins so slow now? [21:49:46] hell. is it even running? [21:50:12] Yes. With a 15-minute backlog. [21:50:26] It is emphatic, lets you check the warmt of your drink. [21:50:35] Look at it from the bright side! 
[21:50:36] Ryan_Lane: Every day around lunchtime l10n-bot submits hundreds of commits, causing congestion [21:50:37] -_- [21:50:47] Antoine has nefarious plans for making this better [21:50:49] fuck it, I'm +2/merging it [21:50:53] (lunchtime PST I mean) [21:51:00] (03PS2) 10Ryan Lane: Only install mysql on openstack database node [operations/puppet] - 10https://gerrit.wikimedia.org/r/102309 [21:51:06] (03CR) 10Ryan Lane: [C: 032 V: 032] Only install mysql on openstack database node [operations/puppet] - 10https://gerrit.wikimedia.org/r/102309 (owner: 10Ryan Lane) [21:51:51] lesson: don't work over lunch? [21:52:14] it's 5PM here [21:52:24] (03PS1) 10Mattflaschen: Add VisualEditor to the draft namespace [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102311 [21:53:36] don't work past 3 then? >.> [21:53:49] (03PS1) 10Ottomata: Setting log_statistics_interval to 15 for varnishkafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/102312 [21:53:55] (03CR) 10Mattflaschen: "Per notes at https://gerrit.wikimedia.org/r/#/c/97675/ and https://www.mediawiki.org/wiki/Draft_namespace" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102311 (owner: 10Mattflaschen) [21:53:56] dan-nl-afk: when you're back, let me know how your testing goes [21:53:59] (03CR) 10Jforrester: [C: 04-1] "Can't this just be squashed into Ib56f1085eea106de6 for clarity?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102311 (owner: 10Mattflaschen) [21:54:03] (03PS2) 10Ottomata: Setting log_statistics_interval to 15 for varnishkafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/102312 [21:54:23] (03CR) 10Ottomata: [C: 032 V: 032] Setting log_statistics_interval to 15 for varnishkafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/102312 (owner: 10Ottomata) [21:54:58] * bd808 sees GWToolset listed on [[:commons:Special:Version]] \o/ [21:55:36] oh, we're on wmf7 now [21:56:36] yay! legoktm, Aaron|home! 
[21:56:44] twkozlowski: :D [21:56:45] that was a good fix, that one. [21:57:05] (03PS2) 10Mattflaschen: Add VisualEditor to the draft namespace [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102311 [21:58:08] (03CR) 10Mattflaschen: "Alright, I'll squash them." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102311 (owner: 10Mattflaschen) [21:58:15] yeah, l10n-bot should run at something like midnight pst :/ [21:58:36] so only the australian folks could get annoyed :> [21:59:07] MatmaRex: midnight PST is 9 AM our time :) [22:00:28] 9 am is like middle of the night anyway [22:01:42] hey Reedy, greg-g, any issues? [22:02:38] none that we see, any from your side? [22:03:07] !log ori synchronized php-1.23wmf7/resources/startup.js 'I3d5bcf10e: startup.js: log current time as global 'mediaWikiLoadStart'' [22:03:21] need someone to add dan-nl and DivadH to the gwtoolset group [22:03:24] Logged the message, Master [22:04:13] dan-nl: isn't maartaanaaaaa an admin of some sort on commons? [22:04:27] i'll go into the commons room and sort it [22:04:30] (I always forget where the extra a's go, so I put them everywhere) [22:04:30] dan-nl: any more right you need? [22:04:39] dan-nl: I already gave you gwtoolset [22:04:41] no that should be it [22:04:45] (03PS7) 10Mattflaschen: Create "Draft" namespace on the English Wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [22:04:57] twkozlowski: thanks, see it now [22:05:14] well, I forgot autopatroller, so here it goes too [22:05:14] (03Abandoned) 10Mattflaschen: Add VisualEditor to the draft namespace [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102311 (owner: 10Mattflaschen) [22:06:14] PROBLEM - Puppet freshness on cp1048 is CRITICAL: Last successful Puppet run was Mon 16 Dec 2013 06:59:39 PM UTC [22:06:15] (03CR) 10Mattflaschen: "I squashed in the VisualEditor change, instead of doing it separately." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [22:06:30] * dan-nl running an initial gwtoolset test [22:07:19] (03PS1) 10Yurik: Handle HTTPS for Zero traffic [operations/puppet] - 10https://gerrit.wikimedia.org/r/102316 [22:07:53] (03CR) 10Jforrester: [C: 031] "VE changes look good." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97675 (owner: 10MZMcBride) [22:08:08] (03PS1) 10coren: Two minor bugfixes to maintain-replicas.pl [operations/software] - 10https://gerrit.wikimedia.org/r/102318 [22:08:15] paravoid, its all you now ^^^ :) [22:08:25] 2013-12-17T22:28:47 Reedy (Talk | contribs | block) changed group membership for User:Reedy from autopatroller to autopatroller and GWToolset user [22:08:31] haha clever :) [22:11:41] mutante, any idea why msyed's patch isn't merged yet? Should I do it? [22:15:24] !log restarted apache on zirconium for scholarships vhost [22:15:36] greg-g, Reedy, so far so good … david wants to prepare a data set to test with tomorrow. thus far i have been able to upload a data set and save a mapping for it. that means the filebackend is working as expected [22:15:41] Logged the message, Master [22:15:56] dan-nl: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/GWToolset,n,z [22:16:05] great [22:16:18] Reedy, as far as i understand it, this also needs to be merged in order for the jobs to run https://gerrit.wikimedia.org/r/#/c/101058/ [22:16:28] dan-nl: Folks in #wikimedia-commons are asking lots of questions [22:16:52] !log ori synchronized php-1.23wmf6/resources/startup.js 'I3d5bcf10e: startup.js: log current time as global 'mediaWikiLoadStart'' [22:16:57] dan-nl: Need to poke someone from ops or ori-l (who looks to be busy atm) [22:17:01] what is the story with knams? same facility as esams? 
[22:17:07] jgage: nope [22:17:09] Logged the message, Master [22:17:10] (03PS6) 10Andrew Bogott: create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:17:12] Reedy: look at what? [22:17:22] ori-l: https://gerrit.wikimedia.org/r/#/c/101058/ [22:17:32] jgage: IIRC knams only has networking kit in [22:17:39] ok [22:18:04] hashar: is that good to merge from your perspective? [22:18:14] hashar: "Add runner for GWToolset jobs", I mean [22:19:23] ori-l: a new job added to the runjobs.sh loop isn't it ? [22:19:26] no clue honestly [22:19:45] I thought there was already patch [22:20:29] (03CR) 10Andrew Bogott: "Yeah, I don't understand about the class wrapper either, hopefully ryan will comment." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83768 (owner: 10Dzahn) [22:20:30] I can merge it if there's someone in platform who can vouch for it; I can't look at it in detail myself at the moment. [22:20:34] yeah https://gerrit.wikimedia.org/r/#/c/101058/ [22:21:03] I think it is fine to pass unknown job types to the runJobsLoopServices [22:21:15] but haven't looked at the runJobs.php code [22:21:32] I guess it will simply not find such jobs in production and continue [22:21:42] (03CR) 10Aaron Schulz: [C: 031] Add runner for GWToolset jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/101058 (owner: 10Dan-nl) [22:21:48] where as it will run then on beta so I did +1 it for that [22:21:48] (03CR) 10Andrew Bogott: [C: 04-1] mediawiki_singlenode : lint cleanup (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/100790 (owner: 10Matanya) [22:22:06] AaronSchulz: if that change doesn't cause trouble for production, you can go ahead and +2 it [22:22:18] AaronSchulz: err, get someone from ops to merge it in. 
[22:22:37] that would at least enable the jobs to run on beta [22:23:13] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Tue 17 Dec 2013 01:21:21 PM UTC [22:23:22] (03CR) 10Ori.livneh: [C: 032] Add runner for GWToolset jobs [operations/puppet] - 10https://gerrit.wikimedia.org/r/101058 (owner: 10Dan-nl) [22:23:56] thanks [22:26:03] (03PS1) 10Ottomata: Setting up nrpe alert for varnishkafka process [operations/puppet] - 10https://gerrit.wikimedia.org/r/102325 [22:26:16] dan-nl: sorry not familiar with that part of our infrastructure :( [22:26:24] + it has a bunch of very scary scripts :D [22:26:35] (03PS2) 10Ottomata: Setting up nrpe alert for varnishkafka process [operations/puppet] - 10https://gerrit.wikimedia.org/r/102325 [22:27:21] dan-nl: ran puppet on deployment-jobrunner08 to update the job loop shell script there. [22:27:34] notice: /Stage[main]/Mediawiki::Jobrunner/Service[mw-job-runner]: Triggered 'refresh' from 1 events [22:27:48] dan-nl: so the gwtoolset should be proceeded by the instance now \O/ [22:27:49] cool, thanks [22:28:15] dan-nl: I guess you could tell by looking at /data/project/logs/runJobs.log [22:28:29] could -> can [22:28:36] cool, hashar how can i watch that log on production? [22:28:44] you cant [22:28:54] well you need access on the production cluster [22:29:03] (03CR) 10Andrew Bogott: [C: 04-1] Start wikidata puppet module for builder (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96552 (owner: 10Addshore) [22:29:19] so your only way is to proxy with someone else who will look at the logs for you, remove the private data from the log and copy paste to you to some place :D [22:30:22] dan-nl: grepping prod logs right now [22:30:34] oh, no need now [22:30:35] !log ori synchronized php-1.23wmf7/extensions/NavigationTiming 'I933a1e3a2: Add 'mediaWikiLoadComplete' measurement' [22:30:42] (03CR) 10coren: [C: 032 V: 032] "Tested to work and reflects live code." 
[operations/software] - 10https://gerrit.wikimedia.org/r/102318 (owner: 10coren) [22:30:52] Logged the message, Master [22:30:57] dan-nl bath there is none obviously. I am definitely too tired [22:31:13] dan-nl: you can look at them on beta though. Ping me tomorrow if you need assistance [22:31:19] for now, I am really heading to bed. sorry [22:31:22] haven't submitted a job yet … was just hoping to be able to coordinate with david so that when he submits his job i'd be able to make sure the runJobs.log was showing what is expected [22:31:38] hashar: thanks bonne notte [22:31:48] dan-nl: anyone with deployment privs can tail the log for you; just ask here [22:31:59] cool, thanks [22:33:24] what are cwmd1-esams and cwdm1-knams? Racktables lists them as "MediaConverter" and "multiplexer", respectively [22:33:46] s/cwmd/cwdm/ [22:37:02] !log ori synchronized php-1.23wmf6/extensions/NavigationTiming 'I933a1e3a2: Add 'mediaWikiLoadComplete' measurement' [22:37:19] Logged the message, Master [22:37:25] (03CR) 10Andrew Bogott: [C: 032] create shell account for msyed [operations/puppet] - 10https://gerrit.wikimedia.org/r/102010 (owner: 10Dzahn) [22:42:48] (03CR) 10Ryan Lane: "I mostly like to wrap stuff like this in a class so that there's a possibility of less changes if the module called changes." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83768 (owner: 10Dzahn) [22:43:21] dan-nl: well the runJobs.log on beta is kept / arhicved for a few days [22:43:33] dan-nl: so you can play test and we can extract the logs later on :) [22:43:42] sounds good hashar [22:47:32] (03PS1) 10Yurik: Cleaned up Zero regexes [operations/puppet] - 10https://gerrit.wikimedia.org/r/102333 [22:50:21] bblack, hi, have you had a chance to poke at gzip patch by any chance? 
[22:50:36] (03PS1) 10Ori.livneh: Update NavigationTiming metric processor for I933a1e3a2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102334 [22:51:33] (03PS1) 10Tim Landscheidt: Labs: Remove Bots project motd [operations/puppet] - 10https://gerrit.wikimedia.org/r/102335 [22:52:09] andrewbogott: back [22:53:25] LeslieCarr: Ah, oops, I'm about to go to dinner. But, briefly… no UDP traffic allowed between labnet and, um… [22:54:02] * andrewbogott backscrolls a whole lot [22:54:32] LeslieCarr: well, ultimately brewster, but the error is about vrrp-gw-1119.eqiad.wmnet [22:54:50] (03CR) 10Ori.livneh: [C: 032 V: 032] Update NavigationTiming metric processor for I933a1e3a2 [operations/puppet] - 10https://gerrit.wikimedia.org/r/102334 (owner: 10Ori.livneh) [22:54:56] Not-very-interesting error message is here: https://dpaste.de/tg94 [22:58:41] LeslieCarr, is that enough to go on? [23:02:09] back after dinner [23:02:47] yeah [23:02:49] should be ? [23:03:01] very colorful snippet [23:04:03] Reedy, thanks for the code shower ;) [23:04:31] If your JS wasn't killing my ide.... [23:09:03] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [23:09:43] that is me [23:10:07] the graphite alert, i mean [23:11:03] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [23:11:06] Reedy, the gwtoolset js is killing your ide? [23:11:21] indeed [23:11:32] I think it's tripping up the static analysis tools [23:11:45] hmm, that's odd … komodo ice seems fine with it ... [23:12:11] PhpStorm does a lot of things [23:12:17] I've already complained at their bug tracker [23:12:24] hrm, iptables is ok [23:14:15] shall i remove the gwtoolset references from wmf-config -labs files now? 
[23:14:48] no, leave it in, so we can still have things tested there before it hits production [23:14:51] (03PS2) 10Ryan Lane: Enable a labs site override option for nova config [operations/puppet] - 10https://gerrit.wikimedia.org/r/102185 [23:14:52] (03PS1) 10Ryan Lane: Use eth0 IP rather than localhost for multi-region [operations/puppet] - 10https://gerrit.wikimedia.org/r/102345 [23:18:03] andrewbogott_afk: i'm rebooting and tcpdumping to see wtf is going on with labnet1001 [23:19:05] (03PS1) 10Brian Wolff: Have gwtoolset assignable by crats [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/102347 [23:22:18] so this is weird mutante ---it gets the dhcp request ... but not showing it's getting hte tftpd request ... but the routes are there, no firewall filters ... [23:23:20] LeslieCarr: that pretty much sums up what we found out earlier :/ [23:23:31] grrr [23:23:32] DHCP works for sure [23:23:39] i'm tcpdumping at the switch port as well now [23:23:40] and you can talk to that IP from brewster [23:23:48] per andrewb [23:23:56] well i can't be 100% certain, but iptables says i can [23:23:58] and routing says i can [23:24:32] (03PS1) 10Ryan Lane: Fix duplicate definition for openstack in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102349 [23:24:40] ok, there's the dhcp request and response [23:24:44] what does the full name mean btw, Leslie [23:24:50] vrrp-gw-1119 [23:24:50] huh ? [23:24:54] gw is gateway [23:24:59] something is routing protocol [23:25:09] and the number is vlan [23:25:10] ? [23:25:22] vrrp is a protocol we use so both routers can be the default gateway and fail between each other [23:25:25] and yep, number's vlan [23:25:36] k, then i pretty much got it, thx [23:25:42] hrm [23:25:44] actually [23:25:50] is carbon serving tftp for eqiad now ? 
[23:26:08] (03CR) 10Ryan Lane: [C: 032] Fix duplicate definition for openstack in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/102349 (owner: 10Ryan Lane) [23:26:13] thats the part i didnt know, said i may not be up2date about install server in eqiad [23:26:19] last time i did it was all brewster [23:26:24] i'm looking at the config files [23:27:14] what's the issue? [23:27:28] labnet1001 is getting an ip but tftp is timing out [23:27:37] paravoid: andrewB cant install labsnet in eqiad row c vlan [23:27:44] trying to use brewster [23:27:47] DHCP yes, TFTP no [23:28:08] 12:16 < mutante> andrewbogott: the issue will be that UDP 69 for TFTP doesnt go out of the VLAN over to brewster [23:28:48] ahha [23:28:50] andrews paste https://dpaste.de/tg94 [23:28:50] it is using carbon [23:28:54] 23:28:28.677762 IP carbon.wikimedia.org > labnet1001.eqiad.wmnet: ICMP carbon.wikimedia.org udp port tftp unreachable, length 109 [23:29:20] welll there's that mystery solved.. now as to why it's failing [23:29:31] yep, not listening on port 69 [23:29:54] !log starting atftpd on carbon [23:30:11] Logged the message, Mistress of the network gear. [23:30:11] :) [23:30:27] it probably won't work either [23:30:39] andrewbogott_afk: looks like atftpd wasn't running on carbon, which is why no tftp was working [23:30:43] well i'll restart and try again [23:30:49] since it's now running [23:30:49] that's part of the reason, yes [23:30:59] let's try it although I'm pretty sure it won't work [23:31:08] what other problems do you think are happening ? :) [23:31:17] carbon (and brewster) have a firewall now [23:31:21] i checked [23:31:25] 10/8 on port 69 are allowed [23:31:27] and tftp is a crazy protocol [23:31:30] greg-g: have you seen the user priv discussion in commons about gwtoolset? [23:31:37] you do the initial handshake on port 69 [23:31:39] oh yeah, sometimes it has random insane ports ? 
[23:31:44] dan-nl: no [23:31:46] then it opens random src/dst ports [23:32:03] hehe, so... we may have broken tftp ;) [23:32:04] this does make sense, i just said 69 UDP all the time [23:32:11] we didnt talk about random [23:32:27] yes because the handshake wasn't working [23:32:35] "modprobe nf_conntrack_tftp" should do it [23:32:36] greg-g there are two patches bawolff sorted out that i think make sense but want your feedback first [23:32:44] cool [23:33:11] they want to remove the sysop ability to add the gwtoolset group and the gwtoolset right https://gerrit.wikimedia.org/r/#/c/102343/3/GWToolset.php [23:33:38] then allow only bureaucrats the ability to add the group https://gerrit.wikimedia.org/r/#/c/102347/1 [23:33:51] seems fine to me … are you okay with us merging that now? [23:33:59] looks like it is starting ok [23:34:24] though if there's some timeout we'll know why [23:34:25] 23:34:21.319340 IP 208.80.152.171.58277 > 208.80.154.10.69: 42 RRQ "precise-installer/version.info" netascii [23:34:28] 23:34:21.319689 IP 208.80.154.10.59634 > 208.80.152.171.58277: UDP, length 64 [23:34:31] 23:34:21.559299 IP 208.80.154.10.40384 > 208.80.152.171.58277: UDP, length 64 [23:35:15] now it works [23:35:18] modprobe fixed it [23:35:20] (03PS1) 10Mwalker: Collection Renderer (Now a module!) [operations/puppet] - 10https://gerrit.wikimedia.org/r/102352 [23:35:31] I'll ping alex tomorrow to puppetize this as he sees fit [23:35:45] dan-nl: if that is all just wmf-config/CommonSettings.php and wiki user permission stuff you should get approvals from platform eng [23:36:00] dan-nl: if it makes sense to the local community, sure. Please report a bug (I haven't looked at the patches, if there is one already) so there's a paper trail with links to discussion. I'd wait on merging until there's some who can deploy it. [23:36:36] now, who can deploy it when... [23:36:44] Reedy: around? 
:) [23:36:46] k thanks [23:37:27] here are the bugs associated with the patches https://bugzilla.wikimedia.org/show_bug.cgi?id=58603 [23:37:27] https://bugzilla.wikimedia.org/show_bug.cgi?id=58607 [23:38:18] (CR) Dan-nl: [C: 1] Have gwtoolset assignable by crats [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102347 (owner: Brian Wolff) [23:38:18] (PS2) Mwalker: Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 [23:38:45] (PS2) Odder: Have sysops and and remove users from the 'gwtoolset' group [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102347 (owner: Brian Wolff) [23:38:52] (CR) jenkins-bot: [V: -1] Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 (owner: Mwalker) [23:38:53] paravoid: Leslie, thanks in the name of andrewb who is at dinner and will be happy we got him one step closer [23:39:13] (PS3) Odder: Have sysops and and remove users from the 'gwtoolset' group [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102347 (owner: Brian Wolff) [23:39:18] greg-g: ? [23:39:36] I'm waiting for my laptop to stop making a load of hot air [23:40:00] Reedy: they want you to merge CommonSettings.php that changes wiki permission stuff but it doesnt have +1s yet, not much :) [23:40:13] greg-g, Reedy looks like they want to wait a bit ... [23:40:16] thanks though [23:40:33] sorry to bother you … [23:40:40] (CR) jenkins-bot: [V: -1] Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 (owner: Mwalker) [23:41:37] (PS1) RashiqAhmad: Updated internal.ico [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102354 [23:41:47] (PS3) Mwalker: Collection Renderer (Now a module!)
[operations/puppet] - https://gerrit.wikimedia.org/r/102352 [23:43:56] (PS2) RashiqAhmad: Updated internal.ico [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102354 [23:44:47] (CR) jenkins-bot: [V: -1] Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 (owner: Mwalker) [23:46:49] (PS4) Mwalker: Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 [23:47:34] andrewbogott_afk: when you're back, we need to talk labs [23:54:09] or Ryan_Lane if you're around [23:54:09] since you are the openstack guru [23:54:24] LeslieCarr: sure. what's up? [23:54:43] I'm quickly approaching my billable quota for the week [23:54:47] uhoh [23:54:53] heh [23:55:00] (CR) Brian Wolff: [C: 1] "I think this is a good idea pending the ultimate concensus of the commons community" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/102347 (owner: Brian Wolff) [23:55:10] well we will have two networks that we should think of as completely separate.... labs nodes will want to live on both [23:55:21] (CR) jenkins-bot: [V: -1] Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 (owner: Mwalker) [23:55:26] so we could have the labs network node have connections to both, but it would be two discrete subnets [23:56:21] does neutron support multiple bridges? [23:56:54] I think I'm translating from network lingo into openstack lingo :> [23:57:24] why are they separate? [23:57:32] separate rows [23:57:35] -_- [23:57:39] why are they not on the same row? [23:58:01] redundancy, presumably? [23:58:11] this is just going to complicate things... [23:58:12] I have no idea really, but putting all of labs into one row might not be ideal?
[23:58:13] greg-g: i think we're set for now … will probably want to make some permission changes later this week or whenever we can … want to at least give it a night rest :) [23:58:27] there's no real redundancy available [23:58:51] if the network node is in one row and that row dies, everything goes [23:58:54] sure, one network node and everything, but that's more easily fixable than reracking half of the labs nodes in the future [23:59:01] greg-g: anything left on your side? Reedy ? [23:59:02] (PS5) Mwalker: Collection Renderer (Now a module!) [operations/puppet] - https://gerrit.wikimedia.org/r/102352 [23:59:33] but yes, it's an easy way out I guess :) [23:59:55] well, you can have multiple networks