[00:02:19] (PS2) Faidon Liambotis: Replace Linux RPS setting with a smart mechanism [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[00:22:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[00:25:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[00:33:48] (PS1) Cmjohnson: adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969
[00:34:21] (CR) jenkins-bot: [V: -1] adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969 (owner: Cmjohnson)
[00:36:09] (PS2) Cmjohnson: adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969
[00:37:52] (CR) Cmjohnson: [C: 2] adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969 (owner: Cmjohnson)
[00:37:53] (PS3) Faidon Liambotis: Replace Linux RPS setting with a smart mechanism [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[00:51:55] (CR) Springle: "Given that the job queries are not all equal runtime is dependent on shard size and number of wikis, Nemo's approach seems potentially mor" [operations/puppet] - https://gerrit.wikimedia.org/r/95876 (owner: MaxSem)
[01:12:06] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours
[01:23:18] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Miscellaneous+pmtpa&h=hume.wikimedia.org&jr=&js=&v=355970&m=Global_JobQueue_length
[01:23:21] wtf
[01:24:16] springle: you know db47 has a failed LD, right?
[01:24:40] paravoid: yep
[01:24:59] ok
[01:25:25] also, do you know I depooled db1050 back on Nov 8th?
[01:25:41] some LVM issue, possibly kernel bug
[01:25:41] db47 is about to be rotated out for decomm
[01:25:44] it's still depooled
[01:25:48] yes
[01:25:51] okay
[01:25:54] just making sure :)
[01:26:09] i let it catch up, but snapshot slaves are al currently (or practically) depooled
[01:26:16] lvm is ok there now
[01:26:30] how did you fix it?
[01:26:42] snapshots will become "the new tampa slaves" where we do stuff like dumps and slow queries
[01:27:03] what do you mean by "snapshots"?
[01:28:12] didn't say i fixed lvm, just that it's ok :) by the time I go to it it had taken new snapshots. they all mounted correctly etc, so i let mysql go and catch up
[01:29:08] snapshots == snapshot slaves
[01:29:11] sorry
[01:36:37] RECOVERY - DPKG on stafford is OK: All packages OK
[01:46:57] (PS1) Faidon Liambotis: search: disable the lucene_search icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/95971
[01:46:58] (PS1) Faidon Liambotis: base: add check_disk check exception for Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/95972
[01:48:19] (CR) Faidon Liambotis: [C: 2] search: disable the lucene_search icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/95971 (owner: Faidon Liambotis)
[01:48:27] (CR) Faidon Liambotis: [C: 2] base: add check_disk check exception for Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/95972 (owner: Faidon Liambotis)
[01:53:06] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[01:54:08] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[01:56:19] (PS1) Springle: db69 to s2 master db71 to s3 master depool db57 for decom depool db66 for shipping [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95973
[01:57:25] (CR) Springle: [C: 2] db69 to s2 master db71 to s3 master depool db57 for decom depool db66 for shipping [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95973 (owner: Springle)
[01:58:17] RECOVERY - Disk space on cp1064 is OK: DISK OK
[01:58:33] lol
[02:01:17] RECOVERY - Disk space on cp1059 is OK: DISK OK
[02:03:06] RECOVERY - Disk space on cp1047 is OK: DISK OK
[02:04:20] (CR) Faidon Liambotis: [C: -1] contint: migrate firewall rules to ferm (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[02:05:17] RECOVERY - Disk space on cp1061 is OK: DISK OK
[02:07:11] hmm.. modified stuff on tin /a/common, but only eof newlines removed
[02:07:57] !log LocalisationUpdate completed (1.23wmf3) at Mon Nov 18 02:07:57 UTC 2013
[02:08:14] Logged the message, Master
[02:08:27] oh er no.. actual differences. drat
[02:08:36] RECOVERY - Disk space on cp1050 is OK: DISK OK
[02:08:37] RECOVERY - Disk space on cp1046 is OK: DISK OK
[02:10:16] RECOVERY - Disk space on cp1063 is OK: DISK OK
[02:12:01] what's the best approach dealing with unstaged changes on tin /a/common?
[02:12:08] paravoid: ^ ?
[02:12:35] I'm not sure
[02:12:55] I'm sure TimStarling knows better than I do
[02:15:51] !log LocalisationUpdate completed (1.23wmf4) at Mon Nov 18 02:15:51 UTC 2013
[02:16:06] Logged the message, Master
[02:16:27] springle: maybe stash and re-apply
[02:18:33] (CR) Faidon Liambotis: [C: -1] Setting up varnishkafka on mobile varnish caches (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/94169 (owner: Ottomata)
[02:20:06] Aaron|home: that's safe enough? not too dangerous for any half-done background jobs i don't know about...
[02:20:18] i know the merge would work
[02:21:23] * springle tries it
[02:22:58] !log springle synchronized wmf-config/db-pmtpa.php 'pmtpa replication reconfig for decomissioning'
[02:23:11] Logged the message, Master
[02:29:16] RECOVERY - Disk space on cp1048 is OK: DISK OK
[02:32:06] RECOVERY - Disk space on cp1060 is OK: DISK OK
[02:32:31] (CR) Faidon Liambotis: [C: -1] "A quick search shows there's also a mobile version of msnbot, so we should make a generic solution for this." [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
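[Annotation: the stash-and-reapply workflow Aaron suggests above is easy to script. A minimal sketch in Python, assuming a working tree like tin's /a/common; the path and the --ff-only pull are illustrative, not how the deployment tooling actually updated the checkout.]

```python
#!/usr/bin/env python
"""Sketch of the 'stash and re-apply' approach for unstaged changes.

The checkout path is hypothetical; conflict handling is left to the
operator, since `git stash pop` stops and reports any conflicts.
"""
import subprocess

CHECKOUT = '/a/common'  # hypothetical path of the working tree

def git(*args):
    return subprocess.check_output(('git',) + args, cwd=CHECKOUT)

if git('status', '--porcelain').strip():
    git('stash')                # park the unstaged local edits
    git('pull', '--ff-only')    # bring the branch up to date
    git('stash', 'pop')         # re-apply; conflicts surface here
```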
[02:34:25] (PS1) Springle: correct s3 groupLoadsBySection key [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95974
[02:34:58] (CR) Springle: [C: 2] correct s3 groupLoadsBySection key [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95974 (owner: Springle)
[02:35:06] RECOVERY - Disk space on cp1045 is OK: DISK OK
[02:35:51] (CR) Faidon Liambotis: [C: 2] contint: jenkins git config core.packedGitLimit=2G [operations/puppet] - https://gerrit.wikimedia.org/r/95123 (owner: Hashar)
[02:36:06] !log springle synchronized wmf-config/db-eqiad.php 'correct s3 groupLoadsBySection key'
[02:36:18] Logged the message, Master
[02:38:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 18 02:37:59 UTC 2013
[02:38:14] Logged the message, Master
[02:39:45] (PS4) Faidon Liambotis: Replace Linux RPS setting with a smarter script [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[02:40:06] RECOVERY - Disk space on cp1051 is OK: DISK OK
[02:42:13] (PS2) Faidon Liambotis: Fix duplicate ensure in swift.pp [operations/puppet] - https://gerrit.wikimedia.org/r/95638 (owner: Akosiaris)
[02:42:27] RECOVERY - Disk space on cp1049 is OK: DISK OK
[02:42:30] (Abandoned) Faidon Liambotis: Fix duplicate ensure in swift.pp [operations/puppet] - https://gerrit.wikimedia.org/r/95638 (owner: Akosiaris)
[02:45:22] (CR) Faidon Liambotis: [C: 2] "Fair enough." [operations/puppet] - https://gerrit.wikimedia.org/r/95090 (owner: Mwalker)
[02:48:06] RECOVERY - Disk space on cp1058 is OK: DISK OK
[02:49:04] !log for index cardinality on nullable fields, testing innodb_stats_method=nulls_ignored on db1049, db1002, db1003, db1004, db1026, db1040, db1041 + reanalyzing tag_summary and change_tag
[02:49:15] Logged the message, Master
[02:50:06] RECOVERY - Disk space on cp1062 is OK: DISK OK
[03:00:17] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[03:02:17] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[03:20:46] !log adding mw1045 mw1070 mw1085 mw1165 back to the apache pool; no evidence (RT tickets, pybal comments) as why they are not pooled
[03:21:00] Logged the message, Master
[03:23:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[03:26:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[04:13:06] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours
[04:54:06] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[04:55:06] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[05:54:59] (CR) Jeremyb: "duplicate port (one is wrong?)" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[05:55:14] (CR) Jeremyb: [C: -1] contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[05:55:36] PROBLEM - Host ms-be9 is DOWN: PING CRITICAL - Packet loss = 100%
[06:00:22] Hey. So... do you need me to post the error message or do you already know there's an issue?
[06:00:29] ottomata ^
[06:00:46] Request: POST http://en.wikipedia.org/w/index.php?title=List_of_tallest_structures_in_the_world&action=submit, from 208.80.154.136 via cp1055 frontend ([10.2.2.25]:80), Varnish XID 3386140178
[06:00:47] Forwarded for: 108.133.51.103, 208.80.154.136
[06:00:48] Error: 503, Service Unavailable at Mon, 18 Nov 2013 05:59:05 GMT
[06:01:06] Request: POST http://en.wikipedia.org/w/index.php?title=List_of_tallest_structures_in_the_world&action=submit, from 208.80.154.136 via cp1067 frontend ([10.2.2.25]:80), Varnish XID 2827401770
[06:01:08] Forwarded for: 108.133.51.103, 208.80.154.136
[06:01:09] Error: 503, Service Unavailable at Mon, 18 Nov 2013 06:00:29 GMT
[06:05:06] RECOVERY - Host ms-be9 is UP: PING OK - Packet loss = 0%, RTA = 35.46 ms
[06:10:08] Sven_Manguard: it's not really otto mata i think...
[06:10:29] jeremyb: what does on RT duty mean up top then?
[06:10:53] that's not for "the site's broken", outages, etc.
[06:10:59] RT is for less urgent stuff
[06:11:06] site's broken is whoever's available
[06:11:19] see https://wikitech.wikimedia.org/wiki/RT_Triage_Duty#Duty_desk_rotation_-_who_is_next.3F
[06:12:40] the topic is generally trustworthy only if it's been changed in the last ~6 days and also rotations start on mondays (but holidays are often mondays so that maybe messes things up) so if it was set before the last monday then it's probably outdated
[06:13:08] i see no one signed up too recently onwiki
[06:14:27] anyway, about the error...
[06:14:42] Sven_Manguard: that's only on edit? or also reading?
[06:15:22] jeremyb: it was only *that page*
[06:15:36] everything else I loaded eventually, still can't load that page
[06:15:40] but the edit went through
[06:15:48] huh
[06:16:34] Sven_Manguard: you can't load it at all?
[06:16:40] i have no issues with it
[06:18:25] I stopped trying a while ago
[06:19:01] can you try again?
[06:21:24] it seems like that's a consistently expensive page. NewPP limit report agrees with what i saw. at least compared to its own talk page
[06:22:05] Sven_Manguard: pls /j #wikimedia-tech too
[06:22:19] it works now
[06:24:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[06:27:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[06:28:48] !log powercycled ms-be9, it was unresponsive; doing xfs_repair on /dev/sde1 after 'corruption detected' messages on reboot
[06:29:02] Logged the message, Master
[06:32:32] https://gdash.wikimedia.org/dashboards/reqwiki/ don't look so hot
[06:33:55] apergos: could your powercycle be related to the jump in green/blue? https://graphite.wikimedia.org/render/?title=Top%208%20FileBackend%20Methods%20by%20Max%2090th%20Percentile%20Time%20%28ms%29%20log%282%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28substr%28highestMax%28FileBackendStore.*.tp90,8%29,0,2%29%29
[06:36:41] the timing looks wrong
[06:36:48] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Swift%2520pmtpa&tab=m&vn=&hide-hf=false
[06:38:45] k
[06:39:05] filed bug 57174
[06:41:39] https://gdash.wikimedia.org/dashboards/reqwiki/ seems to be better now
[06:42:04] ok
[07:14:06] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours
[07:54:56] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:55:06] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[07:55:46] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[07:56:06] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[09:12:25] (CR) Hashar: "Turns out the real memory hog is git repack-objects, its window memory limit" [operations/puppet] - https://gerrit.wikimedia.org/r/95123 (owner: Hashar)
[09:25:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[09:28:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[09:30:33] (CR) MaxSem: "You might just want to disable HT completely: http://www.fidian.com/problems-only-tyler-has/disabling-hyperthreading" [operations/puppet] - https://gerrit.wikimedia.org/r/95963 (owner: Faidon Liambotis)
[09:31:43] for those who ignore puppet freshness spam: neon & tin:)
[09:32:14] I know neon's problem, looking into tin
[09:32:18] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[09:32:25] :(
[09:32:51] akosiaris: you're there
[09:33:00] poor amslvs1 :(
[09:33:16] I've been trying to figure out bast1001 (presume same as tin) but no joy
[09:33:17] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[09:34:40] apergos: btw we should lower the 1 hour puppet interval today back to 30 mins and see how the infra responds to that load
[09:34:50] maybe even lower
[09:34:52] that would be great
[09:35:02] let's start with 1/2 hour and we can see
[09:35:13] bear in mind neon takes a while to finish up
[09:37:35] tin issue is that puppet-agent can't read from remote server :/ Error 502 on SERVER: 502 Proxy Error. The proxy server received an invalid#015#012response from an upstream server.
[09:38:10] yes, and bast1001
[09:38:35] I should have started with terbium, it looks like a much more run of the mill issue :-/
[09:39:14] hashar, are you a root now?
[09:39:27] MaxSem: nop
[09:39:29] apergos: hashar all these seem the same
[09:39:37] apache proxy module timeouts
[09:39:39] yep
[09:39:43] fixing it now
[09:39:44] * apergos fixes the terbium typo
[09:39:50] oh?
[09:40:25] (Draft1) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[09:40:27] are you just raising the timeout?
[09:40:56] (PS1) ArielGlenn: fix typo ('mode' should have been 'group') [operations/puppet] - https://gerrit.wikimedia.org/r/95997
[09:40:57] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[09:42:06] (CR) ArielGlenn: [C: 2] fix typo ('mode' should have been 'group') [operations/puppet] - https://gerrit.wikimedia.org/r/95997 (owner: ArielGlenn)
[09:42:18] meh... i did that again ?
[09:42:34] easy to fix
[09:42:48] why puppet parser validate does not catch these errors ?
[09:42:56] they are way too simple to not catch
[09:43:16] oh come on .... align correctly :P
[09:43:38] I went into all that trouble to align => :-)
[09:43:46] RECOVERY - Puppet freshness on terbium is OK: puppet ran at Mon Nov 18 09:43:36 UTC 2013
[09:43:47] (PS1) Springle: depool db74 for move to S6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95998
[09:43:47] :-D
[09:43:55] (PS2) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[09:44:10] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[09:44:13] fine, as a penance I'll go through the whole file and clean up such things, if there are any others :-P
[09:44:26] and terbium is happy so
[09:44:41] (CR) Springle: [C: 2] depool db74 for move to S6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95998 (owner: Springle)
[09:44:54] huh... I don't wish that to my worst enemy... linting our puppet manifests
[09:46:04] !log springle synchronized wmf-config/db-pmtpa.php 'depool db74 for move to S6'
[09:46:15] Logged the message, Master
[09:46:57] (PS1) Akosiaris: Adjust proxytimeout for puppetmaster's mod_proxy [operations/puppet] - https://gerrit.wikimedia.org/r/96000
[09:48:23] (CR) Akosiaris: [C: 2] Adjust proxytimeout for puppetmaster's mod_proxy [operations/puppet] - https://gerrit.wikimedia.org/r/96000 (owner: Akosiaris)
[09:50:00] I'm only doing alignments and only in the one file
[09:50:06] I'm not *that* crazy
[09:50:29] :-)
[09:53:06] so I was sure it had to be something more complicated than the timeout
[09:53:16] that it must have been timing out for some other obscure reason :-/
[09:53:19] (PS1) Springle: db74 to S6 during pmtpa decom [operations/puppet] - https://gerrit.wikimedia.org/r/96003
[09:54:23] (CR) Springle: [C: 2] db74 to S6 during pmtpa decom [operations/puppet] - https://gerrit.wikimedia.org/r/96003 (owner: Springle)
[09:55:57] Compiled catalog for neon.wikimedia.org in environment production in 7313.52 seconds
[09:56:01] ?????????????
[09:56:22] (PS3) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[09:56:41] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[09:56:50] * aude needs sticky minus 1
[09:56:52] er??
[09:57:16] now that is a lie
[09:57:24] so tin, bast1001, fenari are going to be fixed by the change
[09:57:26] ( 7313.52 seconds )
[09:57:35] because they had compilation times of 300-500 secs
[09:57:43] but neon ????? 2 hours ?
[09:57:45] neon does not take 7000 seconds.
[09:57:46] lol
[09:57:49] that is completely bogus
[09:57:56] i sure hope so
[09:58:08] let's wait for this run and see
[09:58:11] unles omsone changed something recently (last week or two
[09:58:12] )
[09:58:36] to slow it down by an order of magnitude
[09:58:40] *smeone
[09:58:43] grrr... anyways.
[10:00:08] !log xtrabackup db50 to db74
[10:00:17] RECOVERY - Puppet freshness on tin is OK: puppet ran at Mon Nov 18 10:00:12 UTC 2013
[10:00:21] Logged the message, Master
[10:00:28] yippi!
[10:01:44] one down
[10:02:03] (PS1) ArielGlenn: fix alignments in maintenance.pp [operations/puppet] - https://gerrit.wikimedia.org/r/96005
[10:02:05] I'll run neon by hand and watch it
[10:02:06] notice: Finished catalog run in 90.43 seconds
[10:02:12] i already do
[10:02:15] ah ok
[10:02:20] so those 90 secs are on tin
[10:02:23] that is also a lie
[10:02:35] how long was it then?
[10:02:49] compilation took 490 secs
[10:02:58] and 90 secs for the catalog to be applied
[10:03:22] hm that's not exactly fixed alignment
[10:03:54] puppet-lint liked it though (for that only)
[10:04:23] are we doing spaces or tabs, anyone know any more?
[10:05:45] spaces if you are starting from scratch
[10:05:56] otherwise either stick to what the file already has
[10:06:12] or convert it to space-only
[10:06:18] nice ugly mix
[10:06:21] spaces it is
[10:10:24] (PS2) ArielGlenn: fix alignments in maintenance.pp and tabs->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96005
[10:11:26] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 451 statistics
[10:12:06] woo
[10:12:08] (CR) ArielGlenn: [C: 2] fix alignments in maintenance.pp and tabs->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96005 (owner: ArielGlenn)
[10:12:16] PROBLEM - MySQL Processlist on db1015 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 34 copy to table, 125 statistics
[10:12:20] uh
[10:13:16] RECOVERY - MySQL Processlist on db1015 is OK: OK 0 unauthenticated, 0 locked, 17 copy to table, 1 statistics
[10:13:17] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CRIT 1 unauthenticated, 1 locked, 2 copy to table, 1383 statistics
[10:13:26] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 15 copy to table, 1771 statistics
[10:13:45] i suppose those are worrysome ?
[10:13:47] that's bound to give me a merge conflict on that file isn't it
[10:14:58] if you were working in it, odds are very good yep
[10:14:59] watchlist query madness
[10:15:04] oh joy
[10:15:26] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:15:27] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 14 copy to table, 191 statistics
[10:15:51] although, might be a symptom rather than cause. no MW connection errors yet. killing stuff
[10:15:56] (PS1) Akosiaris: Swap production puppetmasters to use db1001 [operations/puppet] - https://gerrit.wikimedia.org/r/96007
[10:16:07] ooohhhh
[10:16:16] RECOVERY - MySQL Processlist on db1040 is OK: OK 0 unauthenticated, 0 locked, 12 copy to table, 7 statistics
[10:16:21] * apergos wonders if sockpuppet is still the puppetca
[10:16:26] RECOVERY - MySQL Processlist on db1006 is OK: OK 0 unauthenticated, 0 locked, 8 copy to table, 16 statistics
[10:17:38] nope
[10:17:40] it is not
[10:17:44] ah
[10:17:56] and no longer you need to do those weird thingies
[10:18:00] with puppet --ca_server
[10:18:03] yep
[10:18:05] and then again , and then not
[10:18:22] and palladium is the (or a) salt master too?
[10:18:26] just run puppet, logging to palladium, sign the certificate and you are done
[10:18:36] hmmm i got to copy the new_install key ...
[10:19:09] I have made no plans to migrate salt master to palladium
[10:19:16] oh. nm then
[10:19:29] ryan was saying that we should have two
[10:19:32] one per DC
[10:19:34] yes, a syndic
[10:19:58] it's fine on sockpuppet for now
[10:20:06] I suppose it would make sense to reuse the puppetca server in order to avoid
[10:20:12] double CA work
[10:20:38] RECOVERY - Puppet freshness on bast1001 is OK: puppet ran at Mon Nov 18 10:20:34 UTC 2013
[10:21:04] (CR) Akosiaris: [C: 2] Swap production puppetmasters to use db1001 [operations/puppet] - https://gerrit.wikimedia.org/r/96007 (owner: Akosiaris)
[10:22:31] have anybody from ops seen https://bugzilla.wikimedia.org/show_bug.cgi?id=56769 ?
[10:24:12] MaxSem: seems like mathoid is timing out
[10:24:31] oh, mathoid is already live?
[10:24:43] and that is what I was going to ask
[10:25:02] well anyway the app server (whatever it is) times out
[10:25:35] (CR) Nemo bis: "After https://gerrit.wikimedia.org/r/96005 I wonder if it's easier to rebase this or to abandon and resubmit" [operations/puppet] - https://gerrit.wikimedia.org/r/95889 (owner: Nemo bis)
[10:26:09] MaxSem: https://gerrit.wikimedia.org/r/#/c/90733/ is not merged so I assume mathoid is not live
[10:29:39] argggghhhhhhhhh
[10:29:43] oh access denied
[10:29:45] booo
[10:29:49] that is mee
[10:29:51] sorry
[10:29:52] heh
[10:29:54] * MaxSem finds precisely zro wfProfileIn() calls in Math
[10:30:21] holy fuck
[10:30:41] yay
[10:30:59] even if we urgently slap a bunch of them and deploy we will not know if there's a perf regression
[10:31:13] nope
[10:31:46] PROBLEM - Puppetmaster HTTPS on palladium is CRITICAL: Connection refused
[10:31:51] err?
[10:31:56] yeah just saw that
[10:32:28] and again me
[10:32:39] figured as much
[10:32:40] the freaking change to db1001
[10:32:45] ah crap
[10:32:57] Access denied for user 'puppet'@'10.64.0.164'
[10:33:02] the users are there
[10:33:16] just like 'puppet'@'strontium.eqiad.wmnet'
[10:33:36] what the ?
[10:33:44] what's wrong?
[10:34:10] palladium/strontium can't connect to db10001
[10:34:14] what is going on with math?
[10:34:42] 504 Gateway Time-out
[10:34:46] RECOVERY - Puppetmaster HTTPS on palladium is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.943 second response time
[10:35:02] the only difference I see is that users are specified with fqdns and not ips
[10:35:04] IPs
[10:35:29] and I am wondering why it used to work on db9 and not on db1001 ...
[10:35:46] db1001 has skip_name_resolve on
[10:35:50] argh
[10:35:50] heh
[10:35:53] that would do it
[10:35:59] sigh...
[10:36:09] and d9 does not
[10:36:23] any idea why ?
[10:37:11] cause i see a lot of users in mysql.user with FQDNs and not IPs
[10:37:24] yeah I was noticing that
[10:37:33] (which seems odd to me, I'm used to having the ips in there)
[10:37:37] grants on misc suck in general. too long with too many people doing it ad-hoc
[10:38:54] ok I will redefine the users on db9 with IPs and db1001 will pick them up.
[10:40:28] (PS3) Nemo bis: Make the monthly querypages updates not hit each cluster on the same day [operations/puppet] - https://gerrit.wikimedia.org/r/95889
[10:40:31] I feel uneasy using IPs btw. I prefer hostnames.
[10:40:37] i don;t
[10:40:46] why ?
[10:41:01] hostnames make mysql use reverse dns, which can suddenly block all sorts of things if your dns plays up
[10:41:13] makes sense
[10:41:14] and it needs extra network round trips on connect
[10:41:41] i prefer them because they are easier to update and don't stay as stale as ips
[10:41:51] yup, trade off
[10:42:32] ingeneral I would agree with you (in the case where hostnames are not some obscure misc server name but a cname etc)
[10:42:36] but in this case...
[10:43:29] (PS4) Nemo bis: Make the monthly querypages updates not hit each cluster on the same day [operations/puppet] - https://gerrit.wikimedia.org/r/95889
[10:43:30] yeah i get the point. That is not why I am not arguing to change that
[10:50:51] excitingly, that S6 query spike before was not a traffic spike but a side effect of one of the slaves crashing then coming back up with cold caches :( i feel this could be a long night...
[10:52:35] springle: well it may (just may) cheer you up to know that puppet is not longer using db9. One less db there :-)
[10:52:50] that does cheer me up!
[10:54:27] * apergos clears fenari from the puppet freshness whines (hopefully)
[10:54:39] akosiaris: it may also cheer you up to know we're running userstat logging on all db boxes since last week with the aim of cleaning up then locking down grants (among other things)
[10:55:09] That really cheers me up :-) :-) :-)
[10:55:11] have to first find out which stuff is still used. pmtpa going away will make th ejob easier
[10:55:19] it sure will
[10:55:26] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Mon Nov 18 10:55:21 UTC 2013
[10:55:30] yay
[10:57:00] yes!!!!
[10:57:10] neon is applying configuration!!!
[10:57:15] sweet!
[10:57:22] so... The 7030 secs got me wondering
[10:57:24] and that's the last
[10:57:26] RECOVERY - Puppet freshness on neon is OK: puppet ran at Mon Nov 18 10:57:22 UTC 2013
[10:57:29] yes?
[10:57:55] turns out it was the latency of 35 ms between eqiad and pmtpa
[10:58:04] that is why is changed puppet to db1001
[10:58:22] which was needed sooner or later anyways
[10:58:29] yes :-)
[10:58:35] perfect!
[10:58:40] 7k secs was absurd
[10:58:48] so... activerecord sucks ...
[10:58:54] we need to move to puppetdb soon
[10:59:10] hm aren't there space issues?
[10:59:53] or have they fixed that?
[11:00:09] I really don't know... We should evaluate it at least
[11:00:31] Compiled catalog for neon.wikimedia.org in environment production in 324.21 seconds
[11:00:43] huh.. even that was above the default 300 secs timeout
[11:00:45] oh it's the dashboard piece with it,
[11:00:46] RECOVERY - puppet disabled on analytics1011 is OK: OK
[11:00:46] RECOVERY - SSH on analytics1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[11:00:46] RECOVERY - DPKG on analytics1011 is OK: All packages OK
[11:00:46] RECOVERY - RAID on analytics1011 is OK: OK: no disks configured for RAID
[11:00:51] so maybe we'll be ok
[11:00:56] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[11:01:06] RECOVERY - Disk space on analytics1011 is OK: DISK OK
[11:01:09] 324 is quite respectable
[11:01:29] also much closer to all my runs :-D
[11:02:55] moin
[11:03:14] coin
[11:03:16] :P
[11:03:46] amslvs1 holding up, good
[11:05:28] (PS1) Akosiaris: Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012
[11:05:48] (CR) Faidon Liambotis: "We typically disable HT in the BIOS, yes. In this case it doesn't matter much. That blog post's advice though is terrible, though :)" [operations/puppet] - https://gerrit.wikimedia.org/r/95963 (owner: Faidon Liambotis)
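[Annotation: on the skip_name_resolve fallout above (10:32-10:43), redefining hostname-based accounts as IP-based ones can be mechanized. A rough sketch; the account list, privilege level, and password handling are placeholders, not the real db9/db1001 grants.]

```python
#!/usr/bin/env python
"""Sketch: emit IP-based duplicates of hostname-based MySQL accounts.

With skip_name_resolve on, 'user'@'fqdn' entries never match, so each
one needs a 'user'@'ip' twin. Accounts and privileges are examples only.
"""
import socket

ACCOUNTS = [
    ('puppet', 'palladium.eqiad.wmnet'),
    ('puppet', 'strontium.eqiad.wmnet'),
]

for user, host in ACCOUNTS:
    ip = socket.gethostbyname(host)
    # the password hash would be copied from the existing mysql.user row
    print("CREATE USER '%s'@'%s' IDENTIFIED BY PASSWORD '<hash>';" % (user, ip))
    print("GRANT ALL PRIVILEGES ON puppet.* TO '%s'@'%s';" % (user, ip))
```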
[11:06:05] (CR) jenkins-bot: [V: -1] Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012 (owner: Akosiaris)
[11:10:30] (PS2) Akosiaris: Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012
[11:26:37] (PS1) Springle: try to stabilize innodb index cardinality [operations/puppet] - https://gerrit.wikimedia.org/r/96013
[11:29:03] (CR) Springle: [C: 2] try to stabilize innodb index cardinality [operations/puppet] - https://gerrit.wikimedia.org/r/96013 (owner: Springle)
[11:32:45] * paravoid smiles at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&servicestatustypes=28&hoststatustypes=3&serviceprops=2097162&nostatusheader
[11:32:51] almost there :)
[11:37:39] what's up with strontium?
[11:41:17] apergos: alex is on it, it's a check that shouldn't exist
[11:41:43] I guess it would make sense to check that the two back ends on 8141 exist
[11:41:47] but not this check
[11:41:51] right :)
[12:00:44] (PS1) Jack Phoenix: Correct capitalization of "ShoutWiki". [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/96018
[12:01:18] (CR) Akosiaris: [C: 2] Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012 (owner: Akosiaris)
[12:13:57] akosiaris1: can we finish up the ferm stuff for contint ?
[12:14:06] got to grab a snack first though
[12:15:51] hashar: yes
[12:16:06] reviewing smt, will soon get back to you
[12:16:23] akosiaris1: gonna get a snack, will be back in roughly 20 minutes
[12:16:33] I left a comment last night
[12:17:01] and jeremyb left another one too
[12:17:08] oh
[12:19:13] hi, any page documenting how application servers should look like?
[12:20:45] hashar: any ops around atm?
[12:22:13] AzaToth: hashar went to eat, there are some ops here
[12:23:10] matanya: https://gerrit.wikimedia.org/r/#/c/95424/ needs merge, build and aptified
[12:23:49] AzaToth: i'm not in the ops team :)
[12:24:13] matanya: then point me to someone I can poke til the end of time :-P
[12:24:59] AzaToth: I will not give in the ops :) the is a betray
[12:25:03] *that
[12:25:17] hehe
[12:25:18] (CR) Hashar: "Looks like I left some changes in my local repository, sorry :(" (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[12:27:29] (PS3) Hashar: contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162
[12:27:56] (PS4) Hashar: contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162
[12:28:46] (CR) Hashar: "I had to do a rebase (PS3). PS4 address issues from PS2: https://gerrit.wikimedia.org/r/#/c/95162/3..4/modules/contint/manifests/firewall" [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[12:28:55] now I get out to get something to eat :]
[12:49:59] (CR) Akosiaris: [C: 1] "LGTM in general, one small note" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95963 (owner: Faidon Liambotis)
[12:50:36] content => tempate('interface/enable-rps.conf.erb')
[12:50:38] heh I just noticed that
[12:50:53] I missed that entirely
[12:51:03] crappy puppet parser validate
[12:51:08] I remember mark saying that refresh on the upstart job doesn't work well
[12:51:11] not sure though
[12:51:22] could be... just asking
[12:51:29] it's a good question
[12:51:33] not like I trust upstart much
[12:57:56] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:00:46] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[13:02:24] I noticed template, I do have a question though about the assigning of the remainders
[13:02:44] in the case of 8 queues and 6 cpus (as an example) I'm not liking the outcome
[13:03:12] the two last queues being assigned to all cpus ?
[13:03:20] yep
[13:03:48] yes it is an interesting cornercase
[13:03:57] I'd rather give half the cpus to one and half to the other or some such
[13:05:53] an even better example is 5 cpus, 8 queues to demonstrate this
[13:06:08] yep
[13:06:11] albeit a non-existent scenario
[13:06:25] well that's why I stuck with an even number ;-)
[13:08:53] I could complain a little that the 'leftover' cpus don't get assigned in the case where the queues are less than the cpus but it's not a performance issue, where this case could be
[13:10:55] it's a good point
[13:11:04] I'm not sure if it's going to be a problem in practice though
[13:11:32] algorithmically we could take the remainder and slice the CPU array with it
[13:11:38] but it'd still could be unevenly sliced
[13:12:08] so we'd have to do yet another division basically and take the remainder again
[13:12:15] it's a bit messy
[13:15:47] you could walk through the list of cpus once, assigning 0 to first extra queue, 1 to second extra queue, 3rd to 3rd extra queue, wrapping around when you are out of extra queues... in the case of only 1 extra queue, all the cpus land on it, in the case of 2 extra queues, each get half, it won't be perfect in all cases but probably 'good enough'
[13:17:34] you could have more CPUs than queues too though
[13:18:03] (PS5) Faidon Liambotis: Replace Linux RPS setting with a smarter script [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[13:18:21] care to show it in code?
[13:18:26] it sounds interesting
[13:18:32] only talking about the end of the first if
[13:18:36] where instead of
[13:18:42] ah
[13:18:44] assign all cpus and hope
[13:18:45] we do that
[13:19:18] doesn't have to be the end of the first if, it could be the whole first if altogether
[13:19:34] yes
[13:20:09] lemme double check that mentally
[13:20:22] show me the code :)
[13:20:32] no
[13:20:32] it's very easy to simulate it
[13:20:39] it can't replace the full if
[13:20:51] cpu_list = range(0, N); rx_queues = range(0, M)
[13:20:55] only the end; the point is we walkt through the cpu list *once*
[13:20:59] then try your code, then print it
[13:22:05] (CR) Manybubbles: "Given that we're still running most of the wikis on lucene-search for the next few months I hate to see us lose monitoring for it. OTOH i" [operations/puppet] - https://gerrit.wikimedia.org/r/95971 (owner: Faidon Liambotis)
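[Annotation: the wrap-around idea apergos describes at 13:15 is easy to simulate along the lines paravoid suggests. A minimal sketch, not the actual interface-rps script from the change under review; the queue/CPU counts are the examples from the discussion.]

```python
def assign_rps(cpus, queues):
    """Spread RX queues over CPUs, handling remainders evenly."""
    n_cpus, n_queues = len(cpus), len(queues)
    mapping = dict((q, []) for q in queues)
    if n_queues <= n_cpus:
        # equal slices first, then hand leftover CPUs out round-robin
        per_q = n_cpus // n_queues
        for i, q in enumerate(queues):
            mapping[q] = list(cpus[i * per_q:(i + 1) * per_q])
        for j, cpu in enumerate(cpus[n_queues * per_q:]):
            mapping[queues[j % n_queues]].append(cpu)
    else:
        # one CPU per queue while they last ...
        for i in range(n_cpus):
            mapping[queues[i]].append(cpus[i])
        # ... then walk the CPU list once, wrapping over the extra queues
        extra = queues[n_cpus:]
        for j, cpu in enumerate(cpus):
            mapping[extra[j % len(extra)]].append(cpu)
    return mapping

if __name__ == '__main__':
    # the 6-CPU/8-queue cornercase from the discussion: the two extra
    # queues end up with half the CPUs each instead of all of them
    for q, c in sorted(assign_rps(range(6), range(8)).items()):
        print('queue %d -> cpus %s' % (q, c))
```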
[13:22:31] manybubbles: hey
[13:22:54] I'm willing to restore it, if you're willing to have a look at the error ;)
[13:23:08] hey, I saw that was already merged but I figure I'd add my two cents.
[13:23:08] zh wikis and a couple of others are saying FAILED
[13:23:12] nice
[13:23:17] for at least 158 days
[13:23:34] I'm willing to live without it given that we've been doing that for half a year
[13:23:34] I've pinged the channel a few times, noone really cares
[13:23:41] that was the second half of the comment
[13:23:44] not sure if you do :)
[13:24:05] I think I care more than most but not enough to spend much time on it.
[13:24:13] heh, fair enough
[13:24:29] we still have the other check in place, so at least we'll get notified if a daemon dies
[13:25:04] that was of a more high level check
[13:25:13] er
[13:25:16] more of a high level check
[13:25:17] damn :)
[13:39:21] (PS4) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[13:39:34] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[13:39:52] annoying to keep having to apply -1
[13:54:34] paravoid: this is what I meant http://pastebin.com/DERVdteP (ignoring if it's not nice code), just replaced the commented out stanza with the one above it
[13:57:10] akosiaris: trying to clean up certs that mistakenly got copied onto palladium, I get
[13:57:12] err: Could not call revoke: Could not find a serial number for blondel.pmtpa.wmnet.pem
[13:57:29] the cert is there and I guess some other openssl ca file is missing
[13:58:07] paravoid: it is the nudge again :)
[13:58:25] apergos: hmmm
[13:58:27] lemme check
[13:59:15] apergos: puppetca revoke blondel.pmtpa.wmnet
[13:59:15] notice: Revoked certificate with serial 1136
[13:59:17] ?
[14:00:15] hashar: https://github.com/davido/gerrit-wip-plugin might be wanted?
[14:00:24] The .pem suffix i guess ?
[14:02:31] AzaToth: Openstack (#openstack-infra) has a WIP hack on their Gerrit 2.4. Maybe they are founding that development.
[14:02:54] AzaToth: feel free to ask for it by filling a bug against Wikimedia > git/gerrit
[14:03:18] AzaToth: We kind of have that through git review -D but we can NOT go back and forth between the two states. Seems like though it needs "the new change screen". Whatever that is
[14:04:52] on palladium right?
[14:04:56] AzaToth: got confirmation that the plugin is being written for OpenStack so they can upgrade their Gerrit installation
[14:05:05] apergos: yes
[14:05:07] oh I did 'puppetca clean blahblah
[14:05:24] akosiaris: https://groups.google.com/forum/#!topic/repo-discuss/6V769w5zQok "It is not related to DRAFT change/patch set at all. This is something completely different:"
[14:05:26] this is what failed, it should revoke as part of its cleanup
[14:05:30] hashar: I see
[14:06:14] AzaToth: so yeah feature request it, I am sure the VisualEditor team will be happy to have it
[14:06:17] cp1036.wikimedia.org.pem here's another one for you to test with
[14:06:23] k
[14:06:24] same symptom for puppetca clean
[14:07:13] manybubbles elastic search is ready for you. forgot to email yesterdy
[14:07:14] AzaToth: indeed. That for the link.
[14:07:24] apergos: puppetca clean cp1036.wikimedia.org
[14:07:25] notice: Revoked certificate with serial 941
[14:07:25] notice: Removing file Puppet::SSL::Certificate cp1036.wikimedia.org at '/var/lib/puppet/server/ssl/ca/signed/cp1036.wikimedia.org.pem'
[14:07:25] notice: Removing file Puppet::SSL::Certificate cp1036.wikimedia.org at '/var/lib/puppet/server/ssl/certs/cp1036.wikimedia.org.pem'
[14:07:34] stop using the .pem suffix :P
[14:07:35] cmjohnson1: you are my hero!
[14:07:35] ok I'm dumb
[14:07:42] yep saw it just as you typd it :-D
[14:07:49] sorry to waste cycles
[14:07:52] ottomata: can you do the mount points?
[14:07:54] no worries :-)
[14:07:59] ottomata and cmjohnson1: I'm excited!
[14:08:48] akosiaris: what prefix do you want?
[14:08:50] cmjohnson1: can you email which servers are on which racks so I can make a decision about which should be masters?
[14:09:00] k
[14:09:04] .pem is pretty much standard
[14:09:29] AzaToth: yes but nothing to do with puppetca commands
[14:09:47] it takes node names as an argument
[14:09:50] not .pem files
[14:10:30] and I do this from time to time and usually see it immediately
[14:10:31] akosiaris: I don't think I really understand
[14:10:41] which means it's time to Do Something Else for a few minutes
[14:10:44] akosiaris: who should stop using .pem suffix, and why?
[14:11:17] who/what
[14:11:45] AzaToth: he was talking to me
[14:11:46] AzaToth: ok... misunderstanding here... So puppetca commands need node names as arguments. apergos was providing pem files as arguments which conveniently are named .pem
[14:11:56] does this make sense now ?
[14:12:20] and now my error has wasted even more cycles :-D
[14:12:22] ah
[14:12:25] lol
[14:12:26] true
[14:12:35] hehe
[14:13:33] so, any ops around?
[14:13:42] yes
[14:13:53] not a one!
[14:13:57] snicker
[14:14:09] I've no idea whom is ops or not :(
[14:14:22] * apergos raises hand
[14:14:27] * apergos raises akosiaris' hand too
[14:14:30] * akosiaris does too
[14:14:31] 'lol
[14:14:42] akosiaris, apergos: https://gerrit.wikimedia.org/r/#/c/95424/ needs merge, build and aptified
[14:15:24] ok... inserting it in my queue
[14:15:32] ETA required ?
[14:15:39] have you been doing these generally?
[14:15:46] more or less
[14:15:50] ok
[14:15:54] not this specific one
[14:16:02] but hell... how different is it going to be ?
[14:16:08] how much*
[14:16:08] right
[14:16:17] (famous last words)
[14:16:43] hahaha
[14:17:00] ok. 'something different' is going to involve food.
[14:17:02] brb
[14:17:31] akosiaris: after it's installed, I can probably try to build buck again :-P
[14:17:38] on jenkins
[14:18:15] akosiaris: are you planning to add graphios to your nagios/icinga refactor?
[14:18:34] ok, so it is a blocker. Will try to have it ready soon
[14:20:02] matanya: definitely not in a refactoring change. New features must have their own changes
[14:20:58] akosiaris: and in general? would that be a useful feature?
[14:21:56] I 've never seen it being really useful up to now. There are many solutions out there using nagios performance data and all rely on the fact that you do not already have a performance monitoring solution in place
[14:22:11] we already have both ganglia and graphite
[14:22:41] and ganglia has a lot better resolution anyway
[14:22:57] good answer
[14:26:23] so what do you think about going to puppet every half hour now, as we talked about?
[14:26:56] good enough
[14:27:00] let's try it out
[14:28:52] helllooooo
[14:29:50] (CR) Dr0ptp4kt: "Good questions. I'm glad you asked them." [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
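[Annotation: as akosiaris explains above, puppetca wants bare node names, not .pem file names. A hypothetical convenience wrapper could make the argument shape irrelevant; this is a sketch, not an existing tool.]

```python
#!/usr/bin/env python
"""Hypothetical wrapper: let `puppetca clean` accept .pem paths too."""
import os.path
import subprocess
import sys

for arg in sys.argv[1:]:
    node = os.path.basename(arg)
    if node.endswith('.pem'):
        node = node[:-len('.pem')]   # puppetca wants the node name
    subprocess.check_call(['puppetca', 'clean', node])
```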
[14:29:57] (PS1) ArielGlenn: Revert "run puppet once an hour instead of every half hour" [operations/puppet] - https://gerrit.wikimedia.org/r/96025
[14:30:30] (PS2) ArielGlenn: Revert "run puppet once an hour instead of every half hour" [operations/puppet] - https://gerrit.wikimedia.org/r/96025
[14:32:27] ottomata: hello
[14:32:31] haha, who is on RT duty?
[14:32:35] i was not on RT duty last week
[14:32:41] and the topic still says I am this week too!
[14:33:34] (CR) ArielGlenn: [C: 2] Revert "run puppet once an hour instead of every half hour" [operations/puppet] - https://gerrit.wikimedia.org/r/96025 (owner: ArielGlenn)
[14:33:53] ottomata: you are on for ever
[14:34:28] ottomata: bad luck
[14:35:08] (PS2) Dr0ptp4kt: Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532
[14:35:38] aww maaan
[14:35:39] ok, in two hours we can see if palladium/strontium are falling over yet
[14:35:52] :-)
[14:35:56] manybubbles: servers are in eh?
[14:35:57] I would expect them to hold up very well
[14:36:12] huh ... i am not so certain
[14:36:34] well they are both r610s with 16gb ram right? which stafford is as well
[14:36:46] ^paravoid see https://gerrit.wikimedia.org/r/95532 and my comments. would you please review and +2 and deploy?
[14:36:55] so two of them ought to be able to handle runs twice as often
[14:37:10] they only have 8 CPUs
[14:37:15] not 16 like stafford
[14:37:21] hrm
[14:37:29] that could be an issue
[14:37:38] well there's always revert revert :-D
[14:38:01] yeah and thankfully now we just add another machine and we are ok
[14:38:14] indeed
[14:38:22] because it would be nice to have us back at every half hour
[14:38:28] dr0ptp4kt: could you wrap the commit message?
[14:38:40] paravoid, yup one sec
[14:38:49] also, typo: UAsaccessing
[14:39:00] MaxSem: around?
[14:39:55] (PS3) Dr0ptp4kt: Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532
[14:40:20] paravoid, thanks, fixed typo, too. just resubmitted ^^
[14:41:18] paravoid, yep
[14:41:35] https://gerrit.wikimedia.org/r/#/c/95532/ ?
[14:41:39] yup :)
[14:42:00] you're the original author of that regexp iirc
[14:42:17] and in any case, getting mobile web's take on this sounds like a good idea :)
[14:42:29] nope, that was Asher or Patrick
[14:42:39] oh, heh
[14:43:59] ottomata: yeah!
[14:44:01] hm
[14:44:21] ottomata: I still can't get that code to run properly on my test instance though. that salt issue
[14:44:24] needs some thinking
[14:44:51] see that, manybubbles, getting through monday morning emails, responding to a couple of review things
[14:45:00] then will check that out and see if we can create disk partitions for ya
[14:45:20] thanks!
[14:46:02] (PS5) Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - https://gerrit.wikimedia.org/r/95720
[14:48:20] ottomata: still not able to get puppet to run right on an boxes....tried an1013
[14:48:38] yeah
[14:48:42] networking still doesn't work cmjohnson1
[14:48:50] ah I've been watching those
[14:49:02] https://rt.wikimedia.org/Ticket/Display.html?id=6279
[14:49:06] MaxSem: are you thinking about it now or did you postpone it for some other time or...?
[14:49:15] we need LeslieCarr to fix the ACL for the new subnet
[14:49:18] thinking
[14:49:45] ottomata: looks like I've got other problems on elasticsearch-puppet-tester: load average: 31.49, 17.32, 13.87
[14:49:47] rebooting....
[14:50:01] apergos: get this error http://p.defau.lt/?WuBuszbGwT0R4YHCMhkdmg
[14:50:15] the only problem I can invent about it is that some bot will be redirected to mobile
[14:50:30] cmjohnson1: DNS is not working
[14:50:30] well no ping is going to mean no a lot of other tings too
[14:50:32] *things
[14:50:46] yes, but dr0ptp4kt grepped for such UAs
[14:50:47] also network ?
[14:50:54] on the other hand, proper wiki bots should be hitting api.php and not get redirected
[14:50:55] (apparently)
[14:51:07] if they screen-scrape it is their problem
[14:51:14] paravoid, MaxSem, i'll gist a search script in a moment
[14:51:23] (CR) Ottomata: Setting up varnishkafka on mobile varnish caches (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/94169 (owner: Ottomata)
[14:51:31] couldn't think of search bots that could be broken
[14:51:38] MaxSem, i generally agree...as long as reputable search engines don't encounter problems
[14:51:44] (CR) MaxSem: [C: 1] Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
[14:51:56] paravoid, ^^^
[14:51:59] hashar: err: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/base/firewall/defs.labs at /etc/puppet/modules/ferm/manifests/conf.pp:15
[14:52:07] :-((((
[14:52:11] when testing contint in labs... not your fault though
[14:52:12] looking into it
[14:53:42] (CR) Faidon Liambotis: [C: 2] Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
[14:54:08] dr0ptp4kt: should be live within the next hour
[14:56:04] ottomata: 300000ms = 5 minutes
[14:56:35] ottomata: if it takes 5 minutes for a message to get from esams to eqiad, we're kinda fucked, aren't we
[14:59:09] paravoid, MaxSem, thanks.
[14:59:28] (CR) Dr0ptp4kt: "For analysis use https://gist.github.com/dr0ptp4kt/7529136 on the sampled-1000 log files." [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
[15:00:02] (CR) Ottomata: Puppet configuration for new elasticsearch servers (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[15:00:28] paravoid: ja, we were just overkilling to make sure we didn't miss something
[15:04:54] ottomata: I can make that change but that common config thing is kind of the opposite of how I was told to do it months ago. I can probably just have a single role if I go that way - no config or anything.
[15:05:16] you can do what you like best, but i think the big issue there is parameterized role classes
[15:07:19] manybubbles: don't make the config class change thing if you don't think you should, or others told you not to
[15:07:39] i find it very convenient, and that's how it looks like the cache.pp role classes work too
[15:07:55] but ja, no parameters on role classes, so that you can use them in labs
[15:07:56] ottomata: that was months ago, so I imagine the advice is out of date. Let me see how it looks
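[Annotation: dr0ptp4kt's gist does the real analysis of the sampled-1000 logs; for flavor, a toy tally of the bot UA in question could look like the sketch below. The file handling and substring match are assumptions, not the gist's actual logic.]

```python
#!/usr/bin/env python
"""Toy UA counter over sampled request logs (not dr0ptp4kt's gist).

Reads log lines from files given on the command line (or stdin); the
user-agent is simply substring-matched, which is enough for a tally.
"""
import fileinput

PATTERN = 'Googlebot-Mobile'   # the UA discussed in the change

hits = total = 0
for line in fileinput.input():
    total += 1
    if PATTERN in line:
        hits += 1
print('%d of %d sampled requests matched %s' % (hits, total, PATTERN))
```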
[15:08:11] manybubbles: it might not be, people give different advice on this :p
[15:08:55] but, i don't actually care how you structure your role classes, I think that is up to your judgement mostly,
[15:09:04] just gotta get rid of the parameters :)
[15:17:31] ah Jeff_Green: are barium and grosley frack (i.e. served by frack puppet)?
[15:17:43] still working my way through last of cleanup list
[15:18:08] apergos: yes. barium is physically in frack, grosley is still in pmtpa's main cluster but tied to frack puppet
[15:18:40] so for grosley do we serve dhcp for it still?
[15:18:46] where by 'we' I mean 'not you'
[15:18:50] hahaha
[15:18:59] it calls my home phone and asks for an IP
[15:19:02] hahaha
[15:19:10] technically yes, but you can tear that down if you want
[15:19:18] nah, I'll leave it in for now
[15:19:28] thanks!
[15:19:34] it's got it's address and we'll slay it as soon as the fundraising goal is met
[15:19:38] ty
[15:19:47] when it's slain we'll kill it good :-)
[15:19:57] yay!
[15:26:18] (PS1) ArielGlenn: remove barium from everywhere and grosley mostly [operations/puppet] - https://gerrit.wikimedia.org/r/96031
[15:30:33] (CR) ArielGlenn: [C: 2] remove barium from everywhere and grosley mostly [operations/puppet] - https://gerrit.wikimedia.org/r/96031 (owner: ArielGlenn)
[15:36:07] (PS1) ArielGlenn: remove last trace of maurus (decommed, last sighted in 2006) [operations/dns] - https://gerrit.wikimedia.org/r/96033
[15:36:31] ottomata: I think I like your way better. all the configuration across all sites fits on one screen
[15:36:46] (CR) ArielGlenn: [C: 2] remove last trace of maurus (decommed, last sighted in 2006) [operations/dns] - https://gerrit.wikimedia.org/r/96033 (owner: ArielGlenn)
[15:41:33] (CR) Akosiaris: [C: -1] "The localhost thingy as said inline does not work." (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[15:41:42] hashar: ^^
[15:41:50] I think I lost my original patch
[15:41:53] or forgot to send it :(
[15:42:47] ahh
[15:42:48] no
[15:43:04] manybubbles: i like it to, because, say you had some other random puppetization (monitoring somewhere? something else?) that needed to know somethign about elastic search configuration
[15:43:05] * hashar blames labs
[15:43:11] but didn't want to actually install elastic search
[15:43:14] *too*
[15:43:21] you could then include just he config class and do stuff with the variables
[15:43:38] akosiaris: that seems to be a labs issue, there is no IPv6 defined for localhost: ::1 ip6-localhost ip6-loopback
[15:43:45] ah, I haven't actually made a ::config thing yet - though it'd be super easy to do. let me just do that anyway
[15:44:06] hashar: not a labs issue...
[15:44:28] either way!
[15:44:35] i mean, i'm not sure my way is the best
[15:44:39] there are many ways to do it
[15:45:25] ottomata: it is like 6 more lines - even if we aren't going to need it it'll force all the configs into a specific place for organization
[15:46:24] (PS1) ArielGlenn: salt for streber just like every other host [operations/puppet] - https://gerrit.wikimedia.org/r/96034
[15:46:47] manybubbles: i'm not in your search labs project
[15:46:50] so I can't log into those hosts
[15:47:00] let me fix it
[15:47:37] (CR) ArielGlenn: [C: 2] salt for streber just like every other host [operations/puppet] - https://gerrit.wikimedia.org/r/96034 (owner: ArielGlenn)
[15:47:50] ottomata: are you Ottomata in labs?
[15:48:06] I've added you if you are
[15:48:07] yes
[15:49:23] manybubbles: also, nc: getaddrinfo: Name or service not known
[15:49:29] for elasticsearch-puppet-tests
[15:49:45] elasticsearch-puppet-tester
[15:49:47] sorry
[15:49:54] (CR) Hashar: "Sounds like a problem with /etc/hosts in labs ::1 is not mapped to `localhost`. Will amend with explicit IP v4 and v6 addresses (option c" [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[15:51:02] manybubbles: elasticsearch-puppet-tester
[15:51:02] has role::elasticsearch included
[15:51:07] instead of role::elasticsearch
[15:51:09] role::elasticsearch::labs
[15:51:17] is that correct?
[15:51:18] 'cause I'm making your changes
[15:51:33] ah ok awesome
[15:51:36] hehe
[15:51:36] (PS5) Hashar: contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162
[15:51:48] ok i'll wait til you are done with that before I look into your salt problem them
[15:51:49] then
[15:51:53] oook , on to data partitions!
[15:51:59] (PS6) Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - https://gerrit.wikimedia.org/r/95720
[15:52:00] infact
[15:52:03] ^^^
[15:52:25] (CR) Manybubbles: Puppet configuration for new elasticsearch servers (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[15:55:54] (CR) Akosiaris: [C: 2] contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[15:56:08] hashar: when do you want to merge ? Can I do it now ?
[15:57:47] (PS1) QChris: Enforce clone ownership for geowiki clones [operations/puppet] - https://gerrit.wikimedia.org/r/96037
[15:58:36] (CR) Ottomata: "Aside from that LGTM" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[15:59:13] akosiaris: yup lets do it
[15:59:24] at worth we loose ssh access on the box :D
[15:59:43] hmmm
[15:59:54] (PS7) Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - https://gerrit.wikimedia.org/r/95720
[15:59:56] in case the ferm rule deny us ssh
[15:59:58] well we always have out of band :-)
[16:00:18] ottomata: so as far as testing my puppet change, I'm still trying to get it working on that tester, but otherwise it seems right
[16:00:24] its just that salt grain
[16:00:44] ok
[16:00:45] also, now that you are happy with the puppet change I can leave the puppet tester machine alone and you can poke it
[16:00:51] if you want
[16:00:53] :)
[16:00:54] can I run puppet there?
[16:00:55] k cool
[16:01:36] hashar: done
[16:02:09] manybubbles: you saw my one comment about the escaped variable in the message?
[16:02:28] ottomata: yeah, and I pushed that
[16:02:36] for review, that is
[16:02:48] oh ok sorry
[16:02:48] danke
[16:03:09] (CR) Manybubbles: Puppet configuration for new elasticsearch servers (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[16:03:25] didn't comment that I
[16:03:30] I'd done it. just did
[16:04:00] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[16:05:56] akosiaris: have you run puppet?
running it rightnow
[16:06:48] ok
[16:06:57] !log gallium: running puppet to apply ferm firewall configuration from {{gerrit|95162}}
[16:07:10] Logged the message, Master
[16:08:06] manybubbles: i added a site.pp entry last night for elastic...remove if not needed.
[16:08:22] more a fyi..so there are no conflicts
[16:08:38] cmjohnson1: ah, sure. let me rebase my puppet changes so they make sense
[16:10:07] (CR) Hashar: "Ran puppet." [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[16:10:14] akosiaris: I see that palladium and stronium are working pretty hard
[16:10:20] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=palladium.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[16:10:26] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=strontium.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[16:10:30] yes they are
[16:10:34] i am monitoring them
[16:10:43] thoughts?
[16:10:46] puppet 3
[16:10:50] is my thought
[16:10:52] huh
[16:11:11] puppet 3 will require a lot of rewritting before we go there
[16:11:24] ok but a short-term thought is more what I had in mind :-)
[16:11:26] short term it looks fine
[16:11:46] ottomata: so with the mount point, you should probably shut down each node as you go
[16:11:56] there's some 100% spikes here and there
[16:12:08] a lot of unqualified variables
[16:12:19] deprecated behaviour in puppet 3
[16:12:22] shut down each?
[16:12:24] it was deprecated in 2.7
[16:12:31] these nodes aren't in produciton yet, are they?
[16:12:32] yes, which is why we get the warninfs
[16:12:37] *warnings
[16:12:38] looks like they are
[16:12:40] akosiaris: we get warnigns for these though
[16:12:41] surprise!
[16:12:44] so it's easy to spot them
[16:12:49] they are out there
[16:12:55] the unqualified variables in puppet are quite easy to fix though. You can even monitor them by grepping in the syslog
[16:12:56] elastic search is on them?
[16:13:04] ottomata: cmjohnson1 put them in production last night with his puppet change
[16:13:07] bahh
[16:13:07] ha
[16:13:08] ok
[16:13:16] can we wipe it?
[16:13:17] probably not intentionally
[16:13:17] ahhh
[16:13:18] oits ok
[16:13:22] one at a time is ok
[16:13:26] hm
[16:13:27] ok
[16:13:31] so, turn off es
[16:13:39] just on one machine
[16:13:40] do I need to copy existing data?
[16:13:44] no
[16:13:44] it's easy to find them hashar, it's not always easy to fix them, depending on how your classes are writte
[16:13:45] ok
[16:13:46] cool
[16:13:46] n
[16:14:02] /var/lib/elasticsearch ?
[16:15:50] ferm::conf { 'main':
[16:15:50] ensure => present,
[16:15:50] prio => '00',
[16:15:50] # we also have a default DROP around, postpone its usage for later
[16:15:50] source => 'puppet:///modules/base/firewall/main-minimal.conf',
[16:15:50] }
[16:15:56] the postpone still holds ?
[16:16:18] yes, since we used ferm in at least one host (git.d.o) without a proper ruleset
[16:16:25] erm
[16:16:29] .wm.o even :)
[16:16:52] fix that and switch to default DROP
[16:17:11] kind of got lost
[16:17:20] sorry
[16:17:22] git.wikimedia.org has ferm applied ?
[16:17:25] yes
[16:17:28] but not correctly ?
[16:17:30] but only with a single ACCEPT/DROP rule [16:17:42] so it assumes that whatever is not covered is going to be accepted [16:17:44] (iirc) [16:17:52] if you switch to default DROP, you may drop real traffic [16:18:04] so you need to make sure that whatever we use on that box is allowed by the firewall [16:18:43] ah, that is what you mean... ok [16:18:56] but in concept, yes, I'm all for it [16:20:25] 288 dynamic lookup errors in 59 files [16:20:27] not too bad [16:20:53] (03PS1) 10Hashar: contint: firewall out ssh access (restrict to bastion) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 [16:21:01] (03CR) 10Hashar: "Ssh filtering did not work because the default policy for the INPUT chain is ACCEPT. Filtering proposed with https://gerrit.wikimedia.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95162 (owner: 10Hashar) [16:21:18] (03CR) 10Hashar: "Follow up https://gerrit.wikimedia.org/r/#/c/95162/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 (owner: 10Hashar) [16:21:47] akosiaris: https://gerrit.wikimedia.org/r/#/c/96040/ should restrict ssh access on gallium.wikimedia.org from bastion only. [16:22:49] any volunteers to go through that list and qualify variables? :-) [16:25:10] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:11] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:31] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:31] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:40] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:40] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:40] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:49] manybubbles: ^^^ [16:25:50] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:50] PROBLEM - ElasticSearch health check on testsearch1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:50] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:50] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:26:00] oh my [16:26:00] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:26:01] PROBLEM - ElasticSearch health check on testsearch1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:26:18] manybubbles: that message is not obvious at all. But you can figure that out later on :D [16:26:28] I have the gid one on my list as part of the account rewrite [16:26:34] I can look at some others too, why not [16:26:43] also that's a lot of icinga spam [16:27:02] hashar: will file a bug about that.... can someone ack that and I'll figure it out [16:27:12] apergos: palladium:/root/deprecated-u [16:27:13] I can't ack :/ [16:27:21] apergos: but let's not wait for the account rewrite [16:27:39] no, I'll do some of them right away (i.e. 
starting tomorrow) [16:27:52] and the account rewrite I want to do with an LDAP export at some point, we agreed to that a few months ago [16:27:54] it will get added to my dailies [16:29:13] :-) [16:30:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 16:29:52 UTC 2013 [16:30:00] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [16:30:13] decommed? :-D [16:30:30] what is? [16:30:56] just watching dysprosium's behavior [16:31:11] * apergos places a mental bet [16:31:23] and wins [16:31:38] active checks enabled for puppet freshness, who knows why. [16:32:45] fixed [16:33:33] (03CR) 10Addshore: "Looks fine in relation to using the build and the specific wmg var used." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [16:34:11] and won bet #2, it is indeed in decommissioned.pp [16:35:08] 3312d2f11f99b1f0d35b340da2dfdaed05564628 Jul 12 2013 [16:36:21] the open rt ticket to find out its status is stalled, I will meh... at least it should become a spare [16:37:56] (03PS1) 10Milimetric: [not ready for review] Productionizing Wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 [16:45:47] I am off, see you later tonight [16:46:00] * greg-g waves [16:52:17] akosiaris: I'm doing more puppet doc updates… is it safe to say that all references to 'sockpuppet' can be replaced with 'palladium'? [16:52:53] (03PS2) 10Ottomata: Enforce clone ownership for geowiki clones [operations/puppet] - 10https://gerrit.wikimedia.org/r/96037 (owner: 10QChris) [16:52:58] (03CR) 10Ottomata: [C: 032 V: 032] Enforce clone ownership for geowiki clones [operations/puppet] - 10https://gerrit.wikimedia.org/r/96037 (owner: 10QChris) [16:54:17] andrewbogott: btw, see above for puppet deprecated errors; might be useful to have this in mind when modularizing [16:55:17] (03PS1) 10Jgreen: remove civicrm dev sites from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96051 [16:55:54] oof, any reason I shouldn't just remove any docs about how to manage solaris boxes? [16:59:53] is gerrit/jenkins known-fail at the moment? [17:00:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 16:59:58 UTC 2013 [17:00:03] oop there it is [17:00:15] (03CR) 10Jgreen: [C: 032 V: 031] remove civicrm dev sites from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96051 (owner: 10Jgreen) [17:01:10] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:10] RECOVERY - ElasticSearch health check on testsearch1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:24] ^d and ottomata: so what happened with elasticsearch is this: a long while back we set the default number of replicas to 0 and configured prod to use 2. [17:01:30] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:31] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:35] well that didn't take and we didn't have something to warn us of that [17:01:40] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:40] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:40] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:41] now icinga-wm will spam us [17:01:50] RECOVERY - ElasticSearch health check on testsearch1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:02:00] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:02:00] RECOVERY - ElasticSearch health check on testsearch1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:02:10] ^d and ottomata: so we lost all our redundancy [17:02:17] uh ohhh [17:02:31] and this morning when ottomata went to move data to a new partition we turned off a node [17:02:31] sooo, did I actually delete data when I trashed /var/lib/es on es1001? [17:02:37] yup [17:02:40] yay! [17:02:44] but that shouldn't have happened [17:02:55] so we'll have to rebuild all those wikis from scratch now [17:03:00] but we were going to do that anyway [17:03:05] bwerp ok [17:03:12] AND we'll need to add the redundancy back today [17:03:18] because, wtf, where did it go? [17:03:32] k [17:03:36] I've updated https://wikitech.wikimedia.org/wiki/Search/New#Stuck_in_red [17:03:55] with some bash/awk/curl foo to get all the unassigned shards assigned again [17:04:10] es1001? [17:04:16] please tell me you meant elastic1001 [17:04:37] yeah, elastic10XX and testsearch100X [17:04:39] and if so, let's avoid the es notation, it'll get way too confusing [17:04:56] there is an es1001 too and it's a different thing [17:04:58] (external store) [17:04:59] /var/lib/elasticsearch is the dir anyway [17:05:03] oh my [17:05:18] <^d> (This is why I said don't use the es name in that e-mail thread :)) [17:05:26] let me go and figure out what happened to our redundancy [17:05:43] :) [17:08:06] hm ok, paravoid that was my irc abbrev [17:08:11] yeah I figured [17:08:20] but "trashed" es1001 sounds omg so bad :) [17:08:23] haha [17:08:49] yes it sure does [17:08:55] paravoid, i have a puppet dependency cycle I don't get [17:08:55] (Exec[git_pull_geowiki-data-private] => Git::Clone[geowiki-data-private] => File[/a/geowiki/data-private] => Exec[git_pull_geowiki-data-private]) [17:08:58] that would give several people a quick heart attack [17:09:51] git::clone { 'geowiki-data-private': [17:09:51] directory => $geowiki_private_data_path, [17:09:51] ... [17:09:51] } [17:09:51] and [17:09:52] file { "$geowiki_private_data_path": [17:09:52] ensure => directory, [17:09:53] require => Git::Clone['geowiki-data-private'], [17:09:53] … [17:09:54] } [17:10:02] <^d> !paste | ottomata [17:10:11] I think you don't need the File [17:10:14] git::clone does that [17:10:18] it doesn't [17:10:20] ^d: I actually prefer it that way :-) [17:10:20] i thought it did too [17:10:30] it does implicitly [17:10:32] i was going to change it, but then I was afraid I would break other things [17:10:40] no, i don't think it does? [17:10:43] for example [17:10:45] when you do "git clone git://foo.git" git will create "foo" [17:10:48] git::clone has a mode param [17:10:50] but it doesn't do anything [17:10:54] yeah but we're worried about permissions [17:11:08] <^d> git::clone sucks.
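The "bash/awk/curl foo" mentioned above lives on the wikitech page; a hedged reconstruction of the idea, with python standing in for the awk part and host/node names as placeholders, looks roughly like this:

```bash
# Sketch of the Search/New#Stuck_in_red helper, not the authoritative
# version: list every UNASSIGNED shard from the cluster state, then
# force-allocate each onto a chosen node via the reroute API.
ES=http://elastic1001:9200   # any cluster member
NODE=elastic1002             # where to force-allocate
curl -s "$ES/_cluster/state" | python -c '
import json, sys
state = json.load(sys.stdin)
for s in state["routing_nodes"]["unassigned"]:
    sys.stdout.write("%s %s\n" % (s["index"], s["shard"]))
' | while read index shard; do
  # allow_primary concedes data loss for shards whose primary copy is
  # gone -- tolerable here, since the plan above is to rebuild the
  # affected wikis from scratch anyway
  curl -s -XPOST "$ES/_cluster/reroute" -d "{
    \"commands\": [ { \"allocate\": {
      \"index\": \"$index\", \"shard\": $shard,
      \"node\": \"$NODE\", \"allow_primary\": true } } ] }"
done
```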
[17:11:19] it does [17:11:38] paravoid, it isn't even used though [17:11:41] yup [17:11:46] it's broken [17:11:49] would it solve this if I implement the mode param right now? [17:11:52] yes [17:11:53] :) [17:11:56] i was going to do that too [17:12:03] oh, well, don't let me stop you... [17:12:06] buuuut, i was afraid i'd break stuff with duplicate File definition errors [17:12:07] haha [17:12:15] (03PS1) 10ArielGlenn: remove technetium and tingxi [operations/dns] - 10https://gerrit.wikimedia.org/r/96055 [17:12:21] since there are people that reference the $directory outside of git::clone [17:12:30] I don't immediately know how to do it, but I'll look. [17:12:30] could wrap them all in an if defined(...) [17:12:34] "git grep" is your friend [17:12:45] oh you want to do it in the exec? :p [17:12:50] hmmmm [17:12:53] no [17:12:58] I meant to fix the call sites [17:13:11] ? [17:13:14] why git grep then? [17:13:19] git grep git::clone [17:13:22] (03CR) 10ArielGlenn: [C: 032] remove technetium and tingxi [operations/dns] - 10https://gerrit.wikimedia.org/r/96055 (owner: 10ArielGlenn) [17:13:27] ohohoh [17:13:27] sorry [17:13:47] for files /in/ a repo, mode is managed by git, right? [17:13:51] thought you were suggesting some crazy git grep + exec or something to fix perms [17:13:53] So we're only talking about the top-level dir? [17:13:56] no :) [17:14:00] andrewbogott: yes [17:14:04] yes [17:14:31] yeah just do a file { $directory: ensure => directory, require => git_clone_$title …} something in there [17:14:37] but you have to check all of the callers of git::clone [17:14:53] and either remove any place that the file { $directory is used [17:15:04] and/or wrap them all in an if (defined(…)) block [17:15:49] who's stopping you? :) [17:16:23] baaaaaaaaaaaaaa because I didn't wanna [17:16:24] OKOOK [17:16:26] I WILL DO IT [17:16:28] ottomata: can just use a different path and title [17:16:29] qchris: ^^ :p [17:16:36] ? [17:16:45] I'm pretty sure. Give me a second, I'll show you [17:16:54] that's what the 4 people before you said and that's how we ended up having this discussion :P [17:17:47] (03PS1) 10Chad: Clean up replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 [17:17:56] <^d> manybubbles: ^ [17:18:30] (03PS2) 10Chad: Clean up replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 [17:18:57] (03PS3) 10Chad: Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 [17:20:06] (03CR) 10Manybubbles: Clean up elasticsearch replica configuration for all wikis (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:20:13] (03CR) 10Manybubbles: [C: 031] Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:20:24] ^d: ^^^ [17:20:27] I don't have +2 [17:20:34] but is less broken, I think [17:20:41] <^d> I shall fix that later too. [17:20:58] (03CR) 10Chad: [C: 032] Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:21:07] (03Merged) 10jenkins-bot: Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:22:00] <^d> Reedy: You have uncommitted changes to docroots on tin.
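The caller audit paravoid describes above is a one-liner; only the repo paths are assumed here:

```bash
# Every call site of git::clone in operations/puppet -- each hit needs
# checking for a competing file { $directory } resource, which must be
# removed or wrapped in if !defined(File[...]) { ... } once git::clone
# manages the directory itself (otherwise: duplicate definition errors).
git grep -n 'git::clone' -- manifests modules
```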
[17:22:18] ergh [17:22:20] paravoid [17:22:22] i stopped doing this before [17:22:32] because labs_vagrant class has [17:22:40] file { '/mnt/vagrant': [17:22:40] recurse => true, [17:22:41] owner => 'vagrant', [17:22:41] group => 'www-data', [17:22:48] and git::clone { 'vagrant': [17:22:48] directory => '/mnt/vagrant/', [17:22:48] (03PS1) 10Andrew Bogott: Implement the $mode param for git::clone [operations/puppet] - 10https://gerrit.wikimedia.org/r/96057 [17:22:52] recurse true! [17:22:55] grr [17:23:12] ottomata: I mean like ^ [17:23:17] Although I haven't tested it yet :) [17:23:25] andrewbogott: not sure about that, but maybe [17:23:29] also, don't forget owner and group [17:23:39] !log demon updated /a/common to {{Gerrit|I3b5eac7fb}}: depool db74 for move to S6 [17:23:40] owner and group are handled by git I believe. [17:23:44] also [17:23:50] you need $title at least in the file name [17:23:53] Logged the message, Master [17:23:55] so that it is unique per use of git::clone [17:24:03] Ah, good point. [17:24:04] andrewbogott: not if they get set incorrectly [17:24:13] puppet is supposed to ensure that this is true [17:24:18] they might get set properly on initial clone [17:24:23] ? [17:24:23] but not later [17:24:32] !log demon synchronized wmf-config/CirrusSearch-common.php 'I114677d0' [17:24:33] The initial clone is done with the same class using the same owner/group... [17:24:34] say someone does [17:24:44] chown boogerman /path/to/clone [17:24:45] manually [17:24:45] Logged the message, Master [17:24:46] You're talking about if it's changed in puppet after the clone exists? [17:24:48] eah [17:24:49] no [17:24:52] not in puppet [17:24:52] ^d and ottomata: adding replicas now [17:24:54] but by someone [17:25:05] anyway, andrewbogott, i have a similar commit coming [17:25:06] oh, well… ok [17:25:11] !log demon synchronized wmf-config/InitialiseSettings.php 'I114677d0' [17:25:13] with changes to the uses [17:25:16] mind if I use mine? [17:25:20] nope [17:25:20] i will try your path things [17:25:21] thing [17:25:25] Logged the message, Master [17:28:28] (03PS1) 10Chad: Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 [17:28:32] (03CR) 10jenkins-bot: [V: 04-1] Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 (owner: 10Chad) [17:28:39] (03PS2) 10Chad: Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 [17:29:19] (03CR) 10Chad: [C: 032] Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 (owner: 10Chad) [17:29:30] (03Merged) 10jenkins-bot: Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 (owner: 10Chad) [17:29:57] (03PS1) 10Ottomata: Fixing git::clone so that it respects permissions on directory. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 [17:30:05] andrewbogott, paravoid ^ [17:30:11] ack .! [17:30:29] (03PS2) 10Ottomata: Fixing git::clone so that it respects permissions on directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 [17:30:29] (avert your eyes paravoid! don't look at the period!) [17:30:30] ok fixed [17:30:36] :p [17:30:58] !log demon synchronized wmf-config/InitialiseSettings.php 'I593f62a5' [17:31:11] Logged the message, Master [17:31:18] ottomata: That looks reasonable to me -- have you tried it? 
:) [17:31:29] !log demon synchronized wmf-config/CirrusSearch-common.php 'I593f62a5' [17:31:32] The if defined seems like overkill at this point, but harmless [17:31:43] Logged the message, Master [17:31:51] andrewbogott: nope! [17:32:00] but, the only place I could see it would break would be the labs_vagrant class [17:32:06] I can test if you're not set up to try... [17:32:09] i was thinking about wrapping it [17:32:13] oo, if you can test please do [17:32:14] thank you [17:32:40] (03PS1) 10Chad: Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 [17:32:44] (03CR) 10jenkins-bot: [V: 04-1] Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:33:11] (03PS2) 10Chad: Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 [17:35:10] (03CR) 10Manybubbles: [C: 031] Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:35:23] (03CR) 10Chad: [C: 032] Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:35:32] (03Merged) 10jenkins-bot: Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:36:18] !log demon synchronized wmf-config/CirrusSearch-common.php 'If920e0e9' [17:36:32] Logged the message, Master [17:40:33] anomie: thanks for responding with patches on that CentralAuth bug so quickly [17:40:48] anomie: I'm suspecting it's killing us in all layers of the infrastructure [17:41:51] I might be causing more icinga spam - sorry in advance for when/if it comes [17:41:58] adding new replicas [17:42:05] to recover from my mistake this morning [17:42:06] (03CR) 10Andrew Bogott: [C: 032] "I tested the $mode behavior on labs, it worked fine." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 (owner: 10Ottomata) [17:47:10] (03CR) 10Ottomata: "Great, thanks." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 (owner: 10Ottomata) [18:03:06] Jeff_Green: do you know what the current PDF box hardware configuration is? I'm trying to put together a RT ticket for requisition [18:06:13] mwalker: checking [18:07:20] paravoid: can you please describe the application server? I'd like to know where the imagescalers should go [18:07:33] sorry, I'm a little busy atm [18:08:19] np, let me know when there is no outage or meeting or whatever :) [18:08:27] heh [18:08:29] good luck with that [18:08:39] this seems to be my daily routine these days [18:09:13] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 2962: relocating_shards: 0: initializing_shards: 21: unassigned_shards: 477 [18:09:13] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 2962: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 481 [18:09:20] sounds like a lot of fun to be a fireman [18:09:23] mwalker: uptime 1179 days [18:14:21] mwalker: they're pretty modest machines: dual Xeon L5420's, 8GB RAM, single 73GB SAS HDD [18:15:14] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3188: relocating_shards: 0: initializing_shards: 13: unassigned_shards: 307 [18:16:13] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3204: relocating_shards: 0: initializing_shards: 15: unassigned_shards: 297 [18:19:43] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3236: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 295 [18:19:44] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3236: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 295 [18:19:44] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3236: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 295 [18:20:02] (03PS1) 10Manybubbles: Increase Cirrus pool counter for new servers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96064 [18:20:33] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3255: relocating_shards: 0: initializing_shards: 13: unassigned_shards: 288 [18:20:34] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3255: relocating_shards: 0: initializing_shards: 13: unassigned_shards: 288 [18:20:57] gah [18:22:03] checking [18:23:09] ignore it - sorry [18:23:17] If I do things too fast I think it causes this [18:23:44] this is me adding more replicas and Elasticsearch taking longer than expected to catch up [18:25:31] Jeff_Green: if you replied, I missed it; my IRC client froze; and I just noticed... [18:25:48] mwalker: they're pretty modest machines: dual Xeon L5420's, 8GB RAM, single 73GB SAS HDD [18:26:52] manybubbles: soooo [18:26:58] should I do more, orr waat? [18:27:25] LeslieCarr: can you help meeeeee?
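For reference, the replica bump manybubbles describes above is a dynamic index setting; a sketch with placeholder host and index names (not the exact commands used):

```bash
ES=http://elastic1001:9200
INDEX=enwiki_general          # example index name
# confirm what the index actually has -- the "didn't take" failure mode
# described earlier would show number_of_replicas: 0 here:
curl -s "$ES/$INDEX/_settings?pretty"
# raise it back; ES starts copying shards immediately, which is what the
# initializing/unassigned counts in the icinga output above reflect:
curl -s -XPUT "$ES/$INDEX/_settings" -d '{
  "index": { "number_of_replicas": 2 }
}'
```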
[18:27:53] AaronSchulz_: hey [18:28:24] Jeff_Green: that was probably kick ass back in the day :p -- we're still not sure of load; but we do expect to be CPU and RAM heavy [18:29:00] I'm planning on working off of a scratch local disk; so it would be nice if we had an SSD in there; but it's probably not strictly required [18:30:01] paravoid: I was just reading an email about whether to put MW tarballs in swift [18:30:58] AaronSchulz: I didn't see that [18:32:29] mwalker: for temp files? would shm make sense? [18:33:18] Jeff_Green: probably not; temporary could mean several days [18:33:34] k [18:34:18] (03CR) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr (031 comment) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [18:34:51] (03PS1) 10Reedy: NOT COMMITTING YOUR CHANGES IS BAD, YO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 [18:35:04] ^d: Not my fault, blame ori-l [18:35:14] <^d> :p [18:35:29] (03PS2) 10Reedy: NOT COMMITTING YOUR CHANGES IS BAD, YO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 [18:35:50] that commit summary bug needs fixing ;) [18:36:05] Reedy: it's filed. [18:36:09] (03CR) 10Reedy: [C: 032] Update bits static symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 (owner: 10Reedy) [18:36:15] !bug grrrit [18:36:15] https://bugzilla.wikimedia.org/grrrit [18:36:26] MatmaRex: So is bug 1 in our bugzilla ;) [18:36:29] what is that crappy bangcode. [18:36:34] !bugsearch grrrit [18:36:41] seriously? [18:37:00] https://bugzilla.wikimedia.org/show_bug.cgi?id=54372 [18:37:15] Jeff_Green: back to hardware configuration, my initial thought is to take the old configuration and just modernize it; so 2xquad core CPU, 16 or 32 GB RAM, and a small hard drive (SSD or spinny). Cscott pointed out that latex is probably disk heavy; so we should probably prefer SSDs [18:38:01] mwalker: sounds reasonable [18:38:02] ottomata: I'll let you know. I'll have a write up this afternoon about wtf happened. [18:38:11] its unpleasant [18:38:33] ok cool [18:39:08] Jeff_Green: how many should I request? and/or is there a standard configuration we have close to that? [18:39:46] i think just throw it in RT with essentially what you've described here and we'll follow up [18:40:03] kk [18:41:34] (03PS8) 10Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 [18:42:30] (03CR) 10Ottomata: Setting up varnishkafka on mobile varnish caches (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [18:42:52] ottomata: so testsearch100[1-3] seem to be freaking out. because: they are still running 0.90.3 and it looks like there is some kind of bug preventing it from working properly [18:43:08] wuh oh! [18:43:17] I'm shuffling the shards off of them [18:43:24] then we need to just shoot them [18:43:38] that last puppet update removes them from the list [18:43:51] ah, let me decommission them. we have a puppet class for that [18:45:34] (03PS9) 10Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 [18:45:58] yeah k [18:48:19] (03Merged) 10jenkins-bot: Update bits static symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 (owner: 10Reedy) [18:48:21] mwalker, Jeff_Green, we currently have 3 boxes and they're running close to full capacity.
new renderer should theoretically be faster, but still I'd say we should start with 4 servers [18:48:35] sounds good [18:49:19] (03PS7) 10Ottomata: Setting up varnishkafka on mobile varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [18:49:25] (03CR) 10Chad: [C: 031] Increase Cirrus pool counter for new servers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96064 (owner: 10Manybubbles) [18:49:27] we're pretty close to one of our standard builds, so we could just pull the SSD and toss it into the spare pool if it proves unnecessary [18:49:57] Can they swim? [18:50:11] we dredge them as necessary [18:50:34] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:34] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:43] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:43] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:43] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:51:12] so manybubbles, just fyi, I can create the new mounts without deleting the data [18:51:13] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:51:13] RECOVERY - ElasticSearch health check on testsearch1001 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:51:15] for the future ones [18:51:26] i can just stop elastics and then copy the data into the new partition [18:51:35] ottomata: it _shouldn't_ matter [18:51:41] mostly elastic will throw out the old data anyway [18:51:47] Jeff_Green: MaxSem: https://rt.wikimedia.org/Ticket/Display.html?id=6335 [18:51:53] because it'll rarely decide that that machine should host the same shard [18:51:59] people complain about that bug all the time [18:52:04] but that is how it is [18:52:15] mostly it'll have replicated by the time you bring everything back up [18:52:26] "status" : "green" [18:52:35] ottomata: as you see from the recovery above [18:53:54] "No permission to view ticket" [18:54:10] k, manybubbles you tell me when and what to do :p [18:54:41] MaxSem: hah! even though I CC'd you on it [18:54:50] ottomata: it is now safe to shoot them one at a time like we tried this morning [18:55:20] oh; great [18:55:24] someone triaged it [18:55:30] now I don't have permissions to view it either :p [18:56:04] I luv RT <3 [18:56:27] (03CR) 10Chad: [C: 032] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [18:56:32] mwalker, or ops who configured it?:P [18:57:33] Procurement queueue? [18:57:45] Jeff_Green: max and I can no longer view the ticket; I'm guessing that's because you moved it to the requisitions queue? if so; can you take the lead on forwarding any questions to me? (or give me permissions or something.) [18:57:46] <^d> Reedy: Sync that when it merges ^ so nobody yells at us :p [18:58:23] I'll add you as requestors [19:00:51] mwalker: better? [19:01:47] Jeff_Green: still no permission to view ticket [19:02:22] move it back to core-ops for now to have the discussion [19:02:36] then just put it in procurement when it's ready for ordering from vendor [19:02:52] alright [19:02:56] the permission thing is because we cant make the vendor part public [19:03:02] but thats all [19:09:05] mutante: can you set a template in RT? if you will, i can resolve rt128 [19:09:46] matanya: permission-wise.. yea [19:10:38] (03CR) 10jenkins-bot: [V: 04-1] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:11:30] stfu grrrit-wm [19:14:20] (03PS4) 10Reedy: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 [19:14:25] (03CR) 10Reedy: [C: 032] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:16:28] Jeff_Green: if Ryan_Lane sticks to his guns and won't let us use labs as a proving grounds; do you know if we have a spare box we could use until we get actual hardware?
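The one-at-a-time procedure being agreed on here, as a sketch (the service name and the wait loop are assumptions):

```bash
# With replicas in place, one node can safely be taken out at a time:
service elasticsearch stop
# ... create and mount the new data partition; copying the old
# /var/lib/elasticsearch across is optional, since (as noted above)
# the cluster will mostly re-replicate onto the node anyway ...
service elasticsearch start
# wait for green before touching the next node:
until curl -s http://localhost:9200/_cluster/health | grep -q '"status":"green"'; do
  sleep 15
done
```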
[19:16:45] mwalker: we always do [19:16:49] for this purpose ;) [19:17:04] mwalker: the SSD is the only thing we probably don't have handy, [19:17:11] mwalker: add a ticket to RT in procurement queue [19:17:51] ok, we don't need SSDs to start with; it'll likely just be slower -- or maybe not; we don't really know yet :p [19:17:56] ha [19:18:03] throw it on shm :-) [19:18:24] (03PS10) 10Ottomata: Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 (owner: 10Manybubbles) [19:18:27] disk and anything pretending to be disk is so 1998 [19:18:31] (03CR) 10Ottomata: [C: 032 V: 032] Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 (owner: 10Manybubbles) [19:20:18] * AND status = 'new' [19:20:20] so, mutante if you will create the template and help me test a bit i can right the needed puppet code [19:20:30] *write [19:22:39] (03Merged) 10jenkins-bot: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:22:48] PROBLEM - ElasticSearch health check on testsearch1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [19:23:18] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [19:23:57] !log reedy synchronized wmf-config/CommonSettings.php 'I0921b9e853fdacd554de40f138644d6187ae7efe' [19:24:11] Logged the message, Master [19:24:38] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: Connection refused [19:25:20] eh? [19:25:33] is that elasticsearch? [19:25:35] I think it is [19:25:38] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.001 second response time [19:25:52] it was a page, whatever it is [19:25:59] manybubbles, ^d: ping? [19:25:59] it was elastic [19:26:13] seriously, lvs, [19:26:16] that is elasticsearch [19:26:30] ottomata just merged something that includes ::decomission to testsearchNNNN [19:26:31] 11:27 <+icinga-wm> PROBLEM - ElasticSearch health check on testsearch1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [19:26:44] it seems testsearch1001 and 1002 went down [19:26:54] they did because of https://gerrit.wikimedia.org/r/95720 [19:27:01] heh, stupid gmail. won't let me delete 110,000 emails? I'll script it through imap, one message at a time [19:27:08] mutante: lvs should have picked up elastic10XX [19:27:36] picked up how? [19:27:38] noone configured it to [19:27:38] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [19:28:02] Ryan_Lane: prepping for your departure? [19:28:04] paravoid: grrr - I thought class { "lvs::realserver": realserver_ips => [ "10.2.2.30" ] } was in the old config [19:28:05] eqiad/search:{ 'host': 'testsearch1001.eqiad.wmnet', 'weight': 30, 'enabled': True } [19:28:07] can you tell me which servers should it be pointed to? [19:28:08] eqiad/search:{ 'host': 'testsearch1002.eqiad.wmnet', 'weight': 30, 'enabled': True } [19:28:10] or just a LOT of spam? [19:28:12] manybubbles: that does not alter the LVS config [19:28:15] cron spam [19:28:21] paravoid: seriously? [19:28:21] i figure we disable them [19:28:29] manybubbles: seriously, on purpose [19:28:33] it's most of my mailbox [19:28:38] this is the realserver part, not the balancer part [19:28:46] paravoid: are you setting False in pybal ? 
edit conflict [19:28:56] I haven't edited anything [19:29:01] Jeff_Green: https://rt.wikimedia.org/Ticket/Display.html?id=6336 is the ticket for the temporary box -- I put it in the procurement queue; so once again I can't read it; but I shouldn't need to have much input in it [19:29:07] manybubbles: what should the new realservers be? [19:29:18] mwalker: ok. i'll shop it around [19:29:26] paravoid: elastic1001-elastic1012 [19:29:40] mutante: you have the file open, are you fixing or should I? [19:29:56] i'm adding the new ones, what about testsearch1003 [19:30:11] mutante: it should be getting retired as well [19:31:01] mutante: it looks like it will get retired when puppet runs for it [19:31:11] !log pybal - eqiad/search: disabling testsearch100[13], adding elastic10[12] [19:31:20] paravoid: is this all though? [19:31:23] Logged the message, Master [19:31:35] mutante: it wasn't, I fixed it [19:31:44] it was 01-12, not 01-02 [19:32:21] yeah 01-12 [19:32:25] oh, i see, we have that many now [19:32:37] yea, i see them all now, thx [19:32:53] (03PS1) 10Jgreen: remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/96072 [19:32:54] fixed now [19:33:30] paravoid and mutante: thanks. can you point me to documentation on this? [19:33:36] I don't want to break it again [19:33:50] manybubbles: https://wikitech.wikimedia.org/wiki/Pybal [19:33:52] https://wikitech.wikimedia.org/wiki/LVS [19:34:04] the latter has more, the first one is just the software [19:34:08] yea, that's the better one [19:34:15] just noticed the HowTo is over there [19:35:06] manybubbles: technically, ottomata caused this :) [19:35:17] huh [19:35:21] is search on mw.org broken? [19:35:25] "We could not complete your search due to a temporary problem. Please try again later." [19:35:31] i tried later, still didn't work. :P [19:35:36] https://www.mediawiki.org/w/index.php?title=Special%3ASearch&profile=default&search=commit+message+guidelines&fulltext=Search [19:35:41] yup, still doesn't work [19:35:46] manybubbles: ^^ [19:36:08] * greg-g sighs [19:36:11] so, why did we decom testsearchNNNN before we tried switching LVS? [19:36:28] this is a weird way of migrating a service [19:37:10] MatmaRex: I'm seeing it [19:37:25] paravoid: I'm brain dead after all the other excitement today. [19:38:02] paravoid: I'll send an email, but they were broken in other ways and needed to go. I thought we'd already added the new nodes to lvs. [19:38:05] do you want to roll back to testsearch ? [19:38:07] Not for any good reason [19:38:07] (03PS1) 10Andrew Bogott: Added a 'no longer puppetmaster' motd on sockpuppet and stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96073 [19:38:12] Jeff_Green: ^ [19:38:23] mutante: give me a moment - probably not [19:38:27] andrewbogott: yay! [19:38:44] manybubbles: kk [19:39:05] * andrewbogott logs into sockpuppet to merge that patch... [19:39:24] (03CR) 10Andrew Bogott: [C: 032] Added a 'no longer puppetmaster' motd on sockpuppet and stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96073 (owner: 10Andrew Bogott) [19:39:59] PROBLEM - ElasticSearch health check on testsearch1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [19:40:13] hehe [19:40:51] so, uh, is this blocking people saving edits for some reasons? [19:40:52] -s [19:40:54] ottomata: wait until you write the outage report before you "hehe" :) [19:41:18] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [19:41:29] hey!
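The distinction that bit everyone here: class { 'lvs::realserver': ... } only makes the backends accept traffic for the service IP; which backends actually receive traffic is decided by pybal's own config, in the same format the log already quotes. After the [19:31:11] !log entry, the eqiad/search pool would read roughly as below -- the weights are guesses; only the host list and enabled flags follow the log:

```
{ 'host': 'testsearch1001.eqiad.wmnet', 'weight': 30, 'enabled': False }
{ 'host': 'testsearch1003.eqiad.wmnet', 'weight': 30, 'enabled': False }
{ 'host': 'elastic1001.eqiad.wmnet', 'weight': 10, 'enabled': True }
{ 'host': 'elastic1002.eqiad.wmnet', 'weight': 10, 'enabled': True }
...
{ 'host': 'elastic1012.eqiad.wmnet', 'weight': 10, 'enabled': True }
```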
[19:41:39] ha [19:41:44] uh, wait this is a real outage? [19:41:45] I'm getting 503s [19:41:46] yes. [19:41:49] these are new search servers? [19:42:04] you merged a patch that decom'ed the old ones [19:42:06] yeah [19:42:08] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [19:42:19] yesshhhh, manybubbles said he put everything on the new ones [19:42:35] ottomata: there was no pybal change [19:42:48] ottomata: so they just went down but still in LVS [19:42:58] oh so search reqs still routed to them? [19:42:59] ottomata: paravoid added all the new ones now.. but it doesn't work [19:43:05] a) you didn't change pybal at the same time, which made pybal look at the old ones, b) the new ones weren't confirmed to be working before we decom'ed the old servers [19:43:27] manybubbles: still working on it? need anything? [19:43:43] ok sorry, i wasn't aware that manybubbles' change could actually bring things down [19:44:17] ottomata: turns out my clever use of 'path =>' doesn't help, we still get conflicts. [19:44:27] search is considered a critical service, and it pages us too [19:44:30] andrewbogott: just on labs_vagrant instances, or elsewhere? [19:44:35] So I guess we need to add some more if (!defined) [19:44:42] mistakes happen, but please don't CR+2/merge if you're not sure how things work :) [19:44:43] well, on palladium and sockpuppet at the moment. [19:44:54] yeah, for sure, sorry about that paravoid [19:45:25] you're writing an outage report now though :) [19:45:32] what sockpuppet? [19:45:35] manybubbles and I were confused this morning, because we didn't actually expect these machines to already be in the elasticsearch cluster [19:45:38] well, after this is over [19:46:02] manybubbles: still here? [19:46:25] when cmjohnson1 puppetized them, elasticsearch was installed and they were added to the cluster, which we wanted to do slowly [19:46:54] <^d> Ok, well that's not how the puppet setup is designed. [19:46:55] i think in the future, for new nodes of any system like this, the base system can/should be puppetized, just not any of the functional softwares [19:47:21] that should've been done under manybubbles watch [19:47:23] so, yeah, future perfects and all, but what's going to happen now to fix the issue? [19:47:46] yea, revert or not [19:47:56] i don't know, hoping manybubbles chimes in! [19:47:58] can we revert? nik said he didn't want to [19:48:00] I'm here [19:48:02] ^d: manybubbles isn't responding to pings, I have no idea if he's deep in debugging or afk; I don't think any ops know anything about elasticsearch, could you please help? [19:48:06] sorry, working around more stuff [19:48:12] oh [19:48:14] there he is :) [19:48:27] so we somehow got unassigned shards [19:48:32] I really really really have no idea how [19:48:35] lots of unassigned shards [19:48:39] and I'm assigning them [19:48:49] using a script I've shoved on wikitech/wiki/Search/New [19:48:54] i have no idea how complicated it is for you guys, but first thought is maybe you just wanna bring testsearch back up for now and then figure out the rest [19:49:11] That should get us better [19:49:33] i have to run in 10 minutes [19:49:37] ottomata: we can bring testsearch back up only if we upgrade elasticsearch on them [19:49:42] ugh [19:49:58] manybubbles: can we just turn off the new ones?
[19:50:06] I see some results from https://www.mediawiki.org/w/index.php?title=Special%3ASearch&profile=default&search=commit+message+guidelines&fulltext=Search at any rate [19:50:09] if you want to bring them all down [19:50:09] will it work if we have testsearch with the old versions and no one using the new versions? [19:50:42] that's fine, i can do that real quick [19:50:42] ottomata: OK, nevermind, this is actually because of a different mistake I think. [19:50:45] at this point all the state has been migrated to elastic1001-1012 and bringing the old ones back online will require migrating it back [19:50:52] ok... [19:50:53] so [19:50:54] what? [19:51:01] better to just get the new ones up? [19:51:16] sounds like that's what he's doing (assigning shards, on the new ones, I assume) [19:51:19] yeah [19:52:00] yeah [19:52:22] at this point we can either switch back to lsearchd as primary on all wikis or wait until this is done [19:52:58] please switch to lsearchd [19:52:59] how long til this is done? [19:53:02] manybubbles: how long will the fix take? and can the switch back to lsearchd happen without any other intervention? [19:53:08] it just finished moving everything back [19:53:09] I'm leaning towards lsearchd until we're good [19:53:12] (03PS1) 10Andrew Bogott: Remove explicit creation of /var/lib/git/operations/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/96076 [19:53:12] ok [19:53:19] lsearchd can be done without my help [19:53:39] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1312: active_shards: 3592: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:53:52] and now that we're recovered the exceptions have stopped [19:54:28] but it'd still be worth switching back until I can figure out what knocked out all those shards [19:54:45] in fact, that is pretty critical. I can't really trust the system without understanding that [19:54:58] <^d> On it. [19:55:01] alright, who's going to switch us ... thanks [19:55:02] <^d> lsearch thing. [19:55:10] (03CR) 10Andrew Bogott: [C: 032] Remove explicit creation of /var/lib/git/operations/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/96076 (owner: 10Andrew Bogott) [19:55:26] (03PS1) 10Chad: Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 [19:55:52] (03CR) 10Chad: [C: 032] Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 (owner: 10Chad) [19:55:58] (03CR) 10Faidon Liambotis: [C: 032] Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 (owner: 10Chad) [19:56:03] (03Merged) 10jenkins-bot: Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 (owner: 10Chad) [19:56:15] ^d: thanks. [19:56:43] <^d> yw [19:56:55] are you deploying too? [19:56:56] !log demon synchronized wmf-config/InitialiseSettings.php 'Disabling Cirrus on all wikis but test2wiki' [19:56:58] there [19:56:59] <^d> Done. [19:56:59] ok you are [19:57:00] :) [19:57:05] thank you [19:57:10] Logged the message, Master [19:57:10] <^d> No problem. [19:57:19] <^d> Left test2wiki in place so we can have some place for debugging.
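The same fields icinga prints in the checks above can be polled by hand to confirm the cluster has settled before Cirrus comes back (the host is an example):

```bash
curl -s 'http://elastic1001:9200/_cluster/health?pretty'
# expect before re-enabling Cirrus as primary:
#   "status" : "green",
#   "relocating_shards" : 0,
#   "initializing_shards" : 0,
#   "unassigned_shards" : 0
```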
happy monday morning, everyone! [19:57:31] people, 12 outages in 12 days [19:57:49] 12 consecutive days [19:57:58] (03PS1) 10Andrew Bogott: Remove explicit creation of /var/lib/git/operations/puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/96078 [19:58:01] yikes sorry all :( [19:58:32] ok, i'm teaching a class today, have to run to make it [19:58:42] paravoid: happy to write outage report, but I won't be able to until tomorrow [19:58:46] or late tonight [19:59:34] ok, i have to run like right now, send me an email with what I need to do [19:59:43] byyeee [20:00:09] well, sooner the better, either chad or nik can also do it [20:00:30] (03CR) 10Andrew Bogott: [C: 032] Remove explicit creation of /var/lib/git/operations/puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/96078 (owner: 10Andrew Bogott) [20:02:28] greg-g: I'm starting now [20:02:35] manybubbles: thanks much man [20:02:39] brbs [20:02:44] I'm still not sure of everything, I'll get what I know [20:13:33] apergos: salt-key -F | grep master [20:13:47] on palladium, which I'd imagine will be the new salt master? [20:14:16] well I dunno [20:14:22] to switch clients to the new master, you'll need to: 1. scp the keys from sockpuppet to the new master [20:14:29] the minion keys, of course [20:14:56] 2. change puppet to point the minions to the new master, listing the new master fqdn and the master's public fingerprint [20:15:00] I have the vague notion that palladium might already have the salt master class in the node manifest [20:15:48] I wonder if I'm going to have to manually toss the master key on all the clients [20:16:05] they cache the pub key I believe [20:18:03] I don't believe you'll need to [20:18:13] I wonder how multi-master is set up [20:18:50] oh [20:18:52] simple: http://docs.saltstack.com/topics/tutorials/multimaster.html [20:19:21] it's really just a matter of pointing the minions at both [20:19:24] so let's do that [20:19:29] then we'll have no salt outage at all [20:19:45] ok [20:20:07] I will do this tomorrow morning (believe it or not I am actually eating dinner now, shocking I know) [20:20:37] unless we have some other disaster. good thing about mornings is they are usually a quiet time so if something does go awry I'll be able to fix it up without inconveniencing folks [20:21:06] the multimaster thing seems good [20:21:37] yep [20:21:41] it's kind of basic [20:22:07] but assuming we keep everything in sync (which puppet does) we'll be fine [20:22:21] do we have shared files? [20:23:07] hm same master private key on bot, ok [20:23:10] *both [20:23:43] nah, puppet installs everything [20:23:55] perfect [20:23:57] only the key data needs to be sync'd [20:24:11] seems, pretty basic like you say [20:24:15] and eventually it'll use puppet's keys [20:24:21] oh ho [20:25:05] so this is not the syndic approach we were talking about earlier [20:25:17] this is just 'get ready to get us the heck off of sockpuppet' [20:25:59] (which is fine, I just want to flag that we do eventually want more than one salt master anyways for spof reasons) [20:26:41] right [20:26:53] I wonder how the master finger stuff works with this [20:27:10] * Ryan_Lane asks in #salt [20:27:30] forgot to rejoin after nome-3 crash [20:27:32] *gnome [20:31:19] hm.
[20:31:43] :-D
[20:31:53] I'm pretty done for the day or I'd poke around in it
[20:32:06] but if I had energy it would be for poking around getting docker running on f20 beta
[20:32:57] seems you sync the master key too
[20:35:10] yeah, I mentioned that above
[20:35:26] one key to rule them all etc
[20:35:47] yep
[20:36:07] but I was wondering how it would work with a syndic, because isn't that with separate keys?
[20:36:25] I have not read any of the docs on that if you can't tell
[20:37:09] (Abandoned) Andrew Bogott: Implement the $mode param for git::clone [operations/puppet] - https://gerrit.wikimedia.org/r/96057 (owner: Andrew Bogott)
[20:38:01] Ryan_Lane: any objection to https://gerrit.wikimedia.org/r/#/c/95699/ ?
[20:38:41] that's fine
[20:39:10] (CR) Andrew Bogott: [C: 2] Removed misc::deployment::scripts class. [operations/puppet] - https://gerrit.wikimedia.org/r/95699 (owner: Andrew Bogott)
[20:39:36] !log disabled a bunch of rt users who no longer login and don't work here anymore (so if you do work here and i messed up, oops ;)
[20:39:49] Logged the message, RobH
[20:44:07] (PS1) RobH: removing access from dsc [operations/puppet] - https://gerrit.wikimedia.org/r/96143
[20:45:11] (PS2) Jgreen: remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - https://gerrit.wikimedia.org/r/96072
[20:45:15] apergos: the syndic is via the minion
[20:45:41] * apergos goes to rtfm
[20:45:45] (CR) Jgreen: [C: 2 V: 1] remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - https://gerrit.wikimedia.org/r/96072 (owner: Jgreen)
[20:46:03] I believe the syndic subscribes and relays
[20:46:19] a passthrough minion running on the one master
[20:46:21] hm
[20:46:44] yep, so you configure the syndic to point at the master
[20:46:50] then accept the syndic on the master
[20:47:28] I'll have to play with that
[20:47:43] (not tomorrow morning on the cluster though :-D)
[20:48:12] heh
[20:48:19] well, I want syndic for esams, ulsfo and labs
[20:48:38] paravoid https://gerrit.wikimedia.org/r/#/c/88261/ :)
[20:48:39] local master per dc?
[20:48:49] except that I don't think we want labs to be in that list
[20:48:55] ori-l: Reedy: et al. fyi, "procurement" queue should be open and usable for you now
[20:48:59] (Abandoned) Jgreen: remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - https://gerrit.wikimedia.org/r/96072 (owner: Jgreen)
[20:49:04] a syndic just syndicates from a master
[20:49:16] so labs instances would point to the labs master
[20:49:24] only one way, ok
[20:49:25] * ori-l procures all the things
[20:49:33] but they'd get syndicated calls from production as well
[20:49:36] right
[20:49:39] so you could call labs from production
[20:49:47] which would be convenient :)
[20:49:51] I dunno why I thought initially it was bidirectional
[20:49:59] yeah, I thought so too
[20:50:06] but this is better
[20:50:09] indeed
[20:51:07] let us never have so many layers of syndics that we have to change the syndic_wait value :-D
[20:53:04] heh
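To make the syndic wiring described above concrete, a minimal sketch with hypothetical hostnames — the syndic daemon runs next to the lower-level master and relays jobs one way, downward, exactly as discussed:

    # on the top-level master, /etc/salt/master must opt in to syndics:
    #   order_masters: True
    # (syndic_wait there controls how long it waits for relayed returns)
    #
    # on the lower-level master (e.g. the labs one), point upward in its
    # own /etc/salt/master:
    #   syndic_master: production-master.example.wmnet
    # then start the relay next to the local salt-master...
    salt-syndic -d
    # ...and accept its key on the top-level master, like any minion
    # (the key id here is hypothetical):
    salt-key -a labs-master-key-id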
[20:53:13] we may have up to 3
[20:53:18] production -> labs
[20:53:24] labs -> project master
[20:53:32] project master? eh?
[20:53:42] that way projects like deployment-prep can point their instances to a master they manage
[20:53:46] oh hm
[20:53:55] and the calls will get chained all the way through
[20:54:19] so many opportunities for someone to break configuration on many labs instances at once :-D
[20:54:32] oh, that's possible right now :D
[20:54:48] puppet would maintain everything anyway ;)
[20:56:29] well it will after the first time we break something for a user
[20:56:31] :-D
[21:04:51] !log deployed Parsoid fd3d6dc
[21:05:06] Logged the message, Master
[21:10:06] (PS1) Jgreen: add list of fundraising hosts to site.pp for reference [operations/puppet] - https://gerrit.wikimedia.org/r/96148
[21:14:51] Jeff_Green: thanks, that list worksforme
[21:14:58] cool
[21:15:14] also yay that two more node entries are gone :-D
[21:17:05] (CR) Jgreen: [C: 2 V: 1] add list of fundraising hosts to site.pp for reference [operations/puppet] - https://gerrit.wikimedia.org/r/96148 (owner: Jgreen)
[21:55:04] (PS1) Jgreen: remove deprecated fundraising config [operations/puppet] - https://gerrit.wikimedia.org/r/96153
[21:59:52] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[22:00:53] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[22:01:29] (CR) Jgreen: [C: 2 V: 1] remove deprecated fundraising config [operations/puppet] - https://gerrit.wikimedia.org/r/96153 (owner: Jgreen)
[22:03:55] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[22:30:15] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 22:30:05 UTC 2013
[22:30:57] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[22:33:28] (PS1) EBernhardson: Disable VisualEditor inside the Flow project [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/96161
[22:35:16] (CR) Cmcmahon: [C: 1] "needed to properly disable VE for Flow on beta labs before deployment" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/96161 (owner: EBernhardson)
[22:37:58] (PS2) EBernhardson: Disable VisualEditor inside the Flow project [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/96161
[22:58:46] (PS1) Jgreen: resurrecting ocg role [operations/puppet] - https://gerrit.wikimedia.org/r/96165
[22:59:27] (PS2) RobH: removing access from dsc [operations/puppet] - https://gerrit.wikimedia.org/r/96143
[22:59:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 22:59:49 UTC 2013
[22:59:56] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[23:02:12] (CR) Jgreen: [C: 2 V: 1] resurrecting ocg role [operations/puppet] - https://gerrit.wikimedia.org/r/96165 (owner: Jgreen)
[23:12:00] Jeff_Green: tabs!
[23:12:09] tabs?
[23:12:15] that file has tabs, we use 4 spaces now :)
[23:12:15] OH i forgot. dammit.
[23:12:18] it was just a revert?
[23:12:27] i'll fix it and add the fancy vim header
[23:15:37] (PS1) Jgreen: !tabs! fixed tabs to spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96167
[23:17:06] paravoid: hi, any hope of getting that patch? we want to deploy in 2 days some code that depends on it. Otherwise we have tons of users seeing an incorrect "free" message
[23:18:26] I'll have a look
[23:18:34] I can't promise anything, it's been pretty busy
[23:21:08] (CR) Jgreen: [C: 2 V: 1] !tabs! fixed tabs to spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96167 (owner: Jgreen)
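The tab cleanup being merged above is a one-liner class of fix; a sketch, with a hypothetical manifest path:

    # Replace every hard tab with four spaces, per the puppet repo's 4-space
    # policy (file path hypothetical; expand -t 4 would work just as well):
    sed -i 's/\t/    /g' manifests/role/ocg.pp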
[23:21:47] (CR) RobH: [C: 2] removing access from dsc [operations/puppet] - https://gerrit.wikimedia.org/r/96143 (owner: RobH)
[23:21:51] mhoover: Probably good for you to subscribe, although you can ignore most of the traffic: https://lists.wikimedia.org/mailman/listinfo/ops
[23:22:09] *shrug* Or you can remain blissfully ignorant; your choice.
[23:22:21] (PS1) Jgreen: sigh. finished fixing tabs-->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96168
[23:22:36] andrewbogott: i think i was told anyone who has deployment *must* be on that list?
[23:22:50] but that was perhaps 6 months ago, no idea if folks follow that guideline, or if it's still valid.
[23:22:54] mhoover: There you have it! Please subscribe.
[23:23:05] since we announce downtimes
[23:23:14] RobH: Yeah, I'm convinced.
[23:23:14] and many times it's our first line of 'wtf does this server do?'
[23:23:38] but it's also NDA or employee only (but pretty sure all deployers fall into that)
[23:23:42] paravoid: let me know asap if there are any problems with it - we have had tons of complaints about incorrect traffic labeling
[23:23:46] just fyi
[23:23:49] thanks!!!
[23:24:03] how's that homepage thing coming along? :)
[23:27:13] paravoid: funny you asked - i went through several revisions on that one - but it seems it will be a dir with a full install of the wiki, with a mod_rewrite treating all paths as "/" (root). That root will internally be a request to our special page that will redirect to the needed page.
[23:27:20] (CR) Jgreen: [C: 2 V: 1] sigh. finished fixing tabs-->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96168 (owner: Jgreen)
[23:27:54] yeah, we had agreed to that with Adam weeks ago
[23:28:05] mhoover: also, please request a cloak here: https://spreadsheets.google.com/viewform?hl=en&formkey=dG1FTWV1RnNBVHFOSnExMHF6aUhya2c6MA
[23:28:10] paravoid: hehe, we just met today and he told me that :)))
[23:28:15] So that we can reliably identify you on IRC :)
[23:29:01] how come puppet has an indentation policy opposite to virtually all the other repos? :P
[23:30:04] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 23:29:55 UTC 2013
[23:30:55] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[23:32:21] andrewbogott: doing
[23:32:40] mhoover: Filling out that google form is step one, step two is 'nag constantly'
[23:32:50] although I can't remember who to nag :(
[23:32:53] heheheeh
[23:32:59] got it
[23:43:25] (PS7) Dzahn: move IRC server to module [operations/puppet] - https://gerrit.wikimedia.org/r/94407
[23:45:46] (CR) Dzahn: [C: 2] move IRC server to module [operations/puppet] - https://gerrit.wikimedia.org/r/94407 (owner: Dzahn)
[23:47:18] andrewbogott mhoover: I suspect the folks listed in https://meta.wikimedia.org/wiki/IRC/Cloaks#People_who_deal_with_Wikimedia_cloaks can help if needed
[23:47:47] barras is the only one online but flagged as away right now
[23:49:09] andrewbogott: ^ and it still works :)
[23:49:13] have them process my request from..... March
[23:49:18] thanks for the labs test the other day, btw
[23:49:41] greg-g, time to wrangle some additional group contacts? :)
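Circling back to the homepage scheme described at 23:27 above: in Apache terms that idea comes down to a couple of rewrite rules. A rough sketch under stated assumptions — the vhost path and the special-page name are stand-ins, not the real ones:

    # Append a fragment to the (hypothetical) vhost that hands every request
    # path to the wiki's entry point, letting a special page compute the real
    # redirect; the RewriteCond prevents a rewrite loop on /w/ itself.
    cat >> /etc/apache2/sites-enabled/wiki.conf <<'EOF'
    RewriteEngine On
    RewriteCond %{REQUEST_URI} !^/w/
    RewriteRule ^ /w/index.php?title=Special:Landing [L]
    EOF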
[23:50:46] Eloquence: a bit :)
[23:51:01] I'm actually no longer an Ubuntu Member (rescinded a while ago, heh)
[23:51:10] andrewbogott: thx
[23:51:28] they're just being nice and letting me keep it until I have the wikimedia one.... :)
[23:51:46] your cloak is a lie. it's a lie!
[23:51:47] ^demon|busy: when you're not busy; if you have time before the end of the day; I have a couple of gerrit repos pending in the request queue
[23:52:09] Eloquence: we're all just facades, make them fun.
[23:52:41] (CR) Dzahn: "i don't know what to do with this patch currently, so i'm removing myself from reviewers. feel free to re-add me once there is a change" [operations/puppet] - https://gerrit.wikimedia.org/r/80760 (owner: Dereckson)
[23:53:43] <^demon|busy> collectoid? Really? /me sighs
[23:54:21] I don't have a cloak :-)
[23:54:26] nor do I want one
[23:56:13] support your favorite project! debian/paravoid
[23:56:23] debian uses OFTC
[23:56:33] * hashar hides
[23:57:18] > 50% of oftc "staff" are Debian people
[23:59:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 23:59:50 UTC 2013