1 Sep 2010 01:03
Re: Question about data warehousing and mining through Mahout
hdev ml <hdevml <at> gmail.com>
2010-08-31 23:03:26 GMT
Thanks, Chris, for the answers.

1. The data is only going to grow. This 1.5TB of data is from just one module. There are other modules, which may have a similar kind of data in their logs. After discarding data, this 3.0TB becomes 1.5TB. It is still huge. Since this is just one month's data, and we would want to make use of at least the past 6-12 months, the data size lands in the 10TB-20TB range. So I am guessing Hadoop is the right answer. I am just not sure which sub-project to use in this case.

2. When you say querying with Hive, note that I want to use the same Hive data for future data mining. So my question was: can that be done with Mahout integrating with the Hive layer, instead of with the Hadoop layer directly? Or maybe we can use the Hive data directly, if I can reverse engineer the data format that Hive uses internally. Hopefully it does not store compressed data.

3. Hmm, that seems like a very good suggestion. I am not averse to the idea of writing my own implementations of mining algorithms. I am just worried about their accuracy and stability.

So the summary is basically: do the transformation and statistical part first. When it comes to data mining, write your own algorithms or use Mahout (if Hive integration is possible at all, or maybe by reusing the raw text files or an output dump of the Hive tables).

On Tue, Aug 31, 2010 at 3:38 PM, Chris Bates <christopher.andrew.bates <at> gmail.com> wrote:
> From my experience (merging machine learning with business goals), I'll
> offer a few pieces of advice that may help guide you.
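
On point 2 in the message above (using the Hive data directly), here is a minimal sketch of what reading a table's files might look like. It assumes the table is stored as an uncompressed TEXTFILE with Hive's default delimiters, where fields are separated by Ctrl-A (\x01) and NULL is written as \N. The function names and the column names in the usage note are hypothetical, and a table created with a custom ROW FORMAT or a different storage format would need different handling.

```python
# Sketch only: assumes a Hive table stored as uncompressed TEXTFILE
# with the default delimiters. Names here are illustrative, not from the thread.

FIELD_DELIM = "\x01"   # Hive's default field separator (Ctrl-A)
NULL_TOKEN = "\\N"     # Hive's default text representation of NULL


def parse_hive_text_line(line, columns):
    """Split one row of a default-format Hive text file into a dict."""
    values = line.rstrip("\n").split(FIELD_DELIM)
    return {
        name: (None if value == NULL_TOKEN else value)
        for name, value in zip(columns, values)
    }


def read_hive_text_file(path, columns):
    """Yield one dict per row from a file copied out of the table directory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield parse_hive_text_line(line, columns)
```

For example, parse_hive_text_line("2010-08-31\x01moduleA\x01\\N\n", ["day", "module", "user"]) gives a dict with "day" and "module" as strings and "user" as None. All values come back as strings, so any numeric conversion would have to follow the table's actual schema.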