Kee Siong Ng | 3 Nov 00:47 2011

Re: SVM Features in MADLib

Sorry for the delay in responding, Shankha. I have been travelling.

> 1) How well does the MADlib SVM implementation scale in PostgreSQL
> or Greenplum?

In short, very well.

The SVM implementation in MADlib uses an online stochastic gradient
descent algorithm to do the learning, which means examples are
processed one at a time, allowing it to scale to massive datasets. In
particular, we avoid the problem of computing the large kernel matrix
that batch algorithms require.
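
To give a flavour of the idea, here is a rough sketch of a hinge-loss
SGD update for a linear SVM. This is just an illustration written for
this reply, not the actual MADlib code, and the function and parameter
names are made up:

    # Sketch only: a regularised linear SVM trained by stochastic
    # gradient descent on the hinge loss.  Each row updates the weights
    # in place, so only the current row and the model itself ever need
    # to be held in memory.
    import numpy as np

    def sgd_svm_train(rows, n_features, lam=1e-4, epochs=1):
        w = np.zeros(n_features)
        t = 0
        for _ in range(epochs):
            for x, y in rows:               # x: feature vector, y: +1 or -1
                t += 1
                eta = 1.0 / (lam * t)       # decaying step size
                violates = y * np.dot(w, x) < 1.0
                w *= (1.0 - eta * lam)      # regularisation shrink
                if violates:                # the row broke the margin
                    w += eta * y * x        # nudge w towards classifying it
        return w

The important property is that the inner loop touches one (x, y) pair
at a time, so the rows can be streamed straight off a table scan.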

Perhaps the main limitation on scalability is the size of the model.
For a billion-row data set, even a compression factor of 0.01% would
give you a model of around 100,000 support vectors, which is fairly
big and needs to be evaluated often.

> 2) Does the MADlib SVM implementation utilize parallel processing?

Yes, you can learn an ensemble of SVMs. We don't use parallel
processing when learning a single SVM though.

> 3) If my training set were 150k instances and 3000 features, roughly
> how long would training take using the MADlib SVM implementation?

Probably seconds. At most minutes.


Shankha Chaudhuri | 3 Nov 11:48 2011

Re: Re: SVM Features in MADLib

Thank you for your time and thorough answers. I will let you know if
we decide to utilize your implementation of SVM.

Regards,
Shankha


schaudhuri | 3 Nov 13:39 2011

Re: SVM Features in MADLib

Thanks, Kee Siong Ng. I appreciate your answers and time. I will let
you know if we decide to go with your SVM implementation.

-SC


schaudhuri | 3 Nov 15:01 2011

Re: SVM Features in MADLib

Kee Siong,

I would like it if you could elaborate on one specific question.

1) If SVM training on a single model is not done in parallel
(multi-threaded), how can it be done so fast?

Regards,
SC


Kee Siong Ng | 6 Nov 01:29 2011

Re: SVM Features in MADLib

> I would like it if you could elaborate on one specific question.
>
> 1) If SVM training on a single model is not done in parallel
> (multi-threaded), how can it be done so fast?

The MADlib SVM online algorithm is more about memory-efficiency
than speed-efficiency.

Single-model SVM learning can process massive datasets pretty much
unconstrained by available main memory, because it processes training
examples one at a time.

Many existing SVM algorithms don't scale well because of the need
to compute the large kernel matrix [ k(x_i,x_j) ] for all i,j.
Memory-efficient algorithms avoid doing that.

In this sense, the MADlib implementation is not necessarily a lot
faster than existing batch algorithms on the same dataset, but it can
handle large datasets that some existing algorithms cannot solve at
all.

The SMO algorithm is memory-efficient in a similar way: it breaks the
overall optimisation problem into 2-dimensional sub-problems that can
be solved analytically. One could implement the SMO algorithm in
MADlib, but the random data-access pattern SMO requires is not as well
suited to implementation in (procedural) SQL as SGD's sequential scan.
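
If it helps to see the shape of it, here is a rough sketch of a
generic online kernelised learner (a simple kernel perceptron written
for this reply, not the MADlib algorithm). The model is just the rows
kept so far, so memory grows with the number of stored support
vectors, never with the full kernel matrix:

    # Sketch only: an online kernel learner whose model is the list of
    # rows it has misclassified so far.  No n-by-n kernel matrix is
    # ever built; each new row is compared against the stored support
    # vectors only.
    import numpy as np

    def rbf(x, z, gamma=0.1):
        d = x - z
        return np.exp(-gamma * np.dot(d, d))

    def kernel_perceptron_train(rows, kernel=rbf):
        support, coeffs = [], []           # the model
        for x, y in rows:                  # one pass, one row at a time
            score = sum(a * kernel(s, x) for s, a in zip(support, coeffs))
            if y * score <= 0:             # mistake: remember this row
                support.append(x)
                coeffs.append(y)
        return support, coeffs

This is also why, as mentioned earlier, the size of the model rather
than the size of the data tends to become the limiting factor.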

Hope this helps.


schaudhuri | 7 Nov 19:45 2011

Re: SVM Features in MADLib

Thanks for your quick response.

Another few questions:
Do your answers correspond with a linear kernel or RBF-Gaussian kernel
for SVM?

-----------------------------------------------------
I am still struggling to wrap my head around the answer to this
question below.

> 4) What is the largest training set that your team has tried with the
> MADLib SVM implementation and how long did it take?

We have tried millions to tens of millions of training points. They
all finish pretty fast.
------------------------------------------------

I do agree that the memory requirements make in-database SVM
desirable; but I still do not understand how it could compute the
support vectors rapidly in a single threaded environment.

jpm | 10 Nov 21:20 2011

installation problems on Mac OS X with PostgreSQL 9.0 and Python 2.6.1

I'm trying to install the MADlib in-database analytics components into
a PostgreSQL 9.0 installation on Mac OS X (Snow Leopard). When I try
to register the database, I get an error telling me that Python is not
up to date, even though it appears I'm running 2.6.1. Any help
appreciated.

JPMs-MacBook-Pro:~ johnmcdonald$ /usr/local/madlib/bin/madpack -p postgres -c Admin@localhost:5432/bpsimple install
Password for user Admin:
madpack.py : INFO : MADlib tools version    = 0.2.1beta (/usr/local/madlib/bin/../madpack/madpack.py)
madpack.py : INFO : MADlib database version = None (host=localhost:5432, db=bpsimple, schema=madlib)
madpack.py : INFO : Testing PL/Python environment...
madpack.py : ERROR : PL/Python version too old: 2.5.4. You need 2.6 or greater
madpack.py : ERROR : MADlib installation failed.
JPMs-MacBook-Pro:~ johnmcdonald$ python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
JPMs-MacBook-Pro:~ johnmcdonald$

Aleks | 10 Nov 21:29 2011

Re: installation problems on Mac OS X with PostgreSQL 9.0 and Python 2.6.1

Your Python looks fine, but your PostgreSQL must be running with an
older PL/Python and you need to rebuild it.
See: http://stackoverflow.com/questions/5921664/how-to-change-python-version-used-by-plpython-on-mac-osx
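
One way to confirm this from the client side is to ask the server
which Python its PL/Python is actually embedding. A sketch, assuming
psycopg2 is installed; the connection details are taken from the
madpack output above and the function name is made up:

    # The version reported here is the one madpack checks, not the one
    # on your shell's PATH.
    import getpass
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=5432, dbname="bpsimple",
                            user="Admin", password=getpass.getpass())
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("""
        CREATE OR REPLACE FUNCTION plpython_version() RETURNS text AS $$
        import sys
        return sys.version
        $$ LANGUAGE plpythonu;
    """)
    cur.execute("SELECT plpython_version();")
    print(cur.fetchone()[0])   # e.g. 2.5.4, even though `python` says 2.6.1
    cur.close()
    conn.close()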

Kee Siong Ng | 12 Nov 12:51 2011

Re: SVM Features in MADLib

> Do your answers correspond with a linear kernel or RBF-Gaussian kernel
> for SVM?

My answer is basically independent of the choice of kernel functions.

> I do agree that the memory requirements make in-database SVM
> desirable; but I still do not understand how it could compute the
> support vectors rapidly in a single threaded environment.

The point is not so much about being in-database or not, but about
online learning via stochastic gradient descent.
This page may help you:
  http://leon.bottou.org/projects/sgd

Xiaobo.Gu | 17 Nov 02:30 2011

Where to download the LAPACK rpm for CentOS 5.5 x64

Googled for a while, but have not found it.

