Limin Wang | 1 Mar 2007 02:57

Re: Scalability


Hi,

* Christian Bienia <cbienia <at> CS.Princeton.EDU> [2007-02-28 13:43:10 -0500]:

> Hi,
> 
> > For testing, change the #define X264_THREAD_MAX 16 (in common.h) to some
> > bigger number. Then I'll see about dynamic allocation.
> 
> The performance doesn't improve significantly with more than 16 threads.
> Here are the results with a sufficiently high X264_THREAD_MAX value:

X264_THREAD_MAX is only one of the limits; see the code below:
    h->param.i_threads = x264_clip3( h->param.i_threads, 1, X264_THREAD_MAX );
    h->param.i_threads = X264_MIN( h->param.i_threads,
        1 + (h->param.i_height >> h->param.b_interlaced) / (X264_THREAD_HEIGHT + 16) ); // FIXME exact limit?

So in your case, the actual thread count is capped at (integer division):
interlaced: 1 + (1080 >> 1)/(24+16) = 1 + 13 = 14
not interlaced: 1 + 1080/(24+16) = 1 + 27 = 28

In my opinion, it would be better for x264 to return an error if the requested
thread count exceeds X264_THREAD_MAX, instead of silently clipping it.
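
The two limits quoted above can be tried outside x264 with a small
standalone sketch. This is not the x264 source: clip3, min_int and
effective_threads are hypothetical stand-ins for x264_clip3, X264_MIN and
the two assignments, THREAD_HEIGHT = 24 is taken from the arithmetic in
this mail, and the thread maximum is passed as a parameter so that larger
values can be tested.

```c
/* Stand-in for X264_THREAD_HEIGHT; the value 24 is assumed from the
 * arithmetic in this thread, not read from the x264 headers. */
#define THREAD_HEIGHT 24

/* Hypothetical helpers mirroring x264_clip3 and X264_MIN. */
static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Thread count that survives both limits.
 * requested  : value asked for with --threads
 * height     : frame height in lines
 * interlaced : 0 or 1, used as a shift exactly as in the quoted code
 * thread_max : stand-in for X264_THREAD_MAX */
int effective_threads(int requested, int height, int interlaced, int thread_max)
{
    int n = clip3(requested, 1, thread_max);
    /* Integer division first, then the +1, as in C operator precedence. */
    int by_height = 1 + (height >> interlaced) / (THREAD_HEIGHT + 16);
    return min_int(n, by_height);
}
```

Calling effective_threads(512, 1080, 0, 512) and
effective_threads(512, 1080, 1, 512) shows the height limit taking over
once the thread maximum is raised far enough.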

Thanks,
Limin
Alex Izvorski | 1 Mar 2007 17:44

Re: Scalability

On Wed, 2007-02-28 at 15:49 -0700, Loren Merritt wrote:
> On Wed, 28 Feb 2007, Alex Izvorski wrote:
> > Thank you for some very interesting results.  Could you do a run with:
> >
> > x264 --bitrate=8000 --ref=1 --keyint=30 --scenecut=-1 --bframes=1
> > --no-b-adapt --threads ${NTHREADS} -o output.264 input_1920x1080_512.y4m
> >
> > and with threads up to 256 or 512?  I suspect that will scale much
> > better.  Thanks ;)
> 
> OK, I remembered a limiting factor: The motion estimation prepass
> (for b-adapt, scenecut, and ratecontrol) is singlethreaded. It runs
> concurrently with all the other threads, but only 1 thread is allocated to
> the prepass.
> If you disable those features (--qp 20 --no-b-adapt --scenecut -1) or if
> you run a 2nd pass (which doesn't do the prepass again), it should scale
> better.

Yes, that is what I was trying to do, I just didn't realize the
ratecontrol needs a prepass also ;)

> I would guess that Alex's command would scale worse than the original, 
> since Alex left in --bitrate (so a prepass is still needed), but increased 
> the speed of all the other options (so the prepass takes a larger 
> fraction of the total cpu-time).

Quite right.
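
To make that argument concrete, here is a toy model, not a measurement
and not x264 code. It assumes the prepass occupies exactly one thread, the
rest of the encode divides perfectly over the remaining threads, and both
run concurrently. The names modeled_time and modeled_speedup and the time
units are made up for illustration.

```c
/* Toy model of a single-threaded prepass running next to a perfectly
 * parallel encode. Times are in arbitrary units. */
double modeled_time(double prepass, double encode, int threads)
{
    if (threads < 2)
        return prepass + encode;              /* one thread does everything */
    double parallel = encode / (threads - 1); /* other threads share the encode */
    /* Both parts overlap, so the slower one decides the wall time. */
    return parallel > prepass ? parallel : prepass;
}

/* Speedup relative to running everything on a single thread. */
double modeled_speedup(double prepass, double encode, int threads)
{
    return modeled_time(prepass, encode, 1) /
           modeled_time(prepass, encode, threads);
}
```

In this model the speedup can never exceed (prepass + encode) / prepass,
so making the other options faster (a smaller encode term with the same
prepass) lowers the ceiling, which is the effect Loren describes.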

Christian, could you change the options I sent to: 


Hao Shen | 1 Mar 2007 18:38

Looking for the Unit Test Case of X264

Hello everyone,

I want to modify some parts of x264, and I would like to know whether there
are any unit tests for this project. Please let me know if you have any.

Best regards,
Hao Shen

Christian Bienia | 1 Mar 2007 19:52

Re: Scalability

On Thu, 2007-03-01 at 11:44, Alex Izvorski wrote:
> On Wed, 2007-02-28 at 15:49 -0700, Loren Merritt wrote:
> > On Wed, 28 Feb 2007, Alex Izvorski wrote:
> > > Thank you for some very interesting results.  Could you do a run with:
> > >
> > > x264 --bitrate=8000 --ref=1 --keyint=30 --scenecut=-1 --bframes=1
> > > --no-b-adapt --threads ${NTHREADS} -o output.264 input_1920x1080_512.y4m
> > >
> > > and with threads up to 256 or 512?  I suspect that will scale much
> > > better.  Thanks ;)
> > 
> > OK, I remembered a limiting factor: The motion estimation prepass
> > (for b-adapt, scenecut, and ratecontrol) is singlethreaded. It runs
> > concurrently with all the other threads, but only 1 thread is allocated to
> > the prepass.
> > If you disable those features (--qp 20 --no-b-adapt --scenecut -1) or if
> > you run a 2nd pass (which doesn't do the prepass again), it should scale
> > better.
> 
> Yes, that is what I was trying to do, I just didn't realize the
> ratecontrol needs a prepass also ;)
> 
> > I would guess that Alex's command would scale worse than the original, 
> > since Alex left in --bitrate (so a prepass is still needed), but increased 
> > the speed of all the other options (so the prepass takes a larger 
> > fraction of the total cpu-time).
> 
> Quite right.
> 
> Christian, could you change the options I sent to: 

Loren Merritt | 2 Mar 2007 04:04

Re: Scalability

On Thu, 1 Mar 2007, Christian Bienia wrote:
>
> Overall, the current parallelization of x264 still seems a little
> limited. Loren mentioned that fine-granular parallelization is possible,
> but more difficult. But a combined approach seems to be promising: A
> central work queue would contain all work units which await processing.
> A work unit can be a chunk of a frame (no more than 2-4 chunks per
> frame). As soon as a thread has finished working on a work unit, it
> enqueues any work units of subsequent frames which are now possible to
> encode (and dequeues its next work unit). This approach would also
> eliminate the need to have more threads than CPUs to get a high
> utilization. :-)

I didn't say fine-granular parallelization is hard, I said it doesn't work 
as well. But yes, it would also require much more rearranging of x264 
internals than the current threading did.

You can't divide a frame into large independent chunks without slices. And 
even if you did use slices, that's completely incompatible with 
frame-threading. The only temporal work division compatible with 
slice-threading is non-referenced B-frames and GOP-threading.

The sub-frame work division XviD uses is to encode consecutive macroblock 
rows in separate threads, making sure each row stays at least 2 MBs behind 
the previous. Then run another thread behind them all to do the bitstream 
writing. However, this reduces the temporal splitting possible by almost 
as much as it increases the spatial splitting, because there's that much 
more data in-progress that the next frame has to wait for. It also 
prevents bit-exact CABAC RDO, though I haven't simulated how much that 
would cost in compression quality.
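
As a rough illustration of the row scheme described above (a toy model,
not XviD or x264 code), the sketch below simulates one thread per
macroblock row in lock-step, where a row may encode its next MB only if
the previous row is at least `lag` MBs ahead or already finished. The
function name wavefront_steps and the lock-step timing are assumptions
made for the example.

```c
#include <stdlib.h>

/* Number of lock-step time steps needed to encode a frame of
 * mb_w x mb_h macroblocks with one thread per row.
 * Returns -1 on bad input or allocation failure. */
long wavefront_steps(int mb_w, int mb_h, int lag)
{
    if (mb_w <= 0 || mb_h <= 0 || lag < 0)
        return -1;

    int *done = calloc((size_t)mb_h, sizeof *done); /* MBs finished per row */
    if (!done)
        return -1;

    long steps = 0;
    while (done[mb_h - 1] < mb_w) {
        /* Walk bottom-up so each row sees the progress the row above it
         * had at the start of this step. */
        for (int r = mb_h - 1; r >= 0; r--) {
            if (done[r] >= mb_w)
                continue;
            int need = done[r] + lag;   /* progress required in row r-1 */
            if (need > mb_w)
                need = mb_w;            /* a finished row is always enough */
            if (r == 0 || done[r - 1] >= need)
                done[r]++;
        }
        steps++;
    }

    free(done);
    return steps;
}
```

For a 120x68 MB frame (1920x1080 rounded up to whole macroblocks) and a
lag of 2, the model needs 120 + 2*67 = 254 steps for 8160 MBs, so on
average only about 32 rows are busy no matter how many threads exist.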

elispezzi@tiscali.it | 2 Mar 2007 18:53

Profiling x264

How can I profile x264 with gprof?
Does anyone have any suggestions?
Thanks. 
Elisa


Tomas Carnecky | 2 Mar 2007 19:49

Re: Profiling x264

elispezzi <at> tiscali.it wrote:
> How I can profile x264 with gprof?
> Someone has any suggestion?

$ ./configure --help

which will eventually lead to

$ ./configure --enable-gprof

tom

telomere 1 | 3 Mar 2007 01:54

Speed-up method

What do you think?

QX6700
27.9GFlops

Geforce 8800GTX
(MADD(2flops) + MUL(1flop)) × 1,350 MHz × 128 SPs = 518.4G Flops

CUDA(compute unified device architecture)
http://developer.nvidia.com/object/cuda.html

Patrice Bensoussan | 3 Mar 2007 02:16

[PATCH] Fix compilation of altivec code on Mac OS X with gcc 3.3

Hello,

Here is a patch to fix the compilation of x264 altivec code with gcc  
3.3 on Mac OS X.
Patrice

Index: common/ppc/ppccommon.h
===================================================================
--- common/ppc/ppccommon.h	(revision 624)
+++ common/ppc/ppccommon.h	(working copy)
@@ -268,15 +268,16 @@
 #define PREP_DIFF_8BYTEALIGNED \
 LOAD_ZERO;                     \
 vec_s16_t pix1v, pix2v;        \
+vec_u8_t pix1v8, pix2v8;        \
 vec_u8_t permPix1, permPix2;   \
 permPix1 = vec_lvsl(0, pix1);  \
 permPix2 = vec_lvsl(0, pix2);  \

 #define VEC_DIFF_H_8BYTE_ALIGNED(p1,i1,p2,i2,n,d)     \
-pix1v = vec_perm(vec_ld(0,p1), zero_u8v, permPix1);  \
-pix2v = vec_perm(vec_ld(0, p2), zero_u8v, permPix2); \
-pix1v = vec_u8_to_s16( pix1v );                      \
-pix2v = vec_u8_to_s16( pix2v );                      \
+pix1v8 = vec_perm(vec_ld(0,p1), zero_u8v, permPix1);  \
+pix2v8 = vec_perm(vec_ld(0, p2), zero_u8v, permPix2); \
+pix1v = vec_u8_to_s16( pix1v8 );                      \
+pix2v = vec_u8_to_s16( pix2v8 );                      \

Loren Merritt | 3 Mar 2007 03:00

Re: Speed-up method

On Sat, 3 Mar 2007, telomere 1 wrote:

> QX6700
> 27.9GFlops
>
> Geforce 8800GTX
> (MADD(2flops) + MUL(1flop)) * 1,350 MHz * 128 SPs = 518.4G Flops

QX6700
2.67 GHz * 4 cpus * (1 or 2 sse op/cycle) * (8 int16 or 16 int8 in a sse reg)
= 85 to 341 GIPS

Though x264 at hq settings spends about 60% of its time in simd code, so 
the real estimated improvement is a factor of 2.5 for offloading all of 
that to the gpu.
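
The factor of 2.5 is Amdahl's law with 60% of the run time offloaded to
an infinitely fast device. A minimal sketch, with offload_speedup as a
made-up name and the acceleration factor as a free parameter:

```c
/* Amdahl's law: overall speedup when a fraction of the run time is made
 * `accel` times faster and the rest is unchanged.
 * offloaded_fraction is in [0, 1); accel must be > 0. */
double offload_speedup(double offloaded_fraction, double accel)
{
    return 1.0 / ((1.0 - offloaded_fraction) + offloaded_fraction / accel);
}
```

offload_speedup(0.6, accel) approaches 1 / (1 - 0.6) = 2.5 as accel
grows; any finite GPU advantage, or any transfer overhead, gives less.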

--Loren Merritt

