Michael Grier | 2 Apr 2011 21:04

Re: IMDb IDs aren't really necessary for names and titles

Thanks, Petite Abeille, for the information on the other type of
search URLs, which also cover searching for a company. I'll go with
those as they are more "standardized."

Davide Alberani | 6 Apr 2011 20:09

Re: IMDb IDs aren't really necessary for names and titles

On Thu, Mar 31, 2011 at 06:13, Michael Grier
<mr.michael.grier@...> wrote:
>
> It does work; you have to have the comma in there... (%2C)
>
> http://www.imdb.com/Name?Gibson%2C%20Mel%20%28I%29
>
> redirects to
>
> http://www.imdb.com/name/nm0000154/

Right; I'll look into integrating this solution instead of the current search
done to convert from titles/names to imdbID, thanks!
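As a rough illustration of the lookup discussed above, the sketch below builds the percent-encoded URL and pulls the ID out of a redirect target. It is only a sketch: the function names are made up for this example, the URL shape is copied from the thread and may no longer be served by IMDb, and the character encoding expected for non-ASCII names is not stated in the thread. No network request is made here.

```python
import re
from urllib.parse import quote

# Assumption: the legacy lookup URL shape quoted in this thread.
LEGACY_NAME_URL = "http://www.imdb.com/Name?"

# Matches the ID part of URLs such as http://www.imdb.com/name/nm0000154/
_ID_RE = re.compile(r"/(?:name|title)/((?:nm|tt)\d+)")


def build_name_lookup_url(name):
    """Percent-encode a flat-file style name for the legacy lookup URL."""
    # safe="" makes quote() encode the comma, parentheses and '+' as well.
    return LEGACY_NAME_URL + quote(name, safe="")


def imdb_id_from_location(location):
    """Return the nm/tt ID found in a redirect Location value, or None."""
    match = _ID_RE.search(location)
    return match.group(1) if match else None


print(build_name_lookup_url("Gibson, Mel (I)"))
print(imdb_id_from_location("http://www.imdb.com/name/nm0000154/"))
```

In a real client the Location value would come from the HTTP response to the lookup URL, with automatic redirect following turned off.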

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

Michael Grier | 8 Apr 2011 23:38

Re: IMDb IDs aren't really necessary for names and titles

On 4/6/11, Davide Alberani <davide.alberani@...> wrote:
> On Thu, Mar 31, 2011 at 06:13, Michael Grier <mr.michael.grier@...>
> wrote:
>>
>> It does work; you have to have the comma in there... (%2C)
>>
>> http://www.imdb.com/Name?Gibson%2C%20Mel%20%28I%29
>>
>> redirects to
>>
>> http://www.imdb.com/name/nm0000154/
>
> Right; I'll look into integrating this solution instead of the current
> search
> done to convert from titles/names to imdbID, thanks!

Some notes:
1. If you want to grab the nm ids, you will have to do it before you
canonicalize any names that IMDb did not in the flat files, OR save
the original name when you canonicalize a name.

2. You do not incur a "too many requests" type of penalty (I forget
what the actual message is, but you probably know what I'm talking
about) when you use the method I mentioned earlier to get the ID from
the Location header, but I would be wary of doing it too often (for
example, during the flat files import). Your IP could get banned. It
would also make the import take much longer.

3. It won't find:
    A: Anything with a + (plus symbol) in the name or title.

Michael Grier | 8 Apr 2011 23:44

Re: IMDb IDs aren't really necessary for names and titles

> 1. If you want to grab the nm ids, you will have to do it before you
> canonicalize any names that IMDb did not in the flat files, OR save
> the original name when you canonicalize a name.

Maybe just add another BOOL field to the db to indicate if the import
script canonicalized the name or not. Then later, if it was
canonicalized, normalize it before trying to send off the url.
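The flag idea could look roughly like the sketch below. Everything in it is hypothetical: the NameRow record, the field names, and the simplified normalize_name function, which only stands in for whatever routine the import script would really use to undo its own canonicalization.

```python
from dataclasses import dataclass


@dataclass
class NameRow:
    name: str            # name as stored in the database
    canonicalized: bool  # True if the import script rewrote the name


def normalize_name(name):
    """Simplified stand-in: turn "Surname, Name" back into "Name Surname"."""
    surname, sep, given = name.partition(", ")
    return f"{given} {surname}" if sep else name


def lookup_name(row):
    """Pick the string to send in the lookup URL for this row."""
    # Only undo the canonicalization the import script applied itself;
    # names that were already canonical in the flat files are sent as-is.
    return normalize_name(row.name) if row.canonicalized else row.name


print(lookup_name(NameRow("Gibson, Mel", canonicalized=False)))
print(lookup_name(NameRow("Cent, 50", canonicalized=True)))
```

The alternative mentioned earlier, storing the original name in its own column, avoids the round trip at the cost of extra storage.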


Davide Alberani | 10 Apr 2011 17:16

Re: IMDb IDs aren't really necessary for names and titles

On Fri, Apr 8, 2011 at 23:38, Michael Grier
<mr.michael.grier@...> wrote:
>
> 3. It won't find:
>    A: Anything with a + (plus symbol) in the name or title.

Have you tried replacing the plus symbol with '%2B' ?
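A small sketch of why the encoding might matter, assuming the server decodes the query string the way HTML forms are usually decoded, where a literal '+' stands for a space. The title used is just an example string, and the thread does not confirm whether IMDb accepted the '%2B' form.

```python
from urllib.parse import quote, unquote_plus

title = "Romeo + Juliet"  # example string containing a plus sign

# safe="" forces '+' to be sent as %2B instead of a literal plus.
encoded = quote(title, safe="")
print(encoded)

# Form-style decoding restores the plus only when it was sent as %2B.
print(unquote_plus(encoded))

# A literal '+' is turned into a space by the same decoding step.
print(repr(unquote_plus(title)))
```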

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

darklow | 11 Apr 2011 18:35

imdbpy2sql 4.7 - invalid byte sequence for encoding "UTF8"

Hello,

I keep getting this error at the same place:
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xc320

System:
IMDbPY: 4.7 (also tried the latest version from SVN: 4.8dev20110317)
Python: 2.6.6
PostgreSQL: 8.4.7 (database encoding is en_US.UTF8)
IMDb data: tried both the latest files and the version from February

Commands I tried to run:
./imdbpy2sql.py -d /www/imdb/data/ -u postgres://imdb:imdb@localhost/imdb2
I also tried:
./imdbpy2sql.py -d /www/imdb/data/ -u postgres://imdb:imdb@localhost/imdb2 -e 'AFTER_CREATE:SET client_encoding TO utf8'

Some facts to help diagnose the problem:
IMDbPY is not installed; I am running it from the sources.
Dependencies such as SQLObject are installed (SQLObject-0.12.4).
Running on Debian Linux.
Some time ago we used an installed version of IMDbPY and everything worked fine, even with the same data files as now. Since there is no stable 4.7 package for Debian yet, we uninstalled it and are now running from source.

After about 30 minutes of running, the script fails with the following error:

Error:
SCANNING actor: Havel, Jir?
 * FLUSHING CharactersCache...
Traceback (most recent call last):
  File "./imdbpy2sql.py", line 2959, in <module>
    run()
  File "./imdbpy2sql.py", line 2820, in run
    castLists(_charIDsList=characters_imdbIDs)
  File "./imdbpy2sql.py", line 1584, in castLists
    doCast(f, roleid, rolename)
  File "./imdbpy2sql.py", line 1543, in doCast
    cid = CACHE_CID.addUnique(role)
  File "./imdbpy2sql.py", line 966, in addUnique
    else: return self.add(key, miscData)
  File "./imdbpy2sql.py", line 959, in add
    self[key] = c
  File "./imdbpy2sql.py", line 869, in __setitem__
    self.flush()
  File "./imdbpy2sql.py", line 892, in flush
    self._toDB(quiet)
  File "./imdbpy2sql.py", line 1194, in _toDB
    CURS.executemany(self.sqlstr, self.converter(l))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xc320
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

Any suggestions? I found a similar topic, but it had no solution either.
I have run out of ideas :/
Could anyone help?
Thank you.

_______________________________________________
Imdbpy-help mailing list
Imdbpy-help@...
https://lists.sourceforge.net/lists/listinfo/imdbpy-help
Davide Alberani | 11 Apr 2011 21:46

Re: imdbpy2sql 4.7 - invalid byte sequence for encoding "UTF8"

On Mon, Apr 11, 2011 at 18:35, darklow <darklow@...> wrote:
>
>   File "./imdbpy2sql.py", line 1194, in _toDB
>     CURS.executemany(self.sqlstr, self.converter(l))
> psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xc320
> HINT:  This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".
>
> Any suggestions? I found similar topic, but there were also no solutions.

Yes, I've had other reports about this bug.
Seems to be related to some garbage in the actors.list.gz file.
I hope to have time to investigate the problem within a week or two.

Thanks for the bug report!

--

-- 
Davide Alberani <davide.alberani@...>  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

darklow | 13 Apr 2011 08:45

Re: imdbpy2sql 4.7 - invalid byte sequence for encoding "UTF8"

Since I am not familiar with Python, maybe you could suggest a quick fix so that the script doesn't stop?
Maybe this helps: in PHP we get exactly the same encoding error when importing wrongly encoded data. We have no control over the data, and we can't always call utf8_encode, since that could encode a string twice. To work around the error I use this function, which at least prevents the PostgreSQL error:

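// Return the string unchanged if it is already valid UTF-8;
// otherwise assume ISO-8859-1 and convert it with utf8_encode().
function fix_encoding($in_str) {
    $cur_encoding = mb_detect_encoding($in_str);
    if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
        return $in_str;
    } else {
        return utf8_encode($in_str);
    }
}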

Maybe you can help to adapt this function to Python if similar functions are available so we can use it as a quick fix?
Thanks a lot.
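A possible Python counterpart of the PHP function above, offered as an untested-in-production sketch rather than a fix endorsed by the IMDbPY author. It is written for Python 3 byte strings; the thread uses Python 2.6, where plain str objects play the role of bytes. Like utf8_encode(), it assumes that anything which is not valid UTF-8 is ISO-8859-1, which may or may not be true for the damaged rows.

```python
def fix_encoding(raw):
    """Return raw unchanged if it is valid UTF-8, else re-encode it from Latin-1."""
    try:
        # A strict decode is the validity check; the result is discarded.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is ISO-8859-1, as with utf8_encode().
        return raw.decode("latin-1").encode("utf-8")


print(fix_encoding(b"Jir\xed"))      # Latin-1 input gets converted
print(fix_encoding(b"Jir\xc3\xad"))  # valid UTF-8 passes through unchanged
```

Because already valid UTF-8 is returned untouched, the function does not double-encode, which is the concern raised about calling utf8_encode unconditionally.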



darklow | 13 Apr 2011 08:46

Re: imdbpy2sql 4.7 - invalid byte sequence for encoding "UTF8"

Maybe someone knows a quick and dirty fix, at least to skip strings with such invalid byte sequences until there is an official fix, so I can finish the import?
Can we detect invalid bytes? Maybe we can somehow replace or get rid of the 0xc320 sequence, which is the one that appears most often. Or skip these rows.

I analyzed the error a bit more. These errors mostly occur for Japanese actors (actors.list), where strange characters appear in the filmography:

Hayakawa, Yuzo
Burai hij8)

I tried to delete these rows manually, but there are too many of them :/
Thank you.
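On the question of detecting, skipping or scrubbing such rows, here is a minimal sketch of both options. It is not part of imdbpy2sql.py; the function names are invented for this example and the sample line is synthetic, built only to contain the 0xC3 0x20 pair from the error message. Written for Python 3 byte strings.

```python
def split_valid_utf8(lines):
    """Separate byte lines into valid ones and (line number, offset, line) reports."""
    valid, invalid = [], []
    for number, line in enumerate(lines, start=1):
        try:
            line.decode("utf-8")
            valid.append(line)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte.
            invalid.append((number, exc.start, line))
    return valid, invalid


def scrub_utf8(raw):
    """Drop undecodable bytes instead of skipping the whole row."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


# Synthetic sample: the second line contains the byte pair 0xC3 0x20.
sample = [b"Hayakawa, Yuzo\n", b"Burai hij\xc3 \n"]
good, bad = split_valid_utf8(sample)
print(len(good), "valid line(s)")
for number, offset, line in bad:
    print("line", number, "offset", offset, repr(line))
print(scrub_utf8(sample[1]))
```

Skipping loses the whole row, while scrubbing keeps the row but silently drops a character, so which one is acceptable depends on how the imported data will be used.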




darklow | 14 Apr 2011 09:54

Re: imdbpy2sql 4.7 - invalid byte sequence for encoding "UTF8"

Unfortunately, adding this line in the place you mentioned doesn't help:

k = k.replace('\xec\x8c\xa0', '')

Still the same error in the same place :(

SCANNING actor: Havel, Jir?
 * FLUSHING CharactersCache...
Traceback (most recent call last):
 .........
    self.flush()
  File "./imdbpy2sql.py", line 1195, in _toDB
    CURS.executemany(self.sqlstr, self.converter(l))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xc320

On Wed, Apr 13, 2011 at 11:56 PM, Davide Alberani <davide.alberani@...> wrote:
On Mon, Apr 11, 2011 at 18:35, darklow <darklow@...> wrote:
>
>   File "./imdbpy2sql.py", line 1194, in _toDB
>     CURS.executemany(self.sqlstr, self.converter(l))
> psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xc320
> HINT:  This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".

Hi all,
I'm writing regarding the recent "0xc320" problem with IMDbPY.
The above notice is extremely interesting, and should be investigated:
how can it be that 0xc320 is not UTF8 encodable?
It should work; from the Python prompt:
 >>> unichr(0xc320).encode('utf8')
 '\xec\x8c\xa0'

Anyway, as a very fast and dirty fix (the main problem is probably some
crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:
 k = k.replace('\xec\x8c\xa0', '')

So that the nearby lines will become:
           try:
               k = k.replace('\xec\x8c\xa0', '')
               t = analyze_name(k)
           except IMDbParserError:

Please be aware that this fix was not tested at all, but I'm
almost sure that, at the above point, 'k' is a string encoded in UTF-8.

Anyway, besides the "garbage theory", I have another idea
about the source of the error, but I have to verify it later...

Bye, and let me know if it works!

--

