Re: Feature Request: Improved duplicate checking for improved downloading.
jsnavely.geo <jsnavely.geo <at> yahoo.com>
2002-12-01 05:37:42 GMT
--- In bnr1 <at> y..., "viking_v" <viking_v <at> y...> wrote:
> Currently duplicates are eliminated based on name and approximate
> size range. This may result in many files incorrectly being tagged
> as duplicates and not downloaded. For example, there may be several
> files with name image001.jpg which are 240-260 KB.
We're starting to mix two different things here: FileCheck and Decoding. The FileCheck code uses the
guesstimated filenames and approximate file sizes within the user-specified tolerance to flag
articles as possibly matching existing files on the HD or in a CSV file. After the decoder has downloaded an
article listed in the queue and decoded the attached file(s), it performs a completely different check to
see if there is already a file stored in the same directory with the same filename and whether or not that
file is a duplicate to determine whether to discard this second copy or rename it to a unique filename.
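
To make the distinction concrete, here is a rough Python sketch of the two checks as described above. It is only an illustration, not the program's actual code: the function names, the filename-to-size mapping, the 5% default tolerance, and the "name(1).ext" renaming scheme are all assumptions made for the example.

```python
import os

def possibly_existing(name, est_size, known_files, tolerance=0.05):
    """FileCheck-style test, done before downloading (hypothetical helper).

    known_files maps filename -> size in bytes (from the HD or a CSV list).
    Returns True when the guessed name matches and the estimated size is
    within the given fraction of the known size.
    """
    size = known_files.get(name)
    if size is None:
        return False
    return abs(size - est_size) <= tolerance * size

def store_decoded(data, directory, name):
    """Decoder-style test, done after decoding (hypothetical helper).

    Returns the path written, or None when an identical file already
    exists under that name and the new copy is discarded.
    """
    target = os.path.join(directory, name)
    if os.path.exists(target):
        with open(target, "rb") as f:
            if f.read() == data:
                return None  # true duplicate: discard the second copy
        # Same name, different content: pick an unused "name(n).ext".
        stem, ext = os.path.splitext(name)
        n = 1
        while os.path.exists(os.path.join(directory, f"{stem}({n}){ext}")):
            n += 1
        target = os.path.join(directory, f"{stem}({n}){ext}")
    with open(target, "wb") as f:
        f.write(data)
    return target
```

With a 5% tolerance, a 250,000-byte estimate for image001.jpg is flagged against a known 245,000-byte file of the same name even though the two may be different pictures. Only the second check, which looks at the actual decoded bytes, can tell them apart.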
>
> Many other programs (e.g. Newsgrabber, Zeonews (R.I.P.) and Newsbin
> Pro (to mention a few)) use a "signature CRC", e.g. the first 25KB
> of the file is used to calculate a unique CRC. If a subsequent file
> has the same name and 25KB CRC, it will not be downloaded (the
> download stops after the initial 25KB). The chance for missing an
> original file is thus quite small.
>
> Since BNR already calculates a complete CRC, wouldn't it be
> relatively easy to add code to eliminate duplicates based on name
> and CRC (or a "signature CRC")?
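
For reference, the "signature CRC" idea quoted above could look roughly like the following Python sketch. This is a guess at the general technique, not how any of the named programs or BNR actually implement it: the function names and the set of (name, CRC) pairs are hypothetical, and zlib's CRC-32 stands in for whatever checksum those programs use. Note that a matching CRC makes a duplicate likely, not certain.

```python
import zlib

SIGNATURE_BYTES = 25 * 1024  # "the first 25KB", as in the quoted message

def signature_crc(data, length=SIGNATURE_BYTES):
    """CRC-32 over only the leading bytes of a (partially) decoded file."""
    return zlib.crc32(data[:length]) & 0xFFFFFFFF

def is_duplicate(name, first_chunk, seen):
    """Return True if this name and signature CRC were already recorded.

    seen is a set of (name, crc) pairs and is updated for new files.
    A caller would stop the download when this returns True.
    """
    key = (name, signature_crc(first_chunk))
    if key in seen:
        return True
    seen.add(key)
    return False
```

Two files named image001.jpg that differ anywhere in their first 25KB get different keys here, so both are kept. Two files that differ only after the first 25KB are treated as the same file.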
An option to have the program cancel all remaining parts of a multi-part article set when the first part
matches the beginning of an existing file may be useful provided there is a way for the user to override it
when they really do want to redownload something that the program thinks already exists. Maybe they
deleted it, or only part of it was previously downloaded, or one of the later parts was bad the first time, so
now they want to download it again. Having the program go further than that by having it hunt down and