kragen | 3 Nov 2005 09:37
Picon
Favicon

computing with really simple machines

Universal cellular automata
---------------------------

There's a rumor from Stephen Wolfram that certain one-dimensional
two-state three-cell-neighborhood cellular automata are
Turing-complete, in particular rule 110.  Here's a simple Python
script for running such 1-D CAs:

    #!/usr/bin/python
    import sys

    def display(line):
        sys.stdout.write(''.join([' #'[x] for x in line]) + '\n')

    def run(n=110, ng=1000, nc=400, display=display):
        cells = [0] * nc
        cells[nc/2] = 1
        for ii in range(ng):
            display(cells)
            newcells = [0] * len(cells)
            for jj in range(1, len(newcells)-1):
                index = 4 * cells[jj-1] + 2 * cells[jj] + cells[jj+1]
                newcells[jj] = n & (1 << index) and 1 or 0
            cells = newcells

    if __name__ == '__main__':
        run()

Supposing this rumor is correct, and rule 110 is indeed a universal
computer (which seems plausible even from the output from this
(Continue reading)

kragen | 7 Nov 2005 09:37
Picon
Favicon

making BitTorrent work better with aggregation

Observations of BitTorrent behavior, with an oversimplified model
-----------------------------------------------------------------

(This varies a lot from torrent to torrent.)

On average, the number of seeders on a BitTorrent torrent is around
10% of the number of leechers, a number that gradually decreases until
the torrent dies.  This is not true for all torrents, but it's true of
a substantial number of them.

If you're using such a torrent to publish a file (such as a home video
or collection of multi-megapixel photographs) that you wish to remain
continuously available, you must operate your own permanent seed.
However, you'd like most of the file data in your torrent to travel
from one peer to another, rather than from your own seed to the peers.

To simplify, I assume that there are no incomplete downloads, and each
leecher becomes a seeder for some period of time after they finish ---
about 10% of the time they had been leeching.

As time goes on, the number of leechers in the torrent approaches the
time to download a complete copy of the file, multiplied by the
frequency of new requests; P, the number of total peers aside from the
permanent seed, is the number of leechers multiplied by a constant
around 1.1.

Whenever P is 1 or 0, all of the data transmitted in the system must
comes from your permanent seed, and BitTorrent degrades to a download
protocol similar to FTP or HTTP, but with better protection against
corruption of data in transit.
(Continue reading)

kragen | 10 Nov 2005 09:37
Picon
Favicon

factoring numbers in your head

An efficient way to eliminate possible (prime) factors in your head is
to add and subtract multiples of the possible factor to the number to
be factored so as to make a multiple of 10, then divide the number by
10.  The resulting number will be divisible by the possible factor
(unless the possible factor contains 2 or 5) iff the original number
was.

How do you know what to multiply your possible factor by?

If your possible factor ends with 1 or 9, then:
- if your putative composite ends in 1 or 9, multiply by 1;
- if it ends in 2 or 8, multiply by 2;
- if it ends in 3 or 7, multiply by 3;
- if it ends in 4 or 6, multiply by 4;
- if it ends in 5, multiply by 5.

>>> lastdigits = Numeric.multiply.outer([1, 3, 7, 9], Numeric.arange(10)) % 10
>>> complements = 10 - lastdigits
>>> Numeric.minimum(lastdigits, complements)
array([[0, 1, 2, 3, 4, 5, 4, 3, 2, 1],
       [0, 3, 4, 1, 2, 5, 2, 1, 4, 3],
       [0, 3, 4, 1, 2, 5, 2, 1, 4, 3],
       [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]])

So if your possible factor ends with 3 or 7, then:
- if your putative composite ends in 1 or 9, multiply by 3;
- if composite ends in 2 or 8, multiply by 4.
- if composite ends in 3 or 7, multiply by 1;
- if composite ends in 4 or 6, multiply by 2;
- if composite ends in 5, multiply by 5.
(Continue reading)

kragen | 14 Nov 2005 09:37
Picon
Favicon

potential for ebook reader platforms

E-Ink is selling $3000 "technology evaluation kits" for its "digital
paper" --- 400MHz XScale Gumstix board pre-installed with Linux (with
Bluetooth, USB, MMC), 800x600 6" E-Ink display, batteries, power
supply.  No case or human input device.  US$3000; fax them an order
form.

<http://www.eink.com/kits/>
<http://www.linuxdevices.com/news/NS9257262400.html>

This is nice, because for a little over $100 ($120 or so) you can get
2GB MMC cards <http://www.pricewatch.com/prc.aspx?i=226&a=396866>.  At
present, Project Gutenberg's FTP site lists 25004 text files totaling
17.45 GB (by my crude count) <insert reference here>.  You could
compress almost 5GB of text files onto this 2GB MMC card; in the case
of Project Gutenberg, that would be around 7000 text files, probably
7000 books --- a substantial fraction of the English literary canon.

>From the Internet Archive's Million Books scanning project, I
downloaded a scan of a dictionary (Walter Farquhar Hook's Church
Dictionary, 1842).  It weighs in at 24.5 MB for high-res scans of 558
pages, compressed with DjVu.  2GB, then, is space for almost 49000
scanned pages.

In December 2004, English Wikipedia's TomeRaider file was 537MB
compressed
<http://en.wikipedia.org/wiki/Wikipedia:TomeRaider_database>, but now
it appears to be close to 900MB
<http://download.wikimedia.org/tomeraider/current_tr3/> including
images and text but excluding page history and sounds.

(Continue reading)

Kragen Sitaker | 17 Nov 2005 09:37
Picon
Favicon

tolerance for ambiguity makes things easier (such as search)


Disclaimer
----------

I know basically nothing about OCR, handwriting recognition, speech
recognition, or automatic translation.

Tolerance for Ambiguity Makes Things Easier
-------------------------------------------

Traditionally, the goal of OCR, handwriting recognition, speech
recognition, or automatic translation has been to find the single
correct interpretation of marks on paper or sounds in a signal stream.
Often, though, it's much easier to produce a description of a set of
possible interpretations (perhaps as a finite-state machine or
infinite-order Markov model) which reliably includes the correct
interpretation than it is to reliably choose the correct
interpretation.  When is this ambiguous representation good enough to
be useful?

Downstream disambiguation
-------------------------

Traditional (e.g.) OCR software proceeds as a pipeline of stages:
convert an image into marks; convert the marks into symbols; convert
the symbols into letters; turn the letters into lines of words,
columns, paragraphs, and so forth.  In its simplest form, this works
very badly; one way to remedy this is to pass sets of candidates from
one stage to the next.

(Continue reading)

kragen | 21 Nov 2005 09:37
Picon
Favicon

A low-tech way to apply ECC for disk storage

Suppose you're in the not terribly unusual position of having a bunch
of big files (megabytes at least) on a bunch of unreliable disks.
Every once in a while, one of the disks fails, or user error deletes a
file.  But you would prefer not to lose the data in any of these
files, because it's valuable.  What do you do?

A straightforward and reliable solution is to keep multiple copies of
each file, say two or three, and a message-digest of each file (such
as MD5 or SHA-1) so you can detect silent corruption.  When you notice
a disk has failed or a file is missing, you make another copy of one
of the remaining copies.

This scheme is appealingly simple, but the copies are expensive, which
creates a temptation to skimp on the mirroring in some applications.

ECC
---

An error-correcting code transforms some number of bits N into a
larger number of bits M, and can do the reverse transformation
(usually very quickly) even if you flip or erase some of the M bits.
Typically you can erase up to any M-N bits or flip up to any
ceil((M-N-1)/2) bits.  In many codes, the first N of the M bits are
identical to the original M bits --- you just append M-N bits to the
original bits.  (I think these codes are called RCH codes.)

The simplest such code is the parity code: M=N+1, and you XOR your N
bits together to get your Mth bit.  This can detect single-bit errors,
and if you know which bit is lost, it can correct them, too.

(Continue reading)

Kragen Sitaker | 24 Nov 2005 09:37
Picon
Favicon

folkschemas and semistructured data


(1200 words, sorry)

I was talking with Tantek Celik tonight (2005-11-23) for an hour or two
about folksonomies, why they work, and user-defined data structures.
The ideas in this post came almost entirely from our discussion, but I
don't know which ones he deserves credit for and which ones I deserve
blame for, but he definitely deserves credit for the term "folkschemas".

A couple of new services (Google Base[insert ref], Ning/24HL[insert
ref], JotSpot <http://www.jot.com/>) allow people to put arbitrary sets
or bags of key-value pairs into a big shared store, and then run
arbitrary queries on that store, queries more or less of the type that
Partial Match Indexing [insert ref] optimizes: "field foo has value bar,
and field baz has value quux," etc.  This is sort of the thing RDF was
intended to support [insert ref], but without a single centralized
database --- just a bunch of XML files on the web.  So if a group of
people happen to use the same field name (predicate name, edge label,
relationship, etc.), you can search across all of their data records
with a single query, at least if that data is indexed in some index that
you have access to.

Obviously, this can result in the Tower of Babel problem: it's not at
all certain that different people will in fact use the same field names
when they mean the same thing, and, worse, it's not at all certain that
they will use different field names when they mean different things.

This is also a potential problem with folksonomy systems like del.icio.us:
I may choose 'sf' as my tag for a semantic category that you call
'sanfrancisco' or 'san-francisco', and so searches by tag won't have
(Continue reading)

Kragen Sitaker | 28 Nov 2005 09:37
Picon
Favicon

Make for URLs, or dependency-directed compositing

(3800 words, terribly sorry)

Lots of web pages are computed dynamically from some set of resources.
For example, a blog's front page is computed dynamically from a
(potentially large) set of blog posts.  The display of an RSS aggregator,
such as the LiveJournal friends page, is computed dynamically from a
subscription list and a bunch of friends' blogs.  Many web pages use some
kind of templating system, so the page consists at least of "content"
and the template.

Performance problems in web page compositing
--------------------------------------------

These applications have a set of performance problems.  Retrieving the
resources they depend on can take time, use bandwidth, and cost other
people money; dynamically generating the web pages themselves uses CPU
time; and all of this often happens while a human being is impatiently
waiting for a response.

Dependency-directed software builds: make
-----------------------------------------

Normally people write software as a human-readable "source code file"
which gets "compiled" (by a slow program) into an "object file" containing
"machine code", which then gets "linked" with other object files (by a
fast program) into an "executable" which they can run.  When they make
changes to the source code, they often want to see the results of their
changes, and they have to rerun this whole process.  But typically, they
haven't changed all the source files, so they only have to recompile
some of the source files.
(Continue reading)

Kragen Sitaker | 1 Dec 2005 09:37
Picon
Favicon

semistructured data: summary of six years of wishes

Now I am feeling the need for a personal semistructured store more than
ever.  This is a searchable database that supports "data first,
structure later" <http://www.betaversion.org/~stefano/linotype/news/93/>
(i.e. you can add a schema incrementally to existing data entered
without one, or with a very primitive one), but supports enough
structure to render web pages and things like that.  Here are some
options:

- RDF/Sniki-style: label edges in a directed graph with nodes, which
  have names.  Define tables with Sniki-style queries.  Add field
  metadata with edges from the field label node.  Queries are different
  from edges; perhaps stored in nodes, perhaps defined in some manner by
  a local graph structure.  Adding data to a query result is the most
  convenient way to add it.
- UnQL-style: label edges in a directed graph with strings; nodes are
  anonymous; define queries by replacing regular expressions of edges
  with edges; display tables perhaps by going two levels deep from a
  particular node.  (OEM, object exchange model, is essentially
  identical to this.)
- Python/Perl/JavaScript-style: nodes are anonymous, but edges are
  uniquely labeled within a node with strings, and some nodes are
  special primitive types, like strings.  Some edges are 'virtual' and
  computed on demand by arbitrary code that walks the neighborhood.
  There are nodes that are lists.
- askSam-style: nodes are small text files, with a syntactic convention
  to specify fields.  Queries are full-text searches, perhaps limited to
  certain fields; reports can extract values of certain fields to get
  tabular output.
- XML-style: data is stored in a single hierarchy in which the children
  of each node are ordered, and each node is labeled with an element
(Continue reading)


Gmane