Joshua Haberman | 1 Feb 11:31 2009

hash metamethod?

Languages like Ruby, Python, etc. allow you to define hash functions that apply to your objects.  You define a hash function and a comparison function, and this lets user-defined types be keys in hash tables based on their values.

Does anyone have a Lua PowerPatch that implements anything like this?

I know this question comes up from time to time on this list (like [0] and [1]), and I'm familiar with the objections.  Let me address them:

1. writing fast/correct hash functions in Lua is too hard.  To me this is the easiest objection to answer.  Lua internally already has good hash functions for its built-in types.  And every Lua value you might want to create a hash function for is nothing but a combination of Lua's existing types.  So expose to Lua programs Lua's internal hash functions, along with a function that combines multiple hash values in an intelligent way.  Every Lua-defined hash function would boil down to nothing but calls to Lua's internal hashing functions and the combining function.  Presto: good, high-performing hash functions.

2. hashing mutable structures is too risky, because bad things happen if you mutate them.  it's true that programmers can shoot themselves in the foot by doing the wrong thing, and this is a bit subtle.  but Ruby and Python do it, and I've never come across any horror stories of the one time I was up until 2am tracking down a bug where I mutated my hash keys.  Add some stern warnings in the __hash documentation and be done with it.

3. You can do this by having a canonical instance of each object you want to hash, by using a memotable on construction.  This works to a certain extent, but is annoying for many practical reasons.  For example, your objects might not be immutable for their entire lifetimes, you might just want to take a second to find and remove duplicates among objects that at other times will be modified.  Also, it's non-intuitive in general to have code that looks like it's constructing an object but may actually return an existing object that you're sharing with someone else.  The people you're sharing the object with could mess up and mutate the object even though they're not supposed to.  It's just a strange paradigm in general, and personally I'd much rather live with the potential strangeness of having a __hash metamethod than deal with the strangeness of canonical objects (and I say this as someone who's tried both).

There is one problem with my proposed scheme that bothers me a bit.  The question is: what would a table return as its default __hash and __eql methods?  On one hand I'd want it to treat the table by value, calling __hash and __eql on all of its individual members, so that tables with the same contents hash the same.  On the other hand, I wouldn't want to give up the ability to hash by table identity, because it's sometimes very useful.  I don't have a good answer for this yet.

In any case, like I said I'm mainly wondering if anyone has a PowerPatch, I'm not going to argue that it should go into the main Lua (though I wouldn't mind that either :)

Josh

[0] http://lua-users.org/lists/lua-l/2003-02/msg00525.html
[1] http://lua-users.org/lists/lua-l/2005-06/msg00127.html

Alex Davies | 1 Feb 14:46 2009
Picon
Picon

Re: hash metamethod?

Hey Joshua,

> There is one problem with my proposed scheme that bothers me a bit.  The 
> question is: what would a table return as its default __hash and __eql 
> methods?  On one hand I'd want it to treat the table by value, calling 
> __hash and __eql on all of its individual members, so that tables with the 
> same contents hash the same.

I just want to say definitely not :).  I don't think even a second should be 
spent considering that - this is Lua, not Python.  We're not afraid to call 
two different tables just that - different.  (Don't get me wrong, I think 
Python's great - but I do think that this as default behaviour of its kind 
of weird/pointless, and I cannot see how it could be implemented remotely 
efficiently.)  Similarly simply adding tables as keys (oftenly done) would 
be very slow -potentially many comparisons of equality have to be done on 
each key lookup - tables are all we have, lets keep them fast.

Anyway, a patch for a simple __hash would not be hard to make.  Only a 
couple of places would have to be modified.  Again though, you have to 
remember that tables are integral to Lua's performance - anything that 
bloats or slows them really has to be considered very carefully.  And with 
that in mind, Lua has no extra space in the nodes in its tables to store the 
hash of each key. So whenever the table gets resized, table keys would have 
to be checked for a __hash metamethod and recalled (how does Python handle 
it?).  Along with being slow, this could also cause problems in the future 
in the unlikely case that the gc starts resizing tables.  (Actually I lie - 
if you want to 8 byte align the Node data structure you could fit a hash 
value in there).

Personally I do think a __hash metamethod would be a large improvement where 
Lua is used largely as a standalone language.  As an extension/scripting 
language not quite so important.

Anyway, good luck keep us informed. :)

- Alex 

Graham Wakefield | 1 Feb 18:11 2009
Picon

Re: anti...require?

Here's what I use for this situation. You create a lua_CFunction as  
the module unload handler. When the module is collected, either when  
the lua_State is closed, or when the package.loaded["modulename"] is  
nilled, this lua_CFunction is called.

Use it right after your luaL_register call in luaopen_x, e.g.:

luaL_register(L, "modulename", module_lib);
gc_sentinel(L, -1, module_unload_function);

/*
	get a callback when the value at stack index idx is collected:
*/
static void gc_sentinel(lua_State * L, int idx, lua_CFunction  
callback) {

	lua_pushvalue(L, idx); // value  <at> idx
	lua_newuserdata(L, sizeof(void *)); // sentinel userdata
		lua_newtable(L);	// userdata metatable with __gc = callback
		lua_pushcfunction(L, callback);

		lua_setfield(L, -2, "__gc");
		lua_setmetatable(L, -2);

	/* check for (weak-valued) sentinel table; create if needed */
	lua_getfield(L, LUA_REGISTRYINDEX, "gc_sentinels");
	if (lua_isnoneornil(L, -1)) {
		lua_pop(L, 1);
		lua_newtable(L);

		// make weak-keyed
		lua_pushstring(L, "k");
		lua_setfield(L, -2, "__mode");
		lua_pushvalue(L, -1);
		lua_setfield(L, -2, "__index");
		lua_pushvalue(L, -1);
		lua_setmetatable(L, -2);
		lua_pushvalue(L, -1);
		lua_setfield(L, LUA_REGISTRYINDEX, "gc_sentinels");
	}

	lua_insert(L, -3);
	lua_settable(L, -3); // lua::sentinel[value  <at> idx] = sentinel userdata
	lua_pop(L, 1); // lua::sentinel
}

On Jan 31, 2009, at 11:31 AM, Luiz Henrique de Figueiredo wrote:

>>> You cannot set __gc for tables. But you can say use a udata as  
>>> upvalue for
>>> all functions registered in the module and then clean up when that  
>>> udata
>>> is collected.
>>
>>
>> I have also thought about what to do in this situation and I am still
>> learning these tricks. Any examples would be very helpful. Thanks.
>
> See loadlib.c, which sets __gc for library handles to tell the OS to
> unload libraries from the app when they become garbage in Lua. This is
> not the same technique I described above, though.

Sam Roberts | 1 Feb 19:09 2009
Picon

Re: hash metamethod?

On Sun, Feb 1, 2009 at 2:31 AM, Joshua Haberman <joshua <at> reverberate.org> wrote:
> Languages like Ruby, Python, etc. allow you to define hash functions that
> apply to your objects.  You define a hash function and a comparison
> function, and this lets user-defined types be keys in hash tables based on
> their values.

I'm curious what you would use this for?

I've only seen it used once in ruby, in resolve.rb where DNS names are
a sub-class String to make a case-insensitive (but case-preserving)
string class. It worked fine, but if the language hadn't supported
this it would have been easy to achieve the goals in other ways.

I think you could build in pure lua a data structure class, say Hash,
that uses metamethods to index it's entries based on their __hash().
Would that be an easier approach than patching the runtime, or not
work for what you have in mind?

For userdata, a common approach is to cache them in a weak valued
table, to avoid creating identical userdata.

Cheers,
Sam

Picon

Re: hash metamethod?

> Languages like Ruby, Python, etc. allow you to define hash functions that
> apply to your objects.  You define a hash function and a comparison
> function, and this lets user-defined types be keys in hash tables based on
> their values.

In Lua, any value can be used as a table key. But for objects, this means
that their address is used rather than the contents. Is that the issue?
If so, can't it be addressed by calling new/put/get functions that compute
new keys for objects based on whatever is needed by the app? And of course
use a table as the underlying container.

Joshua Haberman | 1 Feb 20:42 2009

Re: hash metamethod?

Alex Davies <alex.mania <at> iinet.net.au> writes:
> Hey Joshua,
> 
> > There is one problem with my proposed scheme that bothers me a bit.  The 
> > question is: what would a table return as its default __hash and __eql 
> > methods?  On one hand I'd want it to treat the table by value, calling 
> > __hash and __eql on all of its individual members, so that tables with the 
> > same contents hash the same.
> 
> I just want to say definitely not :).  I don't think even a second should be 
> spent considering that - this is Lua, not Python.  We're not afraid to call 
> two different tables just that - different.  (Don't get me wrong, I think 
> Python's great - but I do think that this as default behaviour of its kind 
> of weird/pointless, and I cannot see how it could be implemented remotely 
> efficiently.)  Similarly simply adding tables as keys (oftenly done) would 
> be very slow -potentially many comparisons of equality have to be done on 
> each key lookup - tables are all we have, lets keep them fast.

I see your point.  Luckily, thinking about this more I realized I could get
value-based table hashing by doing it myself in Lua.  If __hash and __eql were
supported, I could do this:

-- The hash function exposed by the Lua C runtime.  Does something like:
function hash(value)
  local hash_metamethod = getmetatable(value).__hash
  if hash_metamethod then
    return hash_metamethod(value)
  else
    return lua_internal_hash(value)
  end
end

-- The hash combiner function exposed by the Lua C runtime.  Does
-- something like:
function combinehash(existing_hash, hash_to_add)
  -- xor doesn't exist in Lua but it does in C!
  return existing_hash ^ hash_to_add
end

-- A hash function for tables that hashes all tables and subtables that
-- have no __hash metamethod by their *value*.
function table_value_hash(table)
  local hash_val = 0
  for k,v in pairs(table) do
    hash_val = combinehash(hash, hash(k))
    if type(v) == "table" and not getmetatable(v).__hash then
      -- Hash sub-tables with no __hash metamethod by value also.
      hash_val = combinehash(hash_val, table_value_hash(v))
    else
      hash_val = combinehash(hash_val, hash(v))
    end
  end
  return hash_val
end

t = {}
t[{}] = "foo"  -- works like it always has

value_table1 = {foo={bar="baz"}}
value_table2 = {foo={bar="baz"}}

value_mt = {__hash=table_value_hash, __eql=table_value_eql}
setmetatable(value_table1, value_mt)
setmetatable(value_table2, value_mt)

-- the two value tables now hash to the same thing, despite the fact that
-- they are completely distinct objects and have nested data.
t[value_table1] = true
t[value_table2] = true  -- not inserted, already present

> Anyway, a patch for a simple __hash would not be hard to make.  Only a 
> couple of places would have to be modified.  Again though, you have to 
> remember that tables are integral to Lua's performance - anything that 
> bloats or slows them really has to be considered very carefully.  And with 
> that in mind, Lua has no extra space in the nodes in its tables to store the 
> hash of each key. So whenever the table gets resized, table keys would have 
> to be checked for a __hash metamethod and recalled (how does Python handle 
> it?).

Yep -- seems like a simple time/space tradeoff to be made here.  Either give
table nodes extra space to save their hash value, or take the time hit of
recomputing it.

> Along with being slow, this could also cause problems in the future 
> in the unlikely case that the gc starts resizing tables.  (Actually I lie - 
> if you want to 8 byte align the Node data structure you could fit a hash 
> value in there).

Excellent.  :)

> Personally I do think a __hash metamethod would be a large improvement where 
> Lua is used largely as a standalone language.  As an extension/scripting 
> language not quite so important.

Understood -- I am using it as a standalone language.  I might as well point you
all to it now.  I'm using Lua to write the compiler (and soon the extension
language too) for a new parsing framework called Gazelle:

http://www.gazelle-parser.org/

I've been working on it for almost two years now, and while it still has a long
way to go, I'm making great strides.  One of my next goals is for it to be able
to parse Lua -- see my prototype Lua grammar for Gazelle here (it doesn't
actually work yet, because it depends on features of Gazelle that aren't
finished yet):

http://github.com/haberman/gazelle/blob/d2393dbe1ea61c2281bf273761e083970f8a84c1/sketches/lua.gzl

In any case, the compiler involves a lot of set manipulation, so being able to
define identity with __hash and __eql would be really nice.

The reason I've chosen to use Lua standalone here is that I LOVE how lightweight
and clean its runtime is.  No other language runtime comes close. It's the only
high-level language that I'd feel comfortable asking my users to link into their
binaries.  And I also like how fast it is.

> Anyway, good luck keep us informed. :)

Thanks!  :)

Josh

Joshua Haberman | 1 Feb 20:54 2009

Re: hash metamethod?

Sam Roberts <vieuxtech <at> gmail.com> writes:
> On Sun, Feb 1, 2009 at 2:31 AM, Joshua Haberman <joshua <at> reverberate.org>
wrote:
> > Languages like Ruby, Python, etc. allow you to define hash functions that
> > apply to your objects.  You define a hash function and a comparison
> > function, and this lets user-defined types be keys in hash tables based on
> > their values.
> 
> I'm curious what you would use this for?

Lots of set manipulation.  I'm using Lua to write the compiler for a parsing
framework (http://www.gazelle-parser.org), and this compiler does quite a lot of
Set manipulation:

$ grep Set:new compiler/* | wc -l
      63

Several times I've been forced to resort to stringifying my objects to get value
semantics, which is inefficient and feels like a hack.

> I've only seen it used once in ruby, in resolve.rb where DNS names are
> a sub-class String to make a case-insensitive (but case-preserving)
> string class. It worked fine, but if the language hadn't supported
> this it would have been easy to achieve the goals in other ways.
> 
> I think you could build in pure lua a data structure class, say Hash,
> that uses metamethods to index it's entries based on their __hash().
> Would that be an easier approach than patching the runtime, or not
> work for what you have in mind?

I would have to completely reimplement hash tables in Lua to get this to work,
which would be slow.  There's no way from Lua to use the Lua table
implementation's hash collision resolution algorithm.

Josh

Joshua Haberman | 1 Feb 20:59 2009

Re: hash metamethod?

Luiz Henrique de Figueiredo <lhf <at> tecgraf.puc-rio.br> writes:
> > Languages like Ruby, Python, etc. allow you to define hash functions that
> > apply to your objects.  You define a hash function and a comparison
> > function, and this lets user-defined types be keys in hash tables based on
> > their values.
> 
> In Lua, any value can be used as a table key. But for objects, this means
> that their address is used rather than the contents. Is that the issue?
> If so, can't it be addressed by calling new/put/get functions that compute
> new keys for objects based on whatever is needed by the app? And of course
> use a table as the underlying container.

Using table keys that fully distinguish the object from all other objects is
inefficient, because the key has to contain as much information as the object
itself.  Creating such keys (which are usually strings) feels inelegant, hacky,
and is a poor way of using a hash table, because the string is just going to get
hashed again by Lua.  This is essentially two-pass hashing.

Using table keys that do not fully distinguish the object from all other objects
means that you have to implement your own hash collision algorithm, which is
inefficient and error prone compared to using the algorithms that Lua already
uses internally.

Josh

Jerome Vuarand | 2 Feb 02:30 2009
Picon

Re: hash metamethod?

2009/2/1 Joshua Haberman <joshua <at> reverberate.org>:
> Languages like Ruby, Python, etc. allow you to define hash functions that
> apply to your objects.  You define a hash function and a comparison
> function, and this lets user-defined types be keys in hash tables based on
> their values.
>
> [...]
>
> In any case, like I said I'm mainly wondering if anyone has a PowerPatch,
> I'm not going to argue that it should go into the main Lua (though I
> wouldn't mind that either :)

You don't need a power-patch, with the help of existing metamethods,
you can implement your own hashing algorithm. Here is an example:

http://lua-users.org/wiki/ComparisonByValue

Joshua Haberman | 2 Feb 03:33 2009

Re: hash metamethod?

Jerome Vuarand <jerome.vuarand <at> gmail.com> writes:
> 2009/2/1 Joshua Haberman <joshua <at> reverberate.org>:
> > Languages like Ruby, Python, etc. allow you to define hash functions that
> > apply to your objects.  You define a hash function and a comparison
> > function, and this lets user-defined types be keys in hash tables based on
> > their values.
> >
> > [...]
> >
> > In any case, like I said I'm mainly wondering if anyone has a PowerPatch,
> > I'm not going to argue that it should go into the main Lua (though I
> > wouldn't mind that either :)
> 
> You don't need a power-patch, with the help of existing metamethods,
> you can implement your own hashing algorithm. Here is an example:
> 
> http://lua-users.org/wiki/ComparisonByValue

This scheme requires that two objects that are not equivalent do *not* have the
same "hash" value.  At this point it's not really a hash anymore, it's a
fingerprint:

http://en.wikipedia.org/wiki/Fingerprint_(computing)

The fingerprint you give to Lua as a table key will then be run through the
usual hash table algorithm, and any two fingerprints that hash to the same value
will undergo the usual hash table collision resolution.

Generating short fingerprints means implementing something like Rabin's
algorithm.  That's complicated, so more likely a fingerprint will just be all
the fields of your object concatenated together in some strange and ad-hoc way.
 And once you've done that you *still* haven't even begun the actual hashing
part yet.

Also, with this scheme the table doesn't have any reference to the real key,
only to the fingerprint.  So you can't use that table to iterate over the
original keys.  Though if you're only using the table as a set, you can always
make the set members the table values also.

All of these things can be worked around in various ways.  But it's awkward and
the performance will probably not be as good as having the real hash table
implementation call your hash function and equality comparisons.

The fingerprint method would probably be more palatable if there was an
implementation of Rabin's algorithm available for Lua and written in C that
worked really nicely on Lua types.  That way you could generate unique
fingerprints without having to do ugly string concatenation of all your table's
data, which is both inefficient and error-prone.  And annoying.  :)

Josh


Gmane