Python实现的Google simhash算法

[python]#!/usr/bin/env python
# -*- coding=utf-8 -*-

# Implementation of Charikar simhashes in Python
# See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1

class simhash():
def __init__(self, tokens='', hashbits=128):
self.hashbits = hashbits
self.hash = self.simhash(tokens)

def __str__(self):
return str(self.hash)

def __long__(self):
return long(self.hash)

def __float__(self):
return float(self.hash)

def simhash(self, tokens):
# Returns a Charikar simhash with appropriate bitlength
v = [0]*self.hashbits

for t in [self._string_hash(x) for x in tokens]:
bitmask = 0
#print (t)
for i in range(self.hashbits):
bitmask = 1 << i
#print(t,bitmask, t & bitmask)
if t & bitmask:
v[i] += 1 #查看当前bit位是否为1，是的话则将该位+1
else:
v[i] += -1 #否则得话，该位减1

fingerprint = 0
for i in range(self.hashbits):
if v[i] >= 0:
fingerprint += 1 << i
#整个文档的fingerprint为最终各个位大于等于0的位的和
return fingerprint

def _string_hash(self, v):
# A variable-length version of Python's builtin hash
if v == "":
return 0
else:
x = ord(v[0])<<7
m = 1000003
mask = 2**self.hashbits-1
for c in v:
x = ((x*m)^ord(c)) & mask
x ^= len(v)
if x == -1:
x = -2
return x

def hamming_distance(self, other_hash):
x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
tot = 0
while x:
tot += 1
x &= x-1
return tot

def similarity(self, other_hash):
a = float(self.hash)
b = float(other_hash)
if a>b: return b/a
return a/b

if __name__ == '__main__':
#看看哪些东西google最看重？标点？
s = '看看哪些东西google最看重？标点？'
hash1 =simhash(s.split())
#print("0x%x" % hash1)
#print ("%s\t0x%x" % (s, hash1))

s = '看看哪些东西google最看重！标点！'
hash2 = simhash(s.split())
#print ("%s\t[simhash = 0x%x]" % (s, hash2))

print '%f%% percent similarity on hash' %(100*(hash1.similarity(hash2)))
print hash1.hamming_distance(hash2),"bits differ out of", hash1.hashbits

[/python]

VIA

传统的hash函数能够将一样的文本生成一样的hash函数，但是，通过simhash方法，能够差不多相同的文档得到的hash函数也比较相近。

Charikar's hash 通过Charikar‘s hash，能够将比较相似度的文档得到比较相近的fingerprint。

该算法的流程如下：

* Document is split into tokens (words for example)

1
2

* Document is split into tokens (words for example)

or super-tokens (word tuples) * Each token is represented by its hash value; a traditional hash function is used * Weights are associated with tokens * A vector V of integers is initialized to 0, length of the vector corresponds to the desired hash size in bits * In a cycle for all token's hash values (h), vector V is updated: o ith element is decreased by token's weight if the ith bit of the hash h is 0, otherwise o ith element is increased by token's weight if the ith bit of the hash h is 1 * Finally, signs of elements of V corresponds to the bits of the final fingerprint 该hash不是将文档总体计算hash值，而是将文档中的每个token计算哈希值，对文档中每个token的hash值，按照位对hash值进行求和，如果当前token的hash值在该位上是0，则减去1，如果在该位上是1，则加上1.将所有的token按照这种方式累加，求的最终的值作为fingerprint。

simhash简介:http://www.cnblogs.com/linecong/archive/2010/08/28/simhash.html

Googlecode simhash算法简介:http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf

Author Info :

From:Python实现的Google simhash算法

URL:https://blog.ihipop.com/2010/10/1754.html

Please Reserve This Link,Thanks!

本作品采用知识共享署名-非商业性使用 3.0 Unported许可协议进行许可。

《Python实现的Google simhash算法》上有2条评论

王晓珂说道：

2017年05月11日 5:28 下午

我自己也实现了一下simhash，发现基本不能用，来测下，你这个感觉也不能用啊，你找到原因了吗？而且你怎么不分词呢？

回复
panda说道：

2011年02月6日 11:56 上午

刚对上面这个算法做了一些调试测试，发现算法抗随机字符干扰差，而且两条很不相像的字符串相似度有时也会很高（如下面的测试结果），不太适合我所研究的领域。。

:新年快乐好消息，见副件 [simhash = 0x7b452ce82fc2468927e08904d80c18995d4c4131859a6bb1be0dc66c3a1acb1f]
Re:232adt股权激励的主要形式3e3 [simhash = 0xa1414dbe70f726d26a68822f98fb5e2ae11bd2adb92fdea6bcb5bb3fe23491cd]
========================================
76.444232% percent similarity on hash
126 bits differ out of 256

回复

Python实现的Google simhash算法

相关阅读

《Python实现的Google simhash算法》上有2条评论

发表回复取消回复

2010 年 10 月
一	二	三	四	五	六	日
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

相关阅读

《Python实现的Google simhash算法》上有2条评论

发表回复 取消回复

发表回复取消回复