{
"metadata": {
"name": "66-fast-nearest-neighbor"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Title: Fast Nearest Neighbor Methods\n",
"Author: Thomas Breuel\n",
"Institution: UniKL"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import cv2"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Fast Nearest Neighbor and Range Search Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fast Nearest Neighbor Methods:\n",
"\n",
"practical:\n",
"\n",
"- FLANN\n",
"\n",
"approaches:\n",
"\n",
"- k-D tree (and R-trees)\n",
"- locality sensitive hashing\n",
"- tree VQ\n",
"- Hamming embedding"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nearest Neighbor vs Range Query:\n",
"\n",
"Database:\n",
"\n",
"- collection of points $\\\\{x_1,...,x_N\\\\}$\n",
"- preprocessing allowed\n",
"\n",
"Range Query:\n",
"\n",
"- given a query $x$ and a range $\\epsilon$, find $\\\\{i | d(x,x_i)\\lt \\epsilon\\\\}$\n",
"\n",
"Nearest Neighbor Query:\n",
"\n",
"- given a query $x$ and a $k$, find the $k$ nearest neighbors of $x$ according to some metric $d$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dimensionality Problem:\n",
"\n",
"As we saw at the beginning, in high dimensions, all distances become very similar to each other _for data with an intrinsic high dimension_.\n",
"\n",
"Finding the exact nearest neighbor reduces to linear search for all known algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Solutions to Dimensionality Problem:\n",
"\n",
"- perform dimensionality reduction, then perform nearest neighbor search in lower-dimensional space\n",
"- use a nearest neighbor search method that takes advantage of low intrinsic dimensionality"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Purpose of Nearest Neighbor:\n",
"\n",
"Almost always, the purpose of $k$-nearest neighbor searches is _classification_.\n",
"\n",
"The classification problem may be implicit or dynamic, but it still relies on the relationship between nonparametric density estimation and nearest neighbors."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Methods for Nearest Neighbor Searches"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linear Search:\n",
"\n",
"- just compute $d(x,x_i)$ for every $x_i$ in the database\n",
"- takes $O(Nd)$ time\n",
"\n",
"possible speedups:\n",
"\n",
"- reduce dimensionality with PCA\n",
"- use $||\\cdot||_1$ instead of $||\\cdot||_2$ (no multiplication)\n",
"- stop accumulation inside $||\\cdot||$ when a value larger than the best distance so far has been found\n",
"- take advantage of sparsity when available"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Range Query with Infinity Metric:\n",
"\n",
"In the $\\infty$-metric, the distance between two points is the maximum distance in any coordinate.\n",
"\n",
"For a range query, we require that every coordinate of a match is within the given $\\epsilon$ of the query point.\n",
"\n",
"We can execute this query coordinate by coordinate, even in a relational database:\n",
"\n",
" select * from t where abs(t.x1-1.7)<0.1 and abs(t.x2-3.9)<0.1 and abs(t.x3-0.2)<1\n",
"\n",
"Internally, this can be executed using inverted sorted indexes and intersections."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"kD-Trees"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"kD-Tree:\n",
"\n",
"construction:\n",
"\n",
"- given the point set, cycle through the axes\n",
"- for each axis, find a splitting point by selecting the median of the points along the axis\n",
"- recurse on each side of the splitting point until a minimum size is reached\n",
"\n",
"retrieval:\n",
"\n",
"- find the leaf containing the query point $x$, find the nearest neighbor $y$, let $r=d(y,x)$ and return that\n",
"- on return from a child, check the sibling node\n",
"- if the node does not overlap the ball $B_r(x)$, return $y,r$\n",
"- if the node does overlap the ball $B_r(x)$, search it for a better neighbor\n",
"- return the best $y,r$ found in both child nodes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"kD-tree complexity:\n",
"\n",
"- asymptotic complexity for nearest neighbor queries is $O(\\log N)$ (great!)\n",
"- asymptotic complexity is only reached for $N >> 2^k$ (not so great)\n",
"\n",
"In practice, kD-trees do not work for searching in high dimensions."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Locality Sensitive Hashing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Locality Sensitive Hashing\n",
"\n",
"Assume a family of hash functions $F$, a distance $d$, a distance threshold $R>0$, and an approximation factor $c>1$.\n",
"\n",
"For every $h\\in F$, and with $p_1\\gt p_2$:\n",
"\n",
"- If $d(x,y)\\lt R ~$ then $P(h(p)=h(q))\\gt p_1$\n",
"- If $d(x,y)\\gt cR ~$ then $P(h(p)=h(q))\\lt p_2$\n",
"\n",
"We call this an $(R,cR,p_1,p_2)$-sensitive family of hash functions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Amplification:\n",
"\n",
"AND-construction:\n",
"\n",
"Given a $(d_1,d_2,p_1,p_2)$-sensitive family of hash functions, choose $k$ random hash functions $h_1,...,h_k$ and define a new hash function $h$ that has to agree on all of them. This gives rise to a $(d_1,d_2,p_1^r,p_2^r)$-sensitive family.\n",
"\n",
"OR-construction:\n",
"\n",
"Given a $(d_1,d_2,p_1,p_2)$-sensitive family of hash functions, choose $k$ random hash functions $h_1,...,h_k$ and define a new hash function $h$ that has to agree on any of them. This gives rise to a $(d_1,d_2,1-(1-p_1)^r,1-(1-p_2)^r)$-sensitive family."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Approximate nearest neighbor search using LSH:\n",
"\n",
"Assume a locality sensitive family $F$ of hash functions, and parameters $k$ and $L$.\n",
"\n",
"During preprocessing:\n",
"\n",
"- construct new hash functions $g$ by concatenating $k$ hash functions from $h$\n",
"- hash each point into $L$ hash tables using $g$\n",
"\n",
"During query:\n",
"\n",
"- try to look up the query point $x$ in each of the $L$ hash tables\n",
"- for each bucket found in a hash table, return the first point for which the distance is less than $cR$\n",
"- if none is found move to the next hash table\n",
"- if all hash tables are exhausted, fail\n",
"\n",
"Complexity:\n",
"\n",
"$\\rho=\\log P_2 / \\log P_1$, assuming constant time hash functions; finds approximate nearest neighbors\n",
"\n",
"- space: $O(N^{1+\\rho})$\n",
"- search time: $O(N^\\rho(k+d))$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which random hash functions?\n",
"\n",
"- bit sampling (sample bits from the coordinates)\n",
"- quantization of coordinates\n",
"- random projections"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"FLANN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hierarchical k-Means\n",
"\n",
"Construction:\n",
"\n",
"- take the original data points, perform $k$-means clustering\n",
"- assign each data point to its cluster\n",
"- recursively repeat on each cluster until a minimum cluster size has been reached\n",
"\n",
"Lookup:\n",
"\n",
"- traverse the tree in a best-bin-first manner\n",
"- maintain a priority queue of branches with closest centers to the query point\n",
"- stop the search after a pre-determined number of nodes have been examined\n",
"\n",
"Notes: \n",
"\n",
"- hierarchical k-Means is usually called tree-VQ (hierarchical VQ is multiple resolutions)\n",
"- smarter exploration on search doesn't seem to help"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = array(randn(1000,10),'f')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"flann = cv2.flann_Index(data,dict(algorithm=1,trees=4))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 14
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"index,dist = flann.knnSearch(data,5,params={})"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print index[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 0 638 899 940 525]\n",
" [ 1 944 749 230 497]\n",
" [ 2 430 409 933 928]\n",
" [ 3 79 248 788 951]\n",
" [ 4 103 58 507 480]\n",
" [ 5 829 696 150 955]\n",
" [ 6 588 981 990 28]\n",
" [ 7 361 780 722 107]\n",
" [ 8 722 707 666 685]\n",
" [ 9 287 163 125 272]]\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print dist[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 0. 2.46911836 2.62438083 3.03617215 3.18475151]\n",
" [ 0. 4.27013683 4.83362865 6.21951389 6.43545055]\n",
" [ 0. 4.56065941 4.59481716 8.74168491 9.18355083]\n",
" [ 0. 5.56775475 6.28863955 6.48139763 6.63649178]\n",
" [ 0. 4.99117756 5.16120672 5.90600395 6.00450516]\n",
" [ 0. 4.00080204 5.02649212 5.42283583 5.60694599]\n",
" [ 0. 4.98363495 5.5444622 5.70466614 6.22260761]\n",
" [ 0. 9.51819229 11.93401909 12.31844997 12.35668182]\n",
" [ 0. 4.21605206 4.53588057 4.59677792 4.67584229]\n",
" [ 0. 4.1922102 4.80881023 4.81959963 5.81781578]]\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}