Academia.eduAcademia.edu
The International Journal of Robotics Research http://ijr.sagepub.com FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance Mark Cummins and Paul Newman The International Journal of Robotics Research 2008; 27; 647 DOI: 10.1177/0278364908090961 The online version of this article can be found at: http://ijr.sagepub.com/cgi/content/abstract/27/6/647 Published by: http://www.sagepublications.com On behalf of: Multimedia Archives Additional services and information for The International Journal of Robotics Research can be found at: Email Alerts: http://ijr.sagepub.com/cgi/alerts Subscriptions: http://ijr.sagepub.com/subscriptions Reprints: http://www.sagepub.com/journalsReprints.nav Permissions: http://www.sagepub.com/journalsPermissions.nav Citations (this article cites 11 articles hosted on the SAGE Journals Online and HighWire Press platforms): http://ijr.sagepub.com/cgi/content/refs/27/6/647 Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Mark Cummins Paul Newman Mobile Robotics Group, University of Oxford, UK {mjc,pnewman}@robots.ox.ac.uk Abstract This paper describes a probabilistic approach to the problem of recognizing places based on their appearance. The system we present is not limited to localization, but can determine that a new observation comes from a previously unseen place, and so augment its map. Effectively this is a SLAM system in the space of appearance. Our probabilistic approach allows us to explicitly account for perceptual aliasing in the environment—identical but indistinctive observations receive a low probability of having come from the same place. We achieve this by learning a generative model of place appearance. By partitioning the learning problem into two parts, new place models can be learned online from only a single observation of a place. The algorithm complexity is linear in the number of places in the map, and is particularly suitable for online loop closure detection in mobile robotics. KEY WORDS—place recognition, topological SLAM, appearance based navigation 1. Introduction This paper considers the problem of recognizing locations based on their appearance. This problem has recently received attention in the context of large-scale global localization (Schindler et al. 2007) and loop closure detection in mobile robotics (Ho and Newman 2007). This paper advances the state of the art by developing a principled probabilistic framework for appearance-based place recognition. Unlike many existing systems, our approach is not limited to localization—we are able to decide if new observations originate from places already in the map, or rather from new, previously unseen places. This is possible even in environments where many places have The International Journal of Robotics Research Vol. 27, No. 6, June 2008, pp. 647–665 DOI: 10.1177/0278364908090961 c SAGE Publications 2008 Los Angeles, London, New Delhi and Singapore 1 Figures 5, 11, 13, 14, 18, 22 appear in color online: http://ijr.sagepub.com FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance similar sensory appearance (a problem known as perceptual aliasing). Previous solutions to this problem (Ho and Newman 2007) had complexity cubic in the number of observations, whereas the algorithm presented here has linear complexity, extending its applicability to much larger environments. 
We refer to the new algorithm as FAB-MAP, for fast appearancebased mapping. Our basic approach is inspired by the bag-of-words image retrieval systems developed in the computer vision community (Sivic and Zisserman 20031 Nister and Stewenius 2006). However, we extend the approach by learning a generative model for the bag-of-words data. This generative model captures the fact that certain combinations of appearance words tend to cooccur, because they are generated by common objects in the environment. Our system effectively learns a model of these common objects (without supervision, see Figure 5), which allows us to improve inference by reasoning about which sets of features are likely to appear or disappear together (for example, see Figure 12). We present results which demonstrate that the system is capable of recognizing places even when observations have few features in common (as low as 8%), while rejecting false matches due to perceptual aliasing even when such observations share many features (as much as 45%). The approach also has reasonable computational cost—fast enough for online loop closure detection in realistic mobile robotics problems where the map contains several thousand places. We demonstrate our system by detecting loop closures over a 2 km path length in an initially unknown outdoor environment, where the system detects a large fraction of the loop closures without false positives. 2. Related Work Due to the difficulty of detecting loop closure in metric SLAM algorithms, several appearance-based approaches to this task have recently been described (Levin and Szeliski 20041 Se et al. 20051 Newman et al. 20061 Ranganathan and Dellaert 2006). These methods attempt to determine when the robot 647 Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. 648 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 is revisiting a previously mapped area on the basis of sensory similarity, which can be calculated independently of the robot’s estimated metric position. Thus similarity-based approaches may be robust even in situations where the robot’s metric position estimate is in gross error. Many issues relevant to an appearance-based approach to SLAM have previously been examined in the context of appearance-based localization systems. In particular, there has been extensive examination of how best to represent and match sensory appearance information. Several authors have described systems that apply dimensionality reduction techniques to process incoming sensory data. Early examples include representing appearance as a set of image histograms (Ulrich and Nourbakhsh 2000), and as ordered sequences of edge and color features (Lamon et al. 2001). More recently Torralba et al. (2003) represent places by a set of texture features, and describe a system where localization is cast as estimating the state of a hidden Markov model. Kröse et al. (2001) use principal components analysis to reduce the dimensionality of the incoming images. Places are then represented as a Gaussian density in the low-dimensional space, which enables principled probabilistic localization. Ramos et al. (2005) employ a dimensionality reduction technique combined with variational Bayes learning to find a generative model for each place. 
(A drawback of both of these approaches is that they require significant supervised training phases to learn their generative place models—an issue which we will address in this paper.) Bowling et al. (2005) describe an unsupervised approach that uses a sophisticated dimensionality reduction technique called action respecting embedding1 however, the method suffers from high computational cost (Bowling et al. 2006) and yields localization results in a “subjective” representation which is not straightforward to interpret. The methods described so far represent appearance using global features, where a single descriptor is computed for an entire image. Such global features are not very robust to effects such as variable lighting, perspective change, and dynamic objects that cause portions of a scene to change from visit to visit. Work in the computer vision community has led to the development of local features which are robust to transformations such as scale, rotation, and some lighting change, and in aggregate allow object recognition even in the face of partial occlusion of the scene. Local feature schemes consist of a region-of-interest detector combined with a descriptor of the local region, SIFT (Lowe 1999) being a popular example. Many recent appearance-based techniques represent sensory data by sets of local features. An early example of the approach was described by Sim and Dudek (1998). More recently, Wolf et al. (2005) used an image retrieval system based on invariant features as the basis of a Monte Carlo localization scheme. The issue of selecting the most salient local features for localization was considered in Li and Kosecka (2006). Wang et al. (2005) employed the idea of a visual vocabulary built upon invariant features, taking inspiration from work in the com- puter vision community (Squire et al. 20001 Sivic and Zisserman 2003). The visual vocabulary model treats an image as a “bag of words” much like a text document, where a “word” corresponds to a region in the space of invariant descriptors. While the bag-of-words model discards geometric information about the image, it enables fast visual search through the application of methods developed for text retrieval. Several authors have proposed extensions to this basic approach. Wang et al. use the geometric information in a post-verification step to confirm putative matches. Filliat (2007) described a system where the visual vocabulary is learned online. Recent work by Ferreira et al. (2006) employs a Bernoulli mixture model to capture the conditional dependencies between words in the vocabulary. Most recently, Schindler et al. (2007) have described how to tailor visual vocabulary generation so as to yield more discriminative visual words, and discuss the application of the technique to city-scale localization with a database of 10 million images. The work discussed so far has been mainly concerned with the localization problem, where the map is known a priori and the current sensory view is guaranteed to come from somewhere within the map. To use a similar appearance-based approach to detect loop closure in SLAM, the system must be able to deal with the possibility that the current view comes from a previously unvisited place and has no match within the map. As pointed out in Gutmann and Konolige (1999), this makes the problem considerably more difficult. Particularly challenging is the perceptual aliasing problem—the fact that different parts of the workspace may appear the same to the robot’s sensors. 
As noted in Silpa-Anan and Hartley (2004), this can be a problem even when using rich sensors such as cameras due to repetitive features in the environment, for example mass-produced objects or repeating architectural elements. Gutmann and Konolige (1999) tackle the problem by computing a similarity score between a patch of the map around the current pose and older areas of the map to identify possible loop closures. They then employ a set of heuristics to decide if the putative match is a genuine loop closure or not. Chen and Wang (2006) tackle the problem in a topological framework, using a similarity measure similar to that developed by Kröse et al. (2001). To detect loop closure they integrate information from a sequence of observations to reduce the effects of perceptual aliasing, and employ a heuristic to determine if a putative match is a loop closure. While these methods achieved some success, they did not provide a satisfactory solution to the perceptual aliasing problem. Goedemé (2006) described an approach to this issue using Dempster–Shafer theory, using sequences of observations to confirm or reject putative loop closures. The approach involved separate mapping and localization phases, so was not suitable for unsupervised mobile robotics applications. Methods based on similarity matrices have recently become a popular way to extend appearance-based matching beyond pure localization tasks. These methods define a similarity Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance score between observations, then compute the pairwise similarity between all observations to yield a square similarity matrix. If the robot observes frequently as it moves through the environment and observations are ordered by time, then loop closures appear as off-diagonal stripes in this matrix. Levin and Szeliski (2004) describe such a method that uses a cascade of similarity functions based on global color histograms, image moments, and epipolar geometry. Silpa-Anan and Hartley (2004, 2005) describe a similar system which employs SIFT features. Zivkovic et al. (2005) consider the case where the order in which the images were collected is unknown, and use graph cuts to identify clusters of related images. A series of papers by Ho and Newman (2005b,a)1 Newman and Ho (2005)1 Newman et al. (2006) advance similarity-matrix-based techniques by considering the problem of perceptual aliasing. Their approach is based on a singular value decomposition of the similarity matrix that eliminates the effect of repetitive structure. They also address the issue of deciding if a putative match is indeed a loop closure based on examining an extreme value distribution (Gumbel 1958) related to the similarity matrix. This paper will describe a new appearance-based technique that improves on these existing results, particularly by addressing perceptual aliasing in a probabilistic framework. The core of our approach we have previously described in Cummins and Newman (2007)1 here we expand on that presentation and present a detailed evaluation of the system’s performance. 3. Approximating High-dimensional Discrete Distributions This section introduces some background material that forms a core part of our system. Readers familiar with Chow Liu trees may wish to skip to Section 4. 
Consider a distribution P1Z 2 on n discrete variables, Z 2 3z 1 3 z 2 3 4 4 4 3 z n 4. We wish to learn the parameters of this distribution from a set of samples. If P1Z2 is a general distribution without special structure, the space needed to represent the distribution is exponential in n—as n increases dealing with such distributions quickly becomes intractable. A solution to this problem is to approximate P1Z 2 by another distribution Q1Z2 possessing some special structure that makes it tractable to work with, and that is in some sense similar to the original distribution. The natural similarity metric in this context is the Kullback–Leibler divergence, defined as DKL 1P3 Q2 2 1 x5X P1x2 log P1x2 3 Q1x2 (1) where the summation is carried out over all possible states in the distribution. The KL divergence is zero when the distributions are identical, and strictly larger otherwise. 649 The naive Bayes approximation is an example of a structural constraint that allows for tractable learning of Q1Z 2, under which Q1Z2 is restricted such that each variable must be independent of all others. This extreme structural constraint limits the degree to which Q1Z 2 can represent P1Z2. A less severe constraint is to require each variable in Q1Z2 to be conditioned on at most one other variable—this restricts the graphical model of Q1Z 2 to be a tree. Unlike the naive Bayes case, where there is only one possible graphical model, in this case there are n n62 possible tree-structured distributions that satisfy the structural constraint. Chow and Liu (1968) described a polynomial time algorithm to select the best such distribution, which we elaborate on below. More generally, we could constrain Q1Z 2 such that each variable is conditioned on at most n other variables. This structure is known as a thin junction tree. It was shown by Srebro (2001) that the problem of determining the optimal thin junction tree is NP-hard, though there are several methods to find a good approximate solution (for example, Bach and Jordan (2002)). 3.1. Chow Liu Trees The Chow Liu algorithm approximates a discrete distribution P1Z 2 by the closest tree-structured Bayesian network Q1Z2opt , in the sense of minimizing the Kullback–Leibler divergence. The structure of Q1Z2opt is determined by considering a mutual information graph 1. For a distribution over n variables, 1 is the complete graph with n nodes and 1n1n 6 12252 edges, where each edge 1z i 3 z j 2 has weight equal to the mutual information I 1z i 3 z j 2 between variable i and j, I 1z i 3 z j 2 2 1 zi 56 3z j 56 p1z i 3 z j 2 log p1z i 3 z j 2 3 p1z i 2 p1z j 2 (2) and the summation is carried out over all possible states of z. Mutual information measures the degree to which knowledge of the value of one variable predicts the value of another. It is zero if two variables are independent, and strictly larger otherwise. Chow and Liu prove that the maximum-weight spanning tree (Cormen et al. 2001) of the mutual information graph will have the same structure as Q1Z 2opt (see Figure 1). Intuitively, the dependencies between variables which are omitted from Q1Z26opt have as little mutual information as possible, and so are the best ones to approximate as independent. Chow and Liu further prove that the conditional probabilities p1z i 7z j 2 for each edge in Q1Z2opt are the same as the conditional probabilities in P1Z 2—thus maximum likelihood estimates can be obtained directly from co-occurrence frequency in training data. 
To mitigate potential problems due to limited training data, we use the pseudo-Bayesian p8 estimator described in Bishop et al. (1977), rather than the maximum likelihood estimator. This Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. 650 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 Fig. 1. (a) Graphical model of the underlying distribution P1Z2. Mutual information between variables is shown by the thickness of the edge. (b) Naive Bayes approximation. (c) Chow Liu tree. prevents any probabilities from having unrealistic values of 0 or 1. In the following section we will use Chow Liu trees to approximate large discrete distributions. Because we are interested in learning distributions over a large number of variables (7103 000) the mutual information graph to be computed is very large, typically too large to be stored in RAM. To deal with this, we use a semi-external spanning tree algorithm (Dementiev et al. 2004). The mutual information graph is required only temporarily at learning time—it is discarded once the maximal spanning tree has been computed. Meil2a (1999) has described a more efficient method of learning Chow Liu trees that avoids the explicit computation of the full mutual information graph, and provides significant speed-up for “sparse data” such as ours. However, as we need to compute Chow Liu trees only infrequently and offline, we have not yet implemented this approach. We have chosen to use Chow Liu trees because they are tractable to compute even for very high-dimensional distributions, are guaranteed to be the optimal approximation within their model class, and require only first-order conditional probabilities, which can be reliably estimated from available training data. However, our approach could easily substitute a more complex model such as a mixture of trees (Meil2a-Predoviciu 1999) or a thin junction tree of wider tree width (Bach and Jordan 2002), should this prove beneficial. 4. Probabilistic Navigation using Appearance We now describe the construction of a probabilistic framework for appearance-based navigation. In overview, the world is modeled as a set of discrete locations, each location being described by a distribution over appearance words. Incoming sensory data is converted into a bag-of-words representation1 for each location, we can then ask how likely it is that the observation came from that location’s distribution. We also find an expression for the probability that the observation came from a place not in the map. This yields a probability density function (PDF) over location, which we can use to make a data association decision and update our belief about the appearance of each place. Essentially, this is a SLAM algorithm in a discrete world. The method is outlined in detail in the following sections. 4.1 Representing Appearance We adopt a “bag-of-words” representation of raw sensor data (Sivic and Zisserman 2003), where scenes are represented as a collection of attributes (words) chosen from a set (vocabulary) of size 787 . An observation of local scene appearance, visual or otherwise, captured at time k is denoted as Z k 2 3z 1 3 4 4 4 3 z 787 4, where z i is a binary variable indicating the presence (or absence) of the ith word of the vocabulary. Furthermore 2 k is used to denote the set of all observations up to time k. 
The results in this paper employ binary features derived from imagery, based on quantized Speeded Up Robust Features (SURF) descriptors (see Section 5)1 however, binary features from any sensor or combination of sensors could be used. For example, laser or sonar data could be represented using quantized versions of the features proposed in Johnson (1997)1 Frome et al. (2004)1 Mozos et al. (2005). 4.2. Representing Locations At time k, our map of the environment is a collection of n k discrete and disjoint locations 3k 2 3L 1 3 4 4 4 3 L nk 4. Each of these locations has an associated appearance model. Rather than model each of the locations directly in terms of which features are likely to be observed there (i.e. as p1Z7L j 2), here we introduce an extra hidden variable e. The variable ei is the event that an object which generates observations of type z i exists. We model each location as a set 3p1e1 2 17L i 23 4 4 4 3 p1e787 2 17L i 24, where each of the feature generating objects ei are generated independently by the location. A detector model relates feature existence ei to feature detection zi . The detector is specified by 2 3 p1zi 2 17ei 2 023 false positive probability3 (3) 4: 4 p1zi 2 07ei 2 123 false negative probability4 Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance 651 current and past observations, conditioned on the location. The simplified observation likelihood p1Z k 7L i 2 can then be expanded as p1Z k 7L i 2 2 p1z n 7z 1 3 z 2 3 4 4 4 3 z n61 3 L i 2 p1z n61 7z 1 3 z 2 3 4 4 4 3 z n62 3 L i 2 9 9 9 p1z 2 7z 1 3 L i 2 p1z 1 7L i 24 Fig. 2. Factor graph of our generative model. Observed variables are shaded, latent variables unshaded. The environment generates locations L i . Locations independently generate words e j . Words generate observations z k , which are interdependent. (5) This expression cannot be evaluated directly because of the intractability of learning the high-order conditional dependencies between appearance words. The simplest approximation is to make a naive Bayes assumption, yielding p1Z k 7L i 2 p1z n 7L i 2 9 9 9 p1z 2 7L i 2 p1z 1 7L i 24 Each factor can be further expanded as 1 p1z j 7L i 2 2 p1z j 7e j 2 s3 L i 2 p1e j 2 s7L i 24 (6) (7) s530314 The reason for introducing this extra hidden variable e is twofold. Firstly, it provides a natural framework for incorporating data from multiple sensors with different (and possibly time-varying) error characteristics. Secondly, as outlined in the next section, it facilitates factoring the distribution p1Z 7L j 2 into two parts—a simple model for each location composed of independent variables ei , and a more complex model that captures the correlations between detections of appearance words p1z i 7Z k 2. Because the location model has a simple structure, it can be estimated online from the few available observations (combined with a prior). The more complex model that captures the correlations between observations is learned offline from abundant training data, and can be combined with the simple location models on the assumption that the conditional dependencies between appearance words are independent of location. We describe how this is achieved in the following section. The generative model is illustrated in Figure 2. 
Applying the independence assumptions in our model ( p1z j 7e j 3 L i 2 2 p1z j 7e j 2, i.e. detector errors are independent of position in the world) this becomes 1 p1z j 7e j 2 s2 p1e j 2 s7L i 23 (8) p1z j 7L i 2 2 s530314 which can now be evaluated directly. Here p1z j 7e j 2 s2 are the components of the detector model specified in Equation (3) and p1e j 2 s7L i 2 is a component of the place appearance model. While a naive Bayes approach yields reasonable results (see Section 5), substantially better results can be obtained by instead employing a Chow Liu approximation to capture more of the dependencies between appearance words. Using a Chow Liu assumption, the appearance likelihood becomes p1Z k 7L i 2 p1zr 7L i 2 787 5 p1z q 7z pq 3 L i 23 (9) q22 4.3. Estimating Location via Recursive Bayes k Calculating p1L i 72 2 can be formulated as a recursive Bayes estimation problem: p1L i 72 k 2 2 p1Z k 7L i 3 2 k61 2 p1L i 72 k61 2 4 p1Z k 72 k61 2 where zr is the root of the tree and z pq is the parent of z q in the tree. The observation factors in Equation (9) can be further expanded as p1z q 7z pq 3 L i 2 (4) Here p1L i 72 k61 2 is our prior belief about our location, p1Z k 7L i 3 2 k61 2 is the observation likelihood, and p1Z k 72 k61 2 is a normalizing term. We discuss the evaluation of each of these terms below. 1 2 p1z q 7eq 2 seq 3 z pq 3 L i 2 p1eq 2 seq 7z pq 3 L i 24 (10) seq 530314 Applying our independence assumptions and making the approximation that p1e j 2 is independent of z i for all i 2 j, the expression becomes p1z q 7z pq 3 L i 2 4.3.1. Observation Likelihood To simplify evaluation of the observation likelihood p1Z k 7L i 3 2 k61 2, we first assume independence between the 2 1 p1z q 7eq 2 seq 3 z pq 2 p1eq 2 seq 7L i 24 seq 530314 Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. (11) 652 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 The term p1z q 7eq 3 z pq 2 can be expanded (see Appendix B) as p1zq 2 szq 7eq 2 seq 3 z p 2 sz p 2 2 6 1 9 761 3 (12) where szq 3 seq 3 sz p 5 303 14 and 9 2 p1z q 2 szq 2 p1z q 2 szq 7eq 2 seq 2 p1z q 2 szq 7z p 2 sz p 23 2 (13) p1z q 2 szq 2 p1z q 2 szq 7eq 2 seq 2 8 9 8 9 prior detector model p1z q 2 szq 7z p 2 sz p 24 8 9 (14) conditional from training where sz denotes the opposite state to sz . 9 and are now expressed entirely in terms of quantities which can be estimated from training data (as indicated by the under-braces). Hence p1Z k 7L i 2 can now be computed directly. 4.3.2. New Place or Old Place? We now turn our attention to the denominator of Equation (4), p1Z k 72 k61 2. For pure localization we could compute a PDF over location simply by normalizing the observation likelihoods computed as described in the previous section. However, we would like to be able to deal with the possibility that a new observation comes from a previously unknown location, which requires an explicit calculation of p1Z k 72 k61 2. If we divide the world into the set of mapped places M and the unmapped places M, then 1 p1Z k 72 k61 2 2 p1Z k 7L m 2 p1L m 72 k61 2 m5M 1 probability that we are at an unknown place, which can be obtained from our previous position estimate and a motion model (discussed in the next section). An alternative to the mean field approach is to approximate the second summation via sampling. 
The procedure is to sample location models L u according to the distribution by which they are generated by the environment, and to evaluate k61 2 for the sampled location models. u5M p1Z k 7L u 2 p1L u 72 In order to do this, some method of sampling location models L u is required. Here we make an approximation for ease of implementation—we instead sample an observation Z and use the sampled observation to create a place model. The reason for doing this is that sampling an observation Z is extremely easy—our sampler simply consists of selecting at random from a large collection of observations. For our robot which navigates in outdoor urban environments using imagery, large collections of street-side imagery for sampling are readily available—for example from previous runs of the robot, or from such sources as Google Street View1 . In general, this sampling procedure will not create location models according to their true distribution because models may have been created from multiple observations of the location. However, it will be a good approximation when the robot is exploring a new environment, where most location models will have only a single observation associated with them. In practice, we have found that it works very well, and it has the advantage of being very easy to implement. Having sampled a location model, we must evaluate p1Z k 7L u 2 p1L u 72 k61 2 for the sample. Calculating p1Z k 7L u 2 has already been described1 however, we currently have no method of evaluating p1L u 72 k61 2, the prior probability of our sampled place model with respect to our history of observations, so we assume it to be uniform over the samples. Equation (15) thus becomes 1 p1Z k 7L m 2 p1L m 72 k61 2 p1Z k 72 k61 2 m5M p1Z k 7L u 2 p1L u 72 k61 23 (15) p1L new 72 k61 2 u5M where we have applied our assumption that observations are conditionally independent given location. The second summation cannot be evaluated directly because it involves all possible unknown places. We have compared two different approximations to this term. The first is a mean field approximation (Jordan et al. 1999): 1 p1Z k 7L m 2 p1L m 72 k61 2 p1Z k 72 k61 2 m5M p1Z k 7L avg 2 1 p1L u 72 k61 23 (16) u5M where L avg is the “average place” where the ei values are set to their marginal probability, and u5M p1L u 72 k61 2 is the prior ns 1 p1Z k 7L u 2 3 (17) ns u21 where n s is the number of samples used, and p1L new 72 k61 2 is our prior probability of being at a new place. 4.3.3. Location Prior The last term to discuss is the location prior p1L i 72 k61 2. The most straightforward way to obtain a prior is to transform the previous position estimate via a motion model, and use this as our prior for location at the next time step. While at present our system is not explicitly building a topological map, for the 1. See http://maps.google.com/help/maps/streetview/. Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance purpose of calculating a prior we assume that sequentially collected observations come from adjacent places, so that, if the robot is at place i at time t, it is likely to be at one of the places 3i 6 13 i3 i 14 at time t 1, with equal probability. For places with unknown neighbors (e.g. 
the next place after the most recently collected observation, where place i 1 may be a new place not already in the map), part of the probability mass is assigned to a “new place” node and the remainder is spread evenly over all places in the map. The split is governed by the probability that a topological link with unknown endpoint leads to a new place, which is a user-defined parameter in our system. Clearly this prior could be improved via a better topology estimate or through the use of some odometry information1 however, we have found that the effect of the prior is relatively weak, so these issues are not crucial to performance. If the assumption that sequential observations come from adjacent places is not valid, the prior can be simply left uniform. While this results in some increase in the number of false matches, performance is largely unaffected. Alternatively, if we have an estimate for the topology of the environment, but no estimate of our position within this topology, we could obtain a prior as the limit of a random walk on the topology. This can be obtained efficiently as the dominant eigenvector of the normalized link matrix of the topology (Page et al. 1998). 4.3.4. Smoothing We have found that the performance of the inference procedure outlined above is strongly dependent on an accurate estimation of the denominator term in Equation (4), p1Z k 72 k61 2. Ideally we would like to perform the Monte Carlo integration in Equation (17) over a large set of place models L u that fully capture the visual variety of the environment. In practice, we are limited by running time and available data. The consequence is that occasionally we will incorrectly assign two similar images a high probability of having come from the same place, when in fact the similarity is due to perceptual aliasing. An example of this is illustrated in Figure 13, where two images from different places, both of which contain iron railings, are incorrectly assigned a high probability of having come from the same place. The reason for this mistake is that the railings generate a large number of features, and there are no examples of railings in the sampling set used to calculate p1Z k 72 k61 2, so the system cannot know that these sets of features are correlated. In practice, due to limited data and computation time, there will always be some aspects of repetitive environmental structure that we fail to capture. To ameliorate this problem, we apply a smoothing operation to our likelihood estimates: p1Z k 7L i 2 6 p1Z k 7L i 2 11 6 2 3 nk (18) 653 where n k is the number of places in the map and is the smoothing parameter, which for our experiments was set to 0499. The effect of this very slight smoothing of the data likelihood values is to prevent the system asserting loop closure with high confidence on the basis of a single similar image pair. Smoothing the likelihood values effectively boosts the importance of the prior term p1L i 72 k61 2, so that high probability of loop closure can now only be achieved by accumulating evidence from a sequence of matched observations. While, in principle, mistakes due to our approximate calculation of p1Z k 72 k61 2 can still occur, in practice we have found that this modification successfully rejects almost all outliers, while still allowing true loop closures to achieve high probability after a sequence of only two or three corresponding images. 4.3.5. Updating Place Models We have now outlined how to compute a PDF over location given a new observation. 
The final issue to address is data association and the updating of place appearance models. At present we simply make a maximum likelihood data association decision after each observation. Clearly there is room for improvement here, for example by maintaining a PDF over topologies using a particle filter, as described by Ranganathan and Dellaert (2006). After taking a data association decision, we can update the relevant place appearance model, which consists of the set 3p1e1 2 17L j 23 4 4 4 3 p1e787 2 17L j 24. When a new place is created, its appearance model is initialized so that all words exist with marginal probability p1ei 2 12 derived from the training data. Given an observation that relates to the place, each component of the appearance model can be updated according to p1ei 2 17L j 3 2 k 2 2 p1Z k 7ei 2 13 L j 2 p1ei 2 17L j 3 2 k61 2 3 (19) p1Z k 7L j 2 where we have applied our assumption that observations are conditionally independent given location. 4.4. Summary of Approximations and Parameters We briefly recap the approximations and user-defined terms in our probabilistic scheme. For tractability, we have made a number of independence assumptions: 1. Sets of observations are conditionally independent given position: p1Z k 7L i 3 2 k61 2 2 p1Z k 7L i 2. 2. Detector behavior is independent p1z j 7e j 3 L i 2 2 p1z j 7e j 2. Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. of position: 654 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 3. Location models are generated independently by the environment. In addition to these independence assumptions, we also make an approximation for tractability: 1. Observations of one feature do not inform us about p1e j 7z jk 2. the existence of other features: p1e j 7Z k 2 While more complex inference is possible here, in effect this approximation only means that we will be somewhat less certain about feature existence than if we made full use of the available data. 4.4.1. Input Parameters The only user-specified inputs to the algorithm are the detector model, p1z i 2 17ei 2 02 and p1z i 2 07ei 2 12 (two scalars), the smoothing parameter , and a term that sets the prior probability that a topological link with an unknown endpoint leads to a new place. Of these, the algorithm is particularly sensitive only to the detector model, which can be determined from a calibration discussed in the following section. 5. Evaluation We tested the described algorithm using imagery from a mobile robot. Each image that the robot collects is passed into a processing pipeline that produces a bag-of-words representation, which is then the input to the algorithm described. There are many possible ways to convert input sensory data into a bag-of-words representation. Our system uses the SURF detector/descriptor (Bay et al. 2006) to extract regions of interest from the images, and compute 128D non-rotation-invariant descriptors for these regions. Finally, we map regions of descriptor space to visual words as suggested by Sivic and Zisserman (2003). This is achieved by clustering all the descriptors from a set of training images using a simple incremental clustering procedure, then quantizing each descriptor in the test images to its approximate nearest cluster center using a kd-tree. 
We have not focused on this bag-of-words generation stage of the system—more sophisticated approaches exist (Nister and Stewenius 20061 Schindler et al. 2007) and there are probably gains to be realized here. 5.1. Building the Vocabulary Model The next step is to construct a Chow Liu tree to capture the co-occurrence statistics of the visual words. To do this, we construct the mutual information graph as described in Section 3.1. Each node in the graph corresponds to a visual word, and the edge weights (mutual information) between node i and j are calculated as per Equation (2)—essentially this amounts to counting the number of images in which word i and j cooccur. The Chow Liu tree is then the maximum weight spanning tree of the graph. If the Chow Liu tree is to be a good approximation to the true distribution, it must be computed from a large number of observations. To prevent bias, these observations should be independent samples from the observation distribution. We collected 2,800 images from 28 km of urban streets using the robot’s camera. The images were taken 10 m apart, perpendicular to the robot’s motion, so that they are non-overlapping and approximate independent samples from the observation distribution. From this data set we computed the clustering of SURF features that form our vocabulary, then compute the Chow Liu tree for the vocabulary. The clustering procedure generated a vocabulary of approximately 11,000 words. The combined process took 2 hours on a 3 GHz Pentium IV. This is a one-off computation that occurs offline. Illustrative results are shown in Figures 3, 4, and 5, which show sample visual words learned by the system and a visualization of a portion of the Chow Liu tree. These words (which correspond to parts of windows) were determined to have high pairwise mutual information, and so are neighbors in the tree. The joint probability of observing the five words shown in the tree is 4,778 times higher under the Chow Liu model than under a naive Bayes assumption. Effectively the system is learning about the presence of objects that generate the visual words. This significantly improves inference (see Figure 12). 5.2. Testing Conditions We used this vocabulary to navigate using imagery collected by a mobile robot. We modeled our detector by p1zi 2 07ei 2 12 2 0439 and p1zi 2 17ei 2 02 2 0, i.e. a per-word false negative rate of 39%, and no false positives2 . In principle different detector terms could be learned for each word1 for example, words generated by cars might be less reliably reobservable than words generated by buildings. At present we assume all detector terms are the same. The number of samples for determining the denominator term p1Z k 72 k61 2 was fixed at 2,800—the set of images used to learn the vocabulary. The smoothing parameter was 0.99, and the prior that a new topological link leads to a new place was 0.9. To evaluate the effect of the various alternative approximations discussed in Section 4, several sets of results were generated, comparing naive Bayes versus Chow Liu approximations to p1Z 7L i 2 and mean field versus Monte Carlo approximations to p1Z k 72 k61 2. 2. To set these parameters, we 3rst assumed a false positive rate of zero. The false negative rate could then be determined by collecting multiple images at a test location1 p1zi 2 07ei 2 12 then comes directly from the ratio of the number of words detected in any one image to the number of unique words detected in union of all the images from that location. 
Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance 655 Fig. 3. A sample word in the vocabulary, showing typical image patches and an example of the interest points in context. Interest points quantized to this word typically correspond to the top-left corner of windows. The most correlated word in the vocabulary is shown in Figure 4. Fig. 4. A sample word in the vocabulary, showing typical image patches and an example of the interest points in context. Interest points quantized to this word typically correspond to the cross-piece of windows. The most correlated word in the vocabulary is shown in Figure 3. 5.3. Results We tested the system on two outdoor urban data sets collected by our mobile robot. Both data sets are included in Extension 3. As the robot travels through the environment, it collects images to the left and right of its trajectory approximately every 1.5 m. Each collected image is processed by our algorithm and is used either to initialize a new place, or, if loop closure is detected, to update an existing place model. The area of our test data sets did not overlap with the region where the training data was collected. The first data set— New College—was chosen to test the system’s robustness to perceptual aliasing. It features several large areas of strong vi- Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. 656 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 Fig. 5. Visualization of a section the Chow Liu tree computed for our urban vocabulary. Each word in the tree is represented by a typical example. Clockwise from top, the words correspond to the cross-pieces of window panes, right corners of window stills, top-right corners of window panes, bottom-right corners of window panes, and top-left corners of window panes. Under the Chow Liu model the joint probability of observing these words together is 4,778 times higher than under the naive Bayes assumption. sual repetition, including a medieval cloister with identical repeating archways and a garden area with a long stretch of uniform stone wall and bushes. The second data set—labeled City Center—was chosen to test matching ability in the presence of scene change. It was collected along public roads near the city center, and features many dynamic objects such as traffic and Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance 657 Fig. 6. Appearance-based matching results for the City Center data set overlaid on an aerial photograph. The robot travels twice around a loop with total path length of 2 km, collecting 2,474 images. Positions (from hand-corrected GPS) at which the robot collected an image are marked with a yellow dot. Two images that were assigned a probability p  0499 of having come from the same location (on the basis of appearance alone) are marked in red and joined with a line. There are no incorrect matches that meet this probability threshold. This result is best viewed as a video (Extension 2). pedestrians. 
Additionally, it was collected on a windy day with bright sunshine, which makes the abundant foliage and shadow features unstable. Figures 6 and 7 show navigation results overlaid on an aerial photo. These results were generated using the Chow Liu and Monte Carlo approximations, which we found to give the best performance. The system correctly identifies a large proportion of possible loop closures with high confidence. There are no false positives that meet the probability threshold. See also Extensions 1 and 2 for videos of these results. Precision recall curves are shown in Figure 8. The curves were generated by varying the probability at which a loop closure was accepted. Ground truth was labeled by hand. We are primarily interested in the recall rate at 100% precision—if the system were to be used to complement a metric SLAM algorithm, an incorrect loop closure could cause unrecoverable mapping failure. At 100% precision, the system achieves 48% recall on the New College data set, and 37% on the more challenging City Center data set. “Recall” here is the proportion of possible image-to-image matches that exceed the probability Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. 658 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 Fig. 7. Appearance-based matching results for the New College data set overlaid on an aerial photograph. The robot traverses a complex trajectory of 1.9 km with multiple loop closures. 2,146 images were collected. Positions (from hand-corrected GPS) at which the robot collected an image are marked with a yellow dot. Two images that were assigned a probability p  0499 of having come from the same location (on the basis of appearance alone) are marked in red and joined with a line. There are no incorrect matches that meet this probability threshold. This result is best viewed as a video (Extension 1). Fig. 8. Precision–recall curves for the City Center and New College data sets. Note the scale. Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance 659 Fig. 9. Some examples of remarkably similar-looking images from different parts of the workspace that were correctly assigned a low probability of having come from the same place. The result is possible because most of the probability mass is captured by locations in the sampling set—effectively the system has learned that images like these are common. Of course, had these examples been genuine loop closures they might also have received a low probability. We would argue that this is correct behavior, modulo the fact that the probabilities in (a) and (b) are too low. The very low probabilities in (a) and (b) are due to the fact that good matches for the query images are found in the sampling set, capturing almost all the probability mass. This is less likely in the case of a true but ambiguous loop closure. Words common to both images are shown in green, others in red. (Common words are shown in blue in (b) for better contrast.) The probability that the two images come from the same place is indicated between the pairs. threshold. 
As a typical loop closure is composed of multiple images, even a recall rate of 37% is sufficient to detect almost all loop closures. Some examples of typical image matching results are presented in Figures 9 and 10. Figure 9 highlights robustness to perceptual aliasing. Here very similar images that originate from different locations are correctly assigned a low probability of having come from the same place. We emphasize that these results are not outliers1 they show typical system performance. Figure 10 shows matching performance in the presence of scene change. Many of these image pairs have far fewer visual words in common than the examples of perceptual aliasing, yet are assigned a high probability of having come from the same place. The system can reliably reject perceptual aliasing, even when images have as much as 46% of visual words in common (e.g. Figure 9(b)), while detecting loop closures where images have as few as 8% of words in common. Poor probability estimates do occasionally occur—some examples of images incorrectly assigned a high probability of a place match are shown in Figure 13. Note, however, that the typical true loop closure receives a much higher probability of a match. Neither of the examples in Figure 13 met the data association threshold of 0.99. 5.3.1. Comparing Approximations This section presents a comparison of the alternative approximations discussed in Section 4—naive Bayes versus Chow Liu approximations to p1Z 7L i 2 and mean field versus. Monte Carlo approximations to p1Z k 72 k61 2. Figure 11 shows precision–recall curves from the New College data set for the four possible combinations. Timing and accuracy results are summarized in Table 1. The Chow Liu and Monte Carlo approximations both considerably improve performance, particularly at high levels of precision, which is the region that con- Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. 660 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 Fig. 10. Some examples of images that were assigned a high probability of having come from the same place, despite scene change. Words common to both images are shown in green, others in red. The probability that the two images come from the same place is indicated between the pairs. Table 1. Comparison of the four different approximations. The recall rates quoted are at 100% precision. The time to process a new observation is given as a function of the number of places already in the map, plus a fixed cost to perform the sampling. Timing results are for a 3 GHZ Pentium IV. Algorithm Recall: New College Recall: City Center Run time Mean Field, Naive Bayes 33% 16% 0.6 ms/place Mean Field, Chow Liu 35% 31% 1.1 ms/place Monte Carlo, Naive Bayes 40% 31% 0.6 ms/place + 1.71 s sampling Monte Carlo, Chow Liu 47% 37% 1.1 ms/place + 3.15 s sampling cerns us most. Some examples of typical loop closures detected by the Chow Liu approximation but not by the naive Bayes are shown in Figure 12. The extra performance comes at the cost of some increase in computation time1 however, even the slowest version using Chow Liu and Monte Carlo approximations is still relatively fast. Running times are summarized in Table 1. The maximum time taken to process a new observation over all data sets was 5.9 s. As the robot collects images approximately every 2 s, this is not too far from real-time performance. 
The dominant computational cost is the calculation of p1Z7L i 2 for each place model in the map and each sampled place. Each of these calculations is independent, so the algorithm is highly parallelizable and will perform well on multi-core processors. 6.3.2. Generalization Performance Because FAB-MAP relies on a learned appearance model to reject perceptual aliasing and improve inference, a natural question is to what extent system performance will degrade in environments very dissimilar to the training set. To investigate Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance 661 Fig. 11. Precision–recall curves for the four variant algorithms on the New College data sets. Note the scale. Relative performance on the City Center data set is comparable. Fig. 12. Some examples of situations where the Chow Liu approximation outperforms naive Bayes. In (a), a change in lighting means that the feature detector does not fire on the windows of the building. In (b), the people are no longer present. In (c), the foreground text and the scaffolding in the top right are not present in the second image. In each of these cases, the missing features are known to be correlated by the Chow Liu approximation, hence the more accurate probability. Words common to both images are shown in green, others in red. The probability that the two images come from the same place (according to both the Chow Liu and naive Bayes models) is indicated between the pairs. Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. 662 THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH / June 2008 Fig. 13. Some images from different locations incorrectly assigned a high probability of having come from the same place. In (a), the training set contains no examples of railings, so the matched features are not known to be correlated. In (b), we encounter the same truck again in a different part of the workspace. Errors of this type are particularly challenging. Note that while both images are assigned a high probability of a match, a typical true loop closure is assigned a much higher probability. Neither of these image pairs met our p 2 0499 data association threshold. this question we have recently performed some preliminary indoor navigation experiments with data from a hand-held video camera, using the same training data and algorithm parameters as used outdoors. The quality of the imagery in this indoor data set is considerably poorer than that from the robot, due to adverse lighting conditions and greater motion blur. Interestingly, however, system performance is largely similar to our outdoor results (46% recall at 100% precision). The fact that the system works indoors despite using outdoor training data is quite surprising. Most interestingly, the Chow Liu approximation continues to noticeably outperform the naive Bayes on the indoor data set (46% recall as opposed to 39%). This suggests that some of the correlations being learned by the Chow Liu tree model generic low-level aspects of imagery. These results, while preliminary, indicate that the system will degrade gracefully in environments dissimilar to the training set. See (Extension 4 and 5) for video results. 
on appearance information only. The system is robust even in visually repetitive environments and is fast enough for online loop closure detection in realistic mobile robotics applications. In our evaluation, FAB-MAP was successful in detecting large portions of loop closures in challenging outdoor environments, without false positives. Two aspects of our results are particularly noteworthy. Firstly, learning a generative model of our bag-of-words observations yielded a marked improvement in performance. We think that this technique will have applications beyond the specific problem addressed in this paper. Secondly, we have observed that the system appears to perform well even in environments quite dissimilar to the training data. This suggests that the approach is practical for deployment in unconstrained navigation tasks, as a natural compliment to more typical metric SLAM algorithms. Acknowledgments 6. Conclusions This paper has introduced the FAB-MAP algorithm, a probabilistic framework for navigation and mapping which relies The work reported in this paper was funded by the Systems Engineering for Autonomous Systems (SEAS) Defence Technology Centre established by the UK Ministry of Defence and by the Engineering and Physical Sciences Research Council. Downloaded from http://ijr.sagepub.com at Oxford University Libraries on June 5, 2008 © 2008 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution. Cummins and Newman / FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance Appendix A: Appendix of Multimedia Extensions 663 where 9 The multimedia extension page is found at http://www.ijrr.org. 2 p1z q 2 p1z q 7eq 2 p1z q 7z p 23 2 p1z q 2 p1z q 7eq 2 p1z q 7z p 24 Table of Multimedia Extensions Extension Type Description 1 Video Results for the New College Dataset. 2 Video Results for the City Centre Dataset. 3 Data Images, GPS coordinates and ground truth labels. 4 Video Generalization Performance – Indoor Dataset A 5 Video Generalization Performance – Indoor Dataset B References Appendix B: Derivation This appendix presents the derivation of Equation (12) from Section 4.3. For compactness of notation, in this appendix z q 2 szq will be denoted as z q and z q 2 szq as z q , etc. We seek to express the term p1zq 7eq 3 z p 2 as a function of the known quantities p1zq 2, p1z q 7eq 2, p1z q 7z p 2. Applying Bayes rule, p1z q 7eq 3 z p 2 2 p1eq 7z q 3 z p 2 p1z q 7z p 2 p1eq 7z p 2 now expanding p1eq 7z p 2 as p1eq 7z p 2 2 p1eq 7z q 3 z p 2 p1z q 7z p 2 p1eq 7z q 3 z p 2 p1z q 7z p 23 and making the approximation p1eq 7z q 3 z p 2 expression becomes p1eq 7z q 2, the p1eq 7z q 2 p1z q 7z p 2 p1eq 7z q 2 p1z q 7z p 2 p1eq 7z q 2 p1z q 7z p 2 p1z q 7eq 3 z p 2 2 6 1 p1eq 7z q 2 p1z q 7z p 2 p1eq 7z q 2 p1z q 7z p 2 Now p1eq 7z q 2 2 761 p1z q 7eq 2 p1eq 2 p1z q 2 and similarly for p1eq 7z q 2, so p1z q 7eq 2 p1z q 2 p1eq 7z q 2 2 3 p1eq 7z q 2 p1z q 7eq 2 p1z q 2 and substituting back yields p1z q 7eq 3 z p 2 6 1 9 761 3 4 Bach, F. R. and Jordan, M. I. (2002). Thin junction trees. Advances in Neural Information Processing Systems. Bay, H., Tuytelaars, T. and Van Gool, L. (2006). SURF: Speeded Up Robust Features. Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, Vol. 13, pp. 404–417. Bishop, Y. M., Fienberg, S. E. and Holland, P. W. (1977). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA, MIT Press. Bowling, M., Wilkinson, D. and Ghodsi, A. (2006). Subjective mapping. 
References

Bach, F. R. and Jordan, M. I. (2002). Thin junction trees. Advances in Neural Information Processing Systems.

Bay, H., Tuytelaars, T. and Van Gool, L. (2006). SURF: Speeded Up Robust Features. Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, Vol. 13, pp. 404–417.

Bishop, Y. M., Fienberg, S. E. and Holland, P. W. (1977). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA, MIT Press.

Bowling, M., Wilkinson, D. and Ghodsi, A. (2006). Subjective mapping. Twenty-First National Conference on Artificial Intelligence. Boston, MA, AAAI Press.

Bowling, M., Wilkinson, D., Ghodsi, A. and Milstein, A. (2005). Subjective localization with action respecting embedding. Proceedings of the International Symposium of Robotics Research.

Chen, C. and Wang, H. (2006). Appearance-based topological Bayesian inference for loop-closing detection in a cross-country environment. International Journal of Robotics Research, 25(10): 953–983.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3): 462–467.

Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C. (2001). Introduction to Algorithms, 2nd edn. Cambridge, MA, MIT Press.

Cummins, M. and Newman, P. (2007). Probabilistic appearance based navigation and loop closing. Proceedings IEEE International Conference on Robotics and Automation (ICRA'07), Rome.

Dementiev, R., Sanders, P., Schultes, D. and Sibeyn, J. (2004). Engineering an external memory minimum spanning tree algorithm. Proceedings 3rd International Conference on Theoretical Computer Science.

Ferreira, F., Santos, V. and Dias, J. (2006). Integration of multiple sensors using binary features in a Bernoulli mixture model. Proceedings IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 104–109.

Filliat, D. (2007). A visual bag of words method for interactive qualitative localization and mapping. Proceedings IEEE International Conference on Robotics and Automation, pp. 3921–3926.

Frome, A., Huber, D., Kolluri, R., Bulow, T. and Malik, J. (2004). Recognizing objects in range data using regional point descriptors. Proceedings of the European Conference on Computer Vision. Berlin, Springer.

Goedemé, T. (2006). Visual navigation. Ph.D. Thesis, Katholieke Universiteit Leuven.

Gumbel, E. J. (1958). Statistics of Extremes. New York, Columbia University Press.

Gutmann, J. and Konolige, K. (1999). Incremental mapping of large cyclic environments. Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Monterey, CA, pp. 318–325. http://citeseer.ist.psu.edu/gutmann00incremental.html.

Ho, K. and Newman, P. (2005a). Combining visual and spatial appearance for loop closure detection. Proceedings of the European Conference on Mobile Robotics.

Ho, K. and Newman, P. (2005b). Multiple map intersection detection using visual appearance. Proceedings of the International Conference on Computational Intelligence, Robotics and Autonomous Systems, Singapore.

Ho, K. L. and Newman, P. (2007). Detecting loop closure with scene sequences. International Journal of Computer Vision, 74(3): 261–286.

Johnson, A. (1997). Spin-Images: a representation for 3-D surface matching. Ph.D. Thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2): 183–233.

Kröse, B. J. A., Vlassis, N. A., Bunschoten, R. and Motomura, Y. (2001). A probabilistic model for appearance-based robot localization. Image and Vision Computing, 19(6): 381–391.

Lamon, P., Nourbakhsh, I., Jensen, B. and Siegwart, R.
(2001). Deriving and matching image fingerprint sequences for mobile robot localization. Proceedings of the IEEE International Conference on Robotics and Automation, Seoul, Korea.

Levin, A. and Szeliski, R. (2004). Visual odometry and map correlation. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 611–618.

Li, F. and Kosecka, J. (2006). Probabilistic location recognition using reduced feature set. IEEE International Conference on Robotics and Automation, Orlando, FL.

Lowe, D. G. (1999). Object recognition from local scale-invariant features. Proceedings of the 7th International Conference on Computer Vision, Kerkyra, pp. 1150–1157.

Meilă, M. (1999). An accelerated Chow and Liu algorithm: fitting tree distributions to high-dimensional sparse data. ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning. San Francisco, CA, Morgan Kaufmann, pp. 249–257.

Meilă-Predoviciu, M. (1999). Learning with mixtures of trees. Ph.D. Thesis, Massachusetts Institute of Technology.

Mozos, O., Stachniss, C. and Burgard, W. (2005). Supervised learning of places from range data using AdaBoost. Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1730–1735.

Newman, P. and Ho, K. (2005). SLAM – loop closing with visually salient features. Proceedings of the IEEE International Conference on Robotics and Automation.

Newman, P. M., Cole, D. M. and Ho, K. (2006). Outdoor SLAM using visual appearance and laser ranging. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Orlando, FL.

Nister, D. and Stewenius, H. (2006). Scalable recognition with a vocabulary tree. Proceedings of the Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 2161–2168.

Page, L., Brin, S., Motwani, R. and Winograd, T. (1998). The PageRank citation ranking: bringing order to the web. Technical Report, Stanford Digital Library Technologies.

Ramos, F. T., Upcroft, B., Kumar, S. and Durrant-Whyte, H. F. (2005). A Bayesian approach for place recognition. Proceedings of the IJCAI Workshop on Reasoning with Uncertainty in Robotics.

Ranganathan, A. and Dellaert, F. (2006). A Rao–Blackwellized particle filter for topological mapping. Proceedings of the International Conference on Robotics and Automation, Orlando, FL. http://www.cc.gatech.edu/dellaert/pub/Ranganathan06icra.pdf.

Schindler, G., Brown, M. and Szeliski, R. (2007). City-scale location recognition. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7.

Se, S., Lowe, D. G. and Little, J. (2005). Vision based global localisation and mapping for mobile robots. IEEE Transactions on Robotics, 21(3): 364–375.

Silpa-Anan, C. and Hartley, R. (2004). Localisation using an image-map. Proceedings of the Australian Conference on Robotics and Automation.

Silpa-Anan, C. and Hartley, R. (2005). Visual localization and loop-back detection with a high resolution omnidirectional camera. Proceedings of the Workshop on Omnidirectional Vision.

Sim, R. and Dudek, G. (1998). Mobile robot localization from learned landmarks. Proceedings of the IEEE International Conference on Intelligent Robots and Systems, p. 2.

Sivic, J. and Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. Proceedings of the International Conference on Computer Vision, Nice, France.

Squire, D., Müller, W., Müller, H. and Pun, T. (2000). Content-based query of image databases: inspirations from text retrieval.
Pattern Recognition Letters, 21(13–14): 1193–1198.

Srebro, N. (2001). Maximum likelihood bounded tree-width Markov networks. Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI-01). San Francisco, CA, Morgan Kaufmann, pp. 504–551.

Torralba, A., Murphy, K. P., Freeman, W. T. and Rubin, M. A. (2003). Context-based vision system for place and object recognition. ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision. Washington, DC, IEEE Computer Society, p. 273.

Ulrich, I. and Nourbakhsh, I. (2000). Appearance-based place recognition for topological localization. Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Vol. 2, pp. 1023–1029.

Wang, J., Zha, H. and Cipolla, R. (2005). Vision-based global localization using a visual vocabulary. Proceedings of the IEEE International Conference on Robotics and Automation.

Wolf, J., Burgard, W. and Burkhardt, H. (2005). Robust vision-based localization by combining an image-retrieval system with Monte Carlo localization. IEEE Transactions on Robotics, 21(2): 208–216.

Zivkovic, Z., Bakker, B. and Kröse, B. (2005). Hierarchical map building using visual landmarks and geometric constraints. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 7–12.