Places:AutoComplete

From MozillaWiki
(Please comment by clicking "discussion" above, and not by editing this page.)


The Places History AutoComplete uses a SQL database to find pages matching a given search query and display them to the user. This page describes how results are processed and handed to the front-end (e.g., Firefox 3), along with the optimizations that speed things up.


__TOC__


= Basic Work Flow =


The general outline of how a search query incrementally finds results is as follows:
# Initialize data structures
# Process search query
# Search database and save matches
# Update search status
# Continue searching if necessary


== Initialize data structures ==


Searches are started through the nsIAutoCompleteSearch::StartSearch interface, where we receive the search query and a listener for returning results. Here we standardize the query internally by trimming spaces, switching to lowercase, etc., to simplify matching it against other queries.


Here we initialize data to store the listener, new results, and information about bookmarks/livemarks. We also reset previous search information such as the status of the search, the search position/offset, and the matching URLs.
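A minimal sketch of this setup step (Python for illustration only; the real implementation is C++, and all names below are hypothetical):

```python
def normalize_query(raw):
    """Standardize a search query: trim surrounding spaces and
    lowercase it so comparisons against other queries are simple."""
    return raw.strip().lower()

class SearchState:
    """Hypothetical stand-in for the per-search data that gets reset
    at the start of each search."""
    def __init__(self, listener):
        self.listener = listener       # where results are delivered
        self.results = []              # matches found so far
        self.chunk_offset = 0          # position within history
        self.matching_urls = set()     # URLs matched by this query
        self.finished = False          # whether all of history was scanned
```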


== Process search query ==


The search query is a plain string that can be split into multiple words by spaces. Typically this is for multi-word searching, but special tokens are also processed to change the behavior of the search.


As the query is split into individual words/tokens, the special tokens are recorded as behavior changes and removed from the query, so we don't try to match them against the title/URL.
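A sketch of the tokenization step. The specific token characters and behavior names below are illustrative assumptions — the real set is controlled by preferences:

```python
# Hypothetical special tokens; the real characters and behaviors are
# preference-controlled, so treat this mapping as an example only.
SPECIAL_TOKENS = {
    "^": "history_only",
    "*": "bookmarks_only",
    "+": "tagged_only",
    "@": "match_url",
    "#": "match_title",
}

def tokenize(query):
    """Split a normalized query on spaces, pulling out special tokens
    that change search behavior so they are never matched against
    page titles/URLs."""
    behaviors = set()
    tokens = []
    for word in query.split():
        if word in SPECIAL_TOKENS:
            behaviors.add(SPECIAL_TOKENS[word])
        else:
            tokens.append(word)
    return tokens, behaviors
```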


== Search database and save matches ==


The bulk of the history processing happens in this stage, and it consists mostly of executing a SQL query and filtering the results. There is also some special behavior in the first chunk for adaptive learning and keyword searches.
 
=== Query database ===
 
The queries themselves are fairly simple: they blindly fetch all pages that fall within the current chunk (a fixed chunk size with an advancing chunk offset). One important property of the results is that they arrive in frecency order; the ordering is precomputed ahead of time, and the database has an index on that column.
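The chunked fetch can be sketched with a toy database. The schema, index name, and chunk size below are simplified assumptions; only the idea of fetching fixed-size chunks in precomputed frecency order reflects the text above:

```python
import sqlite3

# Tiny stand-in for the places database: a moz_places table with an
# index on frecency, so chunks come back in precomputed frecency order.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE moz_places (url TEXT, title TEXT, frecency INTEGER)")
db.execute("CREATE INDEX frecencyindex ON moz_places (frecency)")
db.executemany("INSERT INTO moz_places VALUES (?, ?, ?)", [
    ("http://mozilla.org/", "Mozilla", 200),
    ("http://example.com/", "Example", 50),
    ("http://wiki.mozilla.org/", "MozillaWiki", 140),
])

CHUNK_SIZE = 2  # the real chunk size differs; this is illustrative

def fetch_chunk(offset):
    """Blindly fetch one chunk of pages in descending frecency order."""
    return db.execute(
        "SELECT url, title FROM moz_places "
        "ORDER BY frecency DESC LIMIT ? OFFSET ?",
        (CHUNK_SIZE, offset)).fetchall()
```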
 
=== Filter results ===
 
The filtering of results is not handled by SQL as we do fairly complex matching against titles/tags/URLs with varying behavior specified by the user's preferences.
 
For each page returned by the database query, we get a number of properties/attributes such as page title, URL, tags, visit count, etc. For a page to match a query, it needs to have all tokens match somewhere in the title, URL, or tags. The filtering supports matching query tokens on word boundaries, anywhere, or at the beginning of each of those properties, depending on the preferences.
 
Special tokens and behavior preferences can further filter results by forcing URL matches or allowing only visited pages. These filters are also applied while checking each search token against each page.
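The all-tokens-must-match rule can be sketched as follows. This is a simplification (it covers only match-anywhere and word-boundary modes, not beginning-of-field matching or the behavior filters):

```python
import re

def matches(tokens, page, on_word_boundary=False):
    """A page matches only when every token appears somewhere in its
    title, URL, or tags (simplified sketch of the filtering step)."""
    fields = [page["title"], page["url"], " ".join(page["tags"])]
    for token in tokens:
        if on_word_boundary:
            pattern = r"\b" + re.escape(token)
            found = any(re.search(pattern, f.lower()) for f in fields)
        else:
            found = any(token in f.lower() for f in fields)
        if not found:
            return False  # one missing token rejects the page
    return True
```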
 
=== Store matches ===
 
Once a match is determined, we package it into a result that has the URL, title, favicon, and style. The title field is overloaded to also hold the page's tags, which are concatenated to the page/bookmark title and separated by an endash.
 
The style lets the front-end know whether it's a plain page, a bookmarked page, etc.
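A sketch of the packaging step. The result shape and the particular style strings are illustrative assumptions; the endash-in-title convention is the one described above:

```python
ENDASH = "\u2013"

def make_result(url, title, tags, bookmarked, favicon=None):
    """Package a match; the tags ride along in the title field,
    separated from the page/bookmark title by an endash, and the style
    string (hypothetical values here) hints at the kind of page."""
    display_title = title
    if tags:
        display_title = f"{title} {ENDASH} {', '.join(tags)}"
    return {
        "url": url,
        "title": display_title,
        "favicon": favicon,
        "style": "bookmark" if bookmarked else "favicon",
    }
```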
 
=== Extra first-chunk queries ===
 
The previous sections described the general query and filtering process, focusing on the full-history search that happens for every chunk. There are a couple of additional queries that use the same structure as above but provide a different set of results.
 
The first is adaptive learning, which shows pages the user has previously selected from the autocomplete after typing a similar query. The query is slightly different in that it uses another table to look up previous query->page mappings, but it goes through the same filtering process.
 
The second special query shows smart keyword searches, where the first token in the query is the keyword and the remaining tokens are the search parameters. We look in the bookmarks database to find a matching keyword and insert the search parameters into the "%s" portion of the bookmark URL.
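The "%s" substitution can be sketched like this (a simplification: the real code also handles escaping variants and keyword lookup in the bookmarks tables, which is omitted here):

```python
from urllib.parse import quote_plus

def expand_keyword(bookmark_url, query_tokens):
    """The first token selected the keyword bookmark; the remaining
    tokens become the search parameters inserted at the "%s"
    placeholder, URL-escaped for a query string."""
    params = " ".join(query_tokens[1:])
    return bookmark_url.replace("%s", quote_plus(params))
```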
 
== Update search status ==
 
After a chunk is finished, we report to the listener whether we've found results and whether we're going to continue searching.
 
This is done by setting the result status on the nsIAutoCompleteResult object and delivering the results to the nsIAutoCompleteListener.
 
== Continue searching if necessary ==
 
If we have enough search results, we stop searching; otherwise, we keep going until we run out of pages to process. The continuing search is set up with a timer that waits a short while before processing the next chunk.
 
Additionally, if the search behavior is to match on word boundaries first, it's possible for the search to start back at the beginning (from the first chunk) and process all the pages again without forcing a match on word boundaries. This is to ensure we show word boundary matches first even if a match-anywhere page has a higher frecency.
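The two-pass behavior can be sketched as follows. The matching helper is deliberately crude (title/URL substring or word-boundary checks only); what matters is the restart from the first chunk without the word-boundary requirement:

```python
import re

def token_matches(token, page, on_word_boundary):
    """Crude illustrative matcher over title and URL only."""
    text = (page["title"] + " " + page["url"]).lower()
    if on_word_boundary:
        return re.search(r"\b" + re.escape(token), text) is not None
    return token in text

def search_all(pages, tokens, max_results):
    """Two-pass sketch: pass 1 requires word-boundary matches; if
    history is exhausted before the list fills, pass 2 restarts from
    the first chunk matching anywhere, skipping pages already shown."""
    results, seen = [], set()
    for boundary in (True, False):
        for page in pages:  # restart from the beginning of history
            if len(results) >= max_results:
                return results
            if page["url"] in seen:
                continue  # already shown from the boundary pass
            if all(token_matches(t, page, boundary) for t in tokens):
                results.append(page)
                seen.add(page["url"])
    return results
```

Note how a word-boundary match surfaces ahead of a match-anywhere page even when the latter comes earlier in frecency order.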
 
= Optimizations =
 
There are a number of optimizations to help speed up finding matching pages. They tend to exploit the fact that the user types a query incrementally, i.e., each new query adds a letter to the end of the previous one.
 
# Stop searching when there are enough results
# Reuse existing results when filtering more
# Don't search if it won't find anything
 
== Stop searching when there are enough results ==
 
Instead of finding every page that matches a search query, we stop early once we have at least as many results as will be shown. This makes for a more responsive UI; we finish processing the current chunk before notifying the listener that we have results.
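A sketch of the early-stop loop (the maximum popup size below is an illustrative constant, not the real preference value):

```python
MAX_RESULTS = 6  # illustrative; the real popup size is a preference

def search_in_chunks(chunks, is_match, notify):
    """Process history chunk by chunk, notifying the listener after
    each full chunk, and stop as soon as the popup is full."""
    results = []
    for chunk in chunks:
        for page in chunk:
            if is_match(page):
                results.append(page)
        more = len(results) < MAX_RESULTS
        notify(results, searching=more)  # listener gets partial results
        if not more:
            break  # enough results; don't scan the remaining chunks
    return results
```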
 
== Reuse existing results when filtering more ==
 
Every time we find matching pages, we keep track of those URLs so that the next query can start searching from those matched pages. This is useful when the user types an additional character at the end of the query.
 
Additionally, we track which chunk we stopped at for the previous query and continue from that chunk if we start from the previously matched URLs. This is possible because an additional character in the query will not match anything a previous query didn't match.
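A sketch of the reuse logic, built on the monotonicity property above (all names are hypothetical stand-ins for the real C++ state):

```python
class PreviousSearch:
    """Hypothetical record of the last search, kept so a longer query
    can resume instead of rescanning history from the start."""
    def __init__(self, query, matches, chunk_offset, finished):
        self.query = query
        self.matches = matches            # pages the last query matched
        self.chunk_offset = chunk_offset  # chunk where the scan stopped
        self.finished = finished          # True if all history was scanned

def can_reuse(prev, new_query):
    """Reuse is valid only when the new query extends the old one,
    since adding characters can never match additional pages."""
    return prev is not None and new_query.startswith(prev.query)

def refine(prev, new_query, page_matches):
    """Re-filter the previous matches with the new query; scanning of
    the rest of history then resumes at prev.chunk_offset."""
    return [p for p in prev.matches if page_matches(new_query, p)]
```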
 
== Don't search if it won't find anything ==
 
If the previous query matched no URLs and the search completely finished scanning all places, we can immediately report that there will be no results when the user types more. This is useful when the user is typing in a new URL, either by editing an existing one or entering a brand-new one.
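The short-circuit condition reduces to a simple check (a sketch; the flag and argument names are hypothetical):

```python
def will_have_results(prev_matches, prev_finished, extends_previous):
    """If the last query scanned all of history, matched nothing, and
    the new query only adds characters, it cannot match anything
    either, so the search can be skipped entirely."""
    if extends_previous and prev_finished and not prev_matches:
        return False
    return True  # otherwise we actually have to search
```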

Latest revision as of 21:51, 12 February 2009
