Browser History

From MozillaWiki
Jump to navigation Jump to search

Introduction

A redesign of the Global History component in Firefox using mozStorage and supplying additional capabilities to application developers. Questions about the history re-write can be sent to brettw<AT>gmail<DOT>com.

Separate but related is the Annotations service which can store arbitrary information about a page for extension developers.

New capabilities

The current history system uses the annoying Mork database format, and is impossible to find and difficult to use. Most people don't use it at all, and when they do, it can be difficult to find what you are looking for.

The new history system will hopefully fulfill the following goals:

  • Store much more history than the current default (9 days) and do so with minimal performance penalty.
  • Be faster and more efficient.
  • Have a better UI that people will want to use, and that allows them to find what they are looking for. The current history sidebar is too cramped and does not offer enough capabilities.
  • Store each time you visit a page, rather than just the most recent. This allows us to answer simple questions such as "What did I visit yesterday?" which is currently impossible.

Database design

The history will be stored in 2 SQL tables: "history" and "historytransitions".

The "history" table essentially duplicates the functionality of the current mork history table: It contains:

  • Unique ID (primary key)
  • URL
  • Title
  • Visit count
  • Last visit date
  • Host name (see below)
  • Hidden (bool)
  • Typed (bool)

Note that the host name is stored backwards, unlike the current table. This is done so that it can be indexed alphabetically and we can quickly pull out all pages within any "mozilla.org" domain by asking for hostname fields that begin with "gro.allizom." A period is always appended to the reversed hostname.

The second table stores transitions between pages, which is information unavailable now.

  • Source page ID
  • Destination page ID
  • Time
  • Transition type

Transition type will hopefully contain info about whether the link was clicked, opened in new tab/window, typed, etc.

Performance

The sqlite database system does file-level locking to manage access to the database. This means that for every transaction (even read), the database file must be checked for changes to update the cache, and locked. Most systems can do this quickly, although this operation is clearly not good for performance. Some Linux/Unix users may have their home directories and profiles stored on a network-mounted drive. In these cases, individual database accesses can be relatively expensive.

Therefore, it is very important that database access be minimized and grouped in transactions. A new query/result format will be used instead of RDF. RDF requires many extra transactions, to retrieve the ID of a URL, then to retrieve the title, then to retrieve the visit date, etc. This makes performance, especially for network users, unacceptably slow.

Link coloring

Link coloring is perhaps the most important performance issue, since this directly affects page load time. A naive approach requires querying the database for each link that comes in. Even when the history database is local, asking the OS for a file lock for each link on a page will be very bad for performance.

To solve this problem, URLs will be loaded into an in-memory database when the application starts (or, perhaps, the first time they are required). These links can be sorted so they can be accessed very quickly, and rquire no OS calls or file locks. This list will only contain URLs, and none of the extra information such as visit dates, visit counts, and titles.

Because this data will reside in memory, it is important to not take too much space. If we suceed in the goal of allowing users to keep large amounts of history, size could be significant. We will therefore want to limit the number of URLs that we bring into memory, either allowing, for example, the past 10K URLs, or the past 30 days of visits.

A two-level approach may improve perceived size. Most URLs in the table are ads and random images from web sites that the user will never see a link to, and probably wouldn't even care about if they did. Then there are the top-level pages (with hidden=false) that are the top-level URLs that the user actually visited. They are much more likely to care about these top-level links. To save space and speed things up, we could keep the past X URLs in memory, plus the previous Y top-level URLs. It is unlikely that users will even notice that most of their history links are not being used for link coloring.

Database size

I just imported my Mork history database file into the equivalent sqlite format. The sqlite file was 10x smaller than the original Mork file.