Exploring the Vivaldi History Database

I have been recently exploring the Vivaldi history file. I like to keep all my history forever. My browser, Vivaldi (based on chrome), allows syncing history. This works for new entries, but I wanted to see if there was a way to merge the history files for some of my devices that started off in a diverged state. Each of my files are a few years old, are are several hundred MB in size. The largest has hundreds of thousands of URLs, and that’s after a lot of pruning of URLs I didn’t care to keep.

My assumption was that the history file would be fairly simple. What I would guess a browser would need is as follows:

URL
Visited time
Maybe the title of the page
Probably some indexes on URLs and visit times to allow for better searching

It turns out the history file is vastly more complex than this, and I am struggling to figure out what is going on. It is certainly clear that merging multiple files will not be as simple as concatenating them together.

I explored these using the SQLite DB Browser from here. I copied my history files into a new directory so I could work on them without affecting my main browser profile. As I went through each table in the database, I made a note of key information in a UML diagram, to help make sense of how it fits together. As well as the history, the browser includes all the information about file downloads in the history database too, so there is some space devoted to that.

UML diagram of the history table, visualising what is discussed in text in the blog post

There are 19 tables in total. 13 deal directly with history:

cluster_keywords
cluster_visit_duplicates
clusters
clusters_and_visits
content_annotations
context_annotations
keyword_search_terms
segment_usage
segments
urls
visit_source
visited_links
visit

3 with downloads:

downloads
downloads_slices
downloads_url_chains

and there are 3 others:

history_sync_metadata
meta
sqlite_sequence

Of the main 13 databases, most of the useful info, as far as I can tell, is stored in URLs. keyword_search_terms contains info about typed history entries, such as searches, that have been entered into the address bar. The download tables are fairly self explanatory:

downloads has all the relevant metadata about a downloaded file,
download_slices has info that seems to help for resuming downloads, including offsets,
download_url_chains keeps track of redirected URLs so you can see the true location a file was downloaded from

The visited URLs stored in the urls database have IDs, URLs, Titles, Visit Counts, Typed Counts, Last visit time, all of which makes sense, and this is the table that I was expecting to make up the entire history file. Almost immediately, I saw things that confuse me:

It is possible to have a URL with 0 visits. Why is that stored?,
The table records if a URL is “hidden”, I again have to wonder why it’s even in the history file.
The URLs are also tracked in a segments table, and there is another table beyond that called segment_usage,and I’ve no idea why
Another table tracks visited links, presumably when you click a link on one page to another it ends up here, which seems redundant considering that the visited links will already be recorded as visited URLs. And many of the linked URLs in this table seem identical, so I’m not sure what it’s recording.

A major structure of the history file though is centred around the concept of keeping track of visits. The visits table seems to keep track of a journey that is taken across multiple URLs. Given that the sites are stored in the URLs with a last_visited timestamp, I tried to imagine what possible use there would be in tracking all this metadata for individual visits. If you want to record visits on different days, you only need a unique pair of URL id and a timestamp per row in a separate table for that, so all this extra stuff seems unnecessary. Visits are then referenced in 3 tables that deal with something called clusters which I cannot make head nor tail of, but seem to have something to do with search terms, given the cluster_keywords table. Four additional tables make reference to visits:

visit_source, which seems to assign each visit a ‘source”, numbered in my case from 0 to 6. I am guessing this is some sort of classification of a type, but I am not sure what the difference is
context_annotations, which keeps track of window IDs, tab IDs, task IDs, and time since last visit for each visit and I cannot see why these would be in any way useful to remember.
cluster_visit_duplicates keeps track of duplicate visits, and I don’t see why this is helpful to know nor what it is actually used for, and doesn’t seem connected to the other ‘cluster’ tables
content_annotations was an almost completely empty table with columns referencing topics and floc, google’s latest tracking endeavour, so I am dismissing any utility there.

Additional things I have discovered:

Timestamps in the database are done using a 17-digit 64-bit timestamp since Jan 1 1601 – here is a converter
If you delete a bunch of history entries in browser, the browser won’t compact the database afterwards
In some tables, such as context_annotations, many entries are simply blank rows, or contain default values, rather than being deleted from the database or not recorded at all

After all this, my original intention, to be able to merge just the parts of the history I am interested in, required just the URLS and when I visited them, and I now know where they are. I used this query to export all the relevant data:

SELECT urls.url, urls.title, datetime(((visits.visit_time-11644473600000000)/1000000),'unixepoch') FROM visits LEFT JOIN urls ON visits.url = urls.id ORDER BY visits.visit_time ASC;

If I export this into a basic CSV file it comes to about 1/4 the size of the database file. Because it’s recording each visit, this ends up with a lot of duplicated data, but it’s still a fraction of the original size. I can easily concatenate this with other files as needed, sort them together by timestamp. It does leaves me wondering why any of the other stuff is needed? While I am sure there are valid engineering considerations that have gone into all of it, it seems really overcomplicated.

My reference point for good no-nonsense web browser is classic opera. I had a look at my history file for Opera 12, to see how they managed it back then. And, oh how beautiful it is. A UTF8 text file, with 4 lines per item: Title, URL, Date, and Time. Why did we fix something that wasn’t broken?

Join the Conversation

doctorg says:

2025-03-10 at 1504

Nice information 🙂

sorgel says:

2025-03-11 at 1101

https://chromium.googlesource.com/chromium/src/+/master/components/history/core/browser/history_types.h

Here are some explanations about clusters. Mostly it’s all about tracking, which makes sense. Opera’s business was not about ranking pages and understanding audience.

1. lonm says:
  
  2025-03-11 at 1112
  
  Thanks, this is exactly the info I was hoping to find, but couldn’t. Reading this I suspect a lot of the extra stuff is optimisations to help with the chrome style omnibox and its suggestions.
  
greybeard says:

2025-03-23 at 1901

As the good doctor says, great info.
As you say there do seem to be redundancies… Perhaps as a newer chromium build came along they decided on a new db structure? Perhaps V wanted its own db to fit with its UI?
I used to do that with my Bookmarks in O12 and early Vivaldi but it all seems to have changed or my Four Active Brain Cells just can’t cope anymore.
Well done Lon

Yury Trofimenko says:

2025-04-01 at 0724

Thanks for the research – I used it to cleanup my history database quite a bit.

Let me add to this in case it’s useful to anybody as it’s also related to history: with the latest version of Vivaldi, the browser gets frozen for 3 minutes on startup if it has a history file with more than 100K records (approximately) in the urls table specifically (it doesn’t seem to matter how much data there is in other tables). If you get below the threshold, the problem goes away entirely. I reported it as a bug to VIvaldi, but hasn’t got a response yet.

Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Join the Conversation

Cancel reply

Comment