I have been recently exploring the Vivaldi history file. I like to keep all my history forever. My browser, Vivaldi (based on chrome), allows syncing history. This works for new entries, but I wanted to see if there was a way to merge the history files for some of my devices that started off in a diverged state. Each of my files are a few years old, are are several hundred MB in size. The largest has hundreds of thousands of URLs, and that’s after a lot of pruning of URLs I didn’t care to keep.
My assumption was that the history file would be fairly simple. What I would guess a browser would need is as follows:
- URL
- Visited time
- Maybe the title of the page
- Probably some indexes on URLs and visit times to allow for better searching
It turns out the history file is vastly more complex than this, and I am struggling to figure out what is going on. It is certainly clear that merging multiple files will not be as simple as concatenating them together.
I explored these using the SQLite DB Browser from here. I copied my history files into a new directory so I could work on them without affecting my main browser profile. As I went through each table in the database, I made a note of key information in a UML diagram, to help make sense of how it fits together. As well as the history, the browser includes all the information about file downloads in the history database too, so there is some space devoted to that.

There are 19 tables in total. 13 deal directly with history:
cluster_keywords
cluster_visit_duplicates
clusters
clusters_and_visits
content_annotations
context_annotations
keyword_search_terms
segment_usage
segments
urls
visit_source
visited_links
visit
3 with downloads:
downloads
downloads_slices
downloads_url_chains
and there are 3 others:
history_sync_metadata
meta
sqlite_sequence
Of the main 13 databases, most of the useful info, as far as I can tell, is stored in URLs. keyword_search_terms
contains info about typed history entries, such as searches, that have been entered into the address bar. The download tables are fairly self explanatory:
downloads
has all the relevant metadata about a downloaded file,download_slices
has info that seems to help for resuming downloads, including offsets,download_url_chains
keeps track of redirected URLs so you can see the true location a file was downloaded from
The visited URLs stored in the urls
database have IDs, URLs, Titles, Visit Counts, Typed Counts, Last visit time, all of which makes sense, and this is the table that I was expecting to make up the entire history file. Almost immediately, I saw things that confuse me:
- It is possible to have a URL with 0 visits. Why is that stored?,
- The table records if a URL is “hidden”, I again have to wonder why it’s even in the history file.
- The URLs are also tracked in a
segments
table, and there is another table beyond that calledsegment_usage
,and I’ve no idea why - Another table tracks visited links, presumably when you click a link on one page to another it ends up here, which seems redundant considering that the visited links will already be recorded as visited URLs. And many of the linked URLs in this table seem identical, so I’m not sure what it’s recording.
A major structure of the history file though is centred around the concept of keeping track of visits. The visits
table seems to keep track of a journey that is taken across multiple URLs. Given that the sites are stored in the URLs with a last_visited timestamp, I tried to imagine what possible use there would be in tracking all this metadata for individual visits. If you want to record visits on different days, you only need a unique pair of URL id and a timestamp per row in a separate table for that, so all this extra stuff seems unnecessary. Visits are then referenced in 3 tables that deal with something called clusters
which I cannot make head nor tail of, but seem to have something to do with search terms, given the cluster_keywords table. Four additional tables make reference to visits:
visit_source
, which seems to assign each visit a ‘source”, numbered in my case from 0 to 6. I am guessing this is some sort of classification of a type, but I am not sure what the difference iscontext_annotations
, which keeps track of window IDs, tab IDs, task IDs, and time since last visit for each visit and I cannot see why these would be in any way useful to remember.cluster_visit_duplicates
keeps track of duplicate visits, and I don’t see why this is helpful to know nor what it is actually used for, and doesn’t seem connected to the other ‘cluster’ tablescontent_annotations
was an almost completely empty table with columns referencing topics and floc, google’s latest tracking endeavour, so I am dismissing any utility there.
Additional things I have discovered:
- Timestamps in the database are done using a 17-digit 64-bit timestamp since Jan 1 1601 – here is a converter
- If you delete a bunch of history entries in browser, the browser won’t compact the database afterwards
- In some tables, such as context_annotations, many entries are simply blank rows, or contain default values, rather than being deleted from the database or not recorded at all
After all this, my original intention, to be able to merge just the parts of the history I am interested in, required just the URLS and when I visited them, and I now know where they are. I used this query to export all the relevant data:
SELECT urls.url, urls.title, datetime(((visits.visit_time-11644473600000000)/1000000),'unixepoch') FROM visits LEFT JOIN urls ON visits.url = urls.id ORDER BY visits.visit_time ASC;
If I export this into a basic CSV file it comes to about 1/4 the size of the database file. Because it’s recording each visit, this ends up with a lot of duplicated data, but it’s still a fraction of the original size. I can easily concatenate this with other files as needed, sort them together by timestamp. It does leaves me wondering why any of the other stuff is needed? While I am sure there are valid engineering considerations that have gone into all of it, it seems really overcomplicated.
My reference point for good no-nonsense web browser is classic opera. I had a look at my history file for Opera 12, to see how they managed it back then. And, oh how beautiful it is. A UTF8 text file, with 4 lines per item: Title, URL, Date, and Time. Why did we fix something that wasn’t broken?