Import Kobo highlights to Calibre.
Import your Kobo e-reader highlights into calibre e-book viewer.
Calibre's built-in viewer supports highlights since 5.0. Thanks to this script, you can populate the native calibre highlights with the highlights from your Kobo e-reader.
This is different from calibre's Annotations plugin which concatenates the text of all your highlights in a new metadata field in your books library, this script plugs them directly into calibre's native reader highlighting feature.
Warning: this works by writing directly to metadata.db
in your
Calibre library. It only inserts into the annotations
table and only
updates the rows it inserted (addressed by highlight UUID). I would
still recommend you do a backup of your metadata.db
before running
this just in case.
See all the details below.
git clone https://github.com/valeriangalliat/kobo-highlights-to-calibre.git
cd kobo-highlights-to-calibre
npm install
node index.js <kobo-volume> <calibre-library>
Example:
node index.js "/run/media/$USER/KOBOeReader" "$HOME/Calibre Library"
Creating highlights should work fine, but if you want to update highlight data (e.g. when working on the script), you might notice that Calibre restores the previous state of the highlight when you open the viewer!
This is because calibre stores a copy of the highlight data from its database in a bunch of different files.
The best way to identify those is to run:
grep -R --files-with-matches <uuid> ~/.cache/calibre ~/.config/calibre ~/Calibre\ Library
Where <uuid>
is the UUID of the highlight you're working with.
It will yield something like this:
~/.cache/calibre/some/random/stuff/calibre-book-annotations.json
~/.config/calibre/viewer/annots/some-random-hash.json
~/Calibre Library/path/to/book/metadata.opf
~/Calibre Library/metadata.db
I usually delete everything but metadata.db
(that's your calibre
database, don't delete it!).
Then running the script again and opening the viewer should show the updated highlight.
This project is inspired by Kobo2Calibre, and while this program didn't work for me (Python dependencies issues and incompatibilities with 3.10), plus the fact I wasn't super keen on depending on a natural language processing (NLP) model for this task, it really helped me to figure everything I needed to build my own version.
Here's the gist.
When mounted over USB, the Kobo volume contains a .kobo
directory,
with a bunch of interesting things in it.
One of them is KoboReader.sqlite
, a SQLite database containing even
more interesting things, including a Bookmark
table with the highlight
data.
$ sqlite3 .kobo/KoboReader.sqlite
sqlite> .tables
AbTest KoboPlusAssets Wishlist
Achievement OverDriveCards WordList
Activity OverDriveCheckoutBook content
AnalyticsEvents OverDriveLibrary content_keys
Authors Reviews content_settings
BookAuthors Rules ratings
Bookmark Shelf shortcover_page
DbVersion ShelfContent user
DropboxItem SubscriptionProducts volume_shortcovers
Event SyncQueue volume_tabs
KoboPlusAssetGroup Tab
sqlite> .schema --indent Bookmark
CREATE TABLE Bookmark(
BookmarkID TEXT NOT NULL,
VolumeID TEXT NOT NULL,
ContentID TEXT NOT NULL,
StartContainerPath TEXT NOT NULL,
StartContainerChildIndex INTEGER NOT NULL,
StartOffset INTEGER NOT NULL,
EndContainerPath TEXT NOT NULL,
EndContainerChildIndex INTEGER NOT NULL,
EndOffset INTEGER NOT NULL,
Text TEXT,
Annotation TEXT,
ExtraAnnotationData BLOB,
DateCreated TEXT,
ChapterProgress REAL NOT NULL DEFAULT 0,
Hidden BOOL NOT NULL DEFAULT 0,
Version TEXT,
DateModified TEXT,
Creator TEXT,
UUID TEXT,
UserID TEXT,
SyncTime TEXT,
Published BIT default false,
ContextString TEXT,
Type TEXT,
PRIMARY KEY(BookmarkID)
);
CREATE INDEX bookmark_content ON bookmark(ContentID);
CREATE INDEX bookmark_volume ON bookmark(VolumeID);
If we look at a specific highlight of mine:
sqlite> .mode line
sqlite> select * from Bookmark where Text like '%The best moments%';
BookmarkID = d0d2f956-f666-4214-90a2-0423530e622e
VolumeID = file:///mnt/onboard/Csikszentmihalyi, Mihaly/Flow - Mihaly Csikszentmihalyi.epub
ContentID = file:///mnt/onboard/Csikszentmihalyi, Mihaly/Flow - Mihaly Csikszentmihalyi.epub#(6)OEBPS/cha01.xhtml
StartContainerPath = OEBPS/cha01.xhtml#point(/1/4/2/26/1:223)
StartContainerChildIndex = -99
StartOffset = 0
EndContainerPath = OEBPS/cha01.xhtml#point(/1/4/2/26/1:385)
EndContainerChildIndex = -99
EndOffset = 0
Text = . The best moments usually occur when a person’s body or mind is stretched to its limits in a voluntary effort to accomplish something difficult and worthwhile.
Annotation =
ExtraAnnotationData = ¤
DateCreated = 2022-03-08T03:55:38.421
ChapterProgress = 0.0315533980582524
Hidden = false
Version = ¤
DateModified = ¤
Creator = ¤
UUID = ¤
UserID = 2b54ff11-671f-4853-a4a7-a98bb20d9836
SyncTime = ¤
Published = false
ContextString = ¤
Type = highlight
Here we got:
- A UUID in
BookmarkID
that we can use to identify this highlight uniquely. - A
VolumeID
which doesn't have any join in this database but refers to a file on the volume. - A
ContentID
that has a join with theContent
table, useful to get the book title and the chapter title for a given highlight. - The start and end path of the highlight (
StartContainerPath
andEndContainerPath
). In my caseStartOffset
andEndOffset
seem unused and the offset is part of the path (more on that later). - The highlighted
Text
.
So what do we do with those start and end paths? Well, let's see what an EPUB is made of (BTW this only supports EPUB for now, I don't use other formats and I have no clue what they look like).
It turns out an EPUB is just a ZIP archive containing a bunch of
metadata XML and content HTML files. So in the path
OEBPS/cha01.xhtml#point(/1/4/2/26/1:223)
, Kobo means we're looking at
the file OEBPS/cha01.xhtml
in the ZIP archive, and the point we're
targeting is /1/4/2/26/1:223
. But what is /1/4/2/26/1:223
?
There's a thing in the EPUB spec called canonical fragment identifiers (CFI), which look a lot like those Kobo path fragments.
How do they work? epub-cfi-resolver does a great job at explaining this.
Essentially, /1/4/2/26/1:223
means we're looking at the 223th
character of the first child node of the 26th child node of the... you get
the idea.
The trick is that CFI differs from the DOM standard and counts element nodes with even numbers and text and CDATA nodes with odd numbers. There's also a bunch of rules about combining and counting adjacent text nodes.
So in DOM terms, the above path is actually 223th character of the first text node of the 13th element node of the 1st element node of the second element node of the first text node.
That last part doesn't make any sense, because a XML document root isn't
a text node. This is a Kobo quirk and we need to deal with it. Most
valid CFI paths otherwise start with /2/
for the root node.
The next fun thing is about the text offset. See, in /1/4/2/26/1:223
,
the spec defines the offset following the :
character as a "character
offset". But Kobo actually puts a byte offset in there.
This is why if you forward this offset as is, you'll notice that in Calibre viewer, the highlights are often shifted by a few characters.
Why is the byte offset different from the character offset? Because some characters can span on multiple bytes (hello Unicode).
Fun fact: this is the reason why Kobo2Calibre depends on a natural language processing model, by identifying the sentence boundaries they can work around the shift that's otherwise introduced by the Kobo bytes vs. characters offset problem.
So there's some extra massaging to be done here to make sure we identify the real Unicode character offset from the byte offset we have. This is tricky because of the rules around text and CDATA nodes in the CFI spec, but overall it seems that an incomplete implementation that just works with the current DOM text node without any CFI-specific handling works quite well in most cases.
Now, this works the majority of the time, but I still get some highlights that are off, and other highlights that just don't show at all on the calibre side.
I tried to debug those a bit, but from what I can tell, Kobo's path (other than the quirks we talked about earlier) matches perfectly with the compliant epub-cfi-resolver implementation, so the issues might be more on the calibre side?
These are the issues I identified so far:
- Sometimes Kobo targets empty text nodes in between paragraphs as the start or end point, and this can confuse calibre. It would work better if we targeted the first or last character of a paragraph text node instead.
- There's some cases where calibre's idea of text nodes doesn't seem to match the implementation of Kobo nor epub-cfi-resolver and I can't myself understand their logic based on my comprehension of the spec. I don't think those are easily fixable but luckily it represents a small subset of highlights for me.
- Sometimes even after proper bytes to Unicode characters offset conversion, the highlight on calibre is still shifted by a few words, and I couldn't identify the reason or anything that would deterministically find an offset that's compatible with calibre.