docs: update README, add doc strings to all exported functionality
FlorianLoch committed Feb 23, 2024
1 parent 86d9c9b commit 356b236
Showing 6 changed files with 222 additions and 113 deletions.
44 changes: 42 additions & 2 deletions README.md
@@ -1,4 +1,44 @@
# go-hibp-sync

`go-hibp-sync` provides functionality to keep a local copy of the HIBP leaked password database in sync with the upstream version at https://haveibeenpowned.com.
In addition to syncing the database, the library allows exporting it into a single list — the former distribution format of the database.
`go-hibp-sync` provides functionality to keep a local copy of the *HIBP leaked password database* in sync with the upstream version at [https://haveibeenpwned.com](https://haveibeenpwned.com).
In addition to syncing the "database", the library allows exporting it into a single list (the former distribution format of the database) and querying it for a given *k-proximity range*.

This local copy consists of one file per range/prefix, grouped into `256` directories (named after the first `2` of the `5` prefix characters).
As an uncompressed copy of the database would currently require around `40 GiB` of disk space, a moderate level of `zstd` compression is applied, which cuts storage consumption by `50%`.
Compression can be disabled if its small computational overhead outweighs the benefit of requiring only half the space.

To avoid unnecessary network transfers and to speed things up, `go-hibp-sync` additionally keeps the `etag` returned by the upstream CDN.
Subsequent requests send it, which should allow for more frequent syncs that do not necessarily result in full re-downloads.
Of course, this can be disabled too.

The library supports resuming from where it left off; the `sync` command mentioned below demonstrates this.

The basic API is simple: two functions are exported, along with typed configuration options:

```go
Sync(options ...SyncOption) error // Syncs the local copy with the upstream database
Export(w io.Writer, options ...ExportOption) error // Writes a continuous, decompressed and "free-of-etags" stream to the given io.Writer
```

The library can also query its local data using the `RangeAPI` type and its `Query` method.
This operates on disk but, depending on the storage medium, should provide access times that are good enough for most scenarios.
A memory-based `tmpfs` will speed things up when necessary.

```go
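querier := NewRangeAPI(/* optional options go here */)
kProximityResponse, err := querier.Query("ABCDE")
// TODO: Handle error (return before using the response)
defer kProximityResponse.Close() // Query returns an io.ReadCloser; closing it releases the read lock on the file
// TODO: Read the response (in the same format as previously received from the upstream API) line-by-line and check whether it contains your hash.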
```

There are two basic CLI commands, `sync` and `export`, that can be used for manual tasks and serve as minimal examples of how to use the library.
They are basic but should play well with other tooling.
`sync` tracks its progress and is able to continue from where it last left off.

Run them with:

```bash
go run github.com/exaring/go-hibp-sync/cmd/sync
# and
go run github.com/exaring/go-hibp-sync/cmd/export
```
3 changes: 3 additions & 0 deletions cmd/export/main.go
@@ -1,3 +1,6 @@
// Small utility to export the HIBP data to stdout.
// Expects the data to be available in the default data directory or in the directory specified as the first argument.
// Data is expected to be compressed.
package main

import (
12 changes: 10 additions & 2 deletions cmd/sync/main.go
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
// Small utility to sync the HIBP data to the default data directory or to the directory specified as the
// first argument.
// The data will be stored applying zstd compression.
// The tool keeps track of progress and is able to continue from where it left off in case syncing
// needs to be interrupted.
package main

import (
@@ -64,11 +69,14 @@ func run(dataDir string) error {
return nil
}

if err := hibpsync.Sync(hibpsync.SyncWithDataDir(dataDir), hibpsync.SyncWithProgressFn(updateProgressBar), hibpsync.SyncWithStateFile(stateFile)); err != nil {
if err := hibpsync.Sync(
hibpsync.SyncWithDataDir(dataDir),
hibpsync.SyncWithProgressFn(updateProgressBar),
hibpsync.SyncWithStateFile(stateFile)); err != nil {
return fmt.Errorf("syncing: %w", err)
}

// Explicitly close the file because otherwise we cannot remove it
// Explicitly close the file because otherwise we cannot remove it in the next step
stateFile.Close()

if err := os.Remove(stateFilePath); err != nil {
107 changes: 0 additions & 107 deletions config.go

This file was deleted.

26 changes: 24 additions & 2 deletions lib.go
@@ -14,14 +14,26 @@ import (

const (
DefaultDataDir = "./.hibp-data"
DefaultStateFileName = "state"
defaultEndpoint = "https://api.pwnedpasswords.com/range/"
defaultWorkers = 50
DefaultStateFileName = "state"
defaultLastRange = 0xFFFFF
)

// ProgressFunc represents a type of function that can be used to report progress of a sync operation.
// The parameters are as follows:
// - lowest: The lowest prefix that has been processed so far (due to concurrent operations, there is a window of
// prefixes that are possibly being processed at the same time, "lowest" refers to the range with the lowest prefix).
// - current: The current prefix that is being processed, i.e. for which the ProgressFunc gets invoked.
// - to: The highest prefix that will be processed.
// - processed: The number of prefixes that have been processed so far.
// - remaining: The number of prefixes that are remaining to be processed.
// The function should return an error if the operation should be aborted.
type ProgressFunc func(lowest, current, to, processed, remaining int64) error

// Sync copies the ranges, i.e., the HIBP data, from the upstream API to the local storage.
// The function will start from the lowest prefix and continue until the highest prefix.
// See the set of SyncOption functions for customizing the behavior of the sync operation.
func Sync(options ...SyncOption) error {
config := &syncConfig{
commonConfig: commonConfig{
@@ -70,6 +82,9 @@ func Sync(options ...SyncOption) error {
return sync(config.ctx, from, config.lastRange+1, client, storage, pool, config.progressFn)
}

// Export writes the HIBP data to the given writer.
// The data is written in the same format as it is provided by the HIBP API itself.
// See the set of ExportOption functions for customizing the behavior of the export operation.
func Export(w io.Writer, options ...ExportOption) error {
config := &exportConfig{
commonConfig: commonConfig{
@@ -86,11 +101,14 @@ func Export(w io.Writer, options ...ExportOption) error {
return export(0, defaultLastRange+1, storage, w)
}

// RangeAPI provides an API for querying the local HIBP data.
type RangeAPI struct {
storage storage
}

func NewRangeAPI(options ...QueryOption) *RangeAPI {
// NewRangeAPI creates a new RangeAPI instance that can be used for querying k-proximity ranges.
// See the set of RangeAPIOption functions for customizing the behavior of the RangeAPI.
func NewRangeAPI(options ...RangeAPIOption) *RangeAPI {
config := &queryConfig{
commonConfig: commonConfig{
dataDir: DefaultDataDir,
@@ -106,6 +124,10 @@ func NewRangeAPI(options ...QueryOption) *RangeAPI {
}
}

// Query queries the local HIBP data for the given prefix.
// The function returns an io.ReadCloser that can be used to read the data.
// It is the responsibility of the caller to close the returned io.ReadCloser; this should happen as soon as
// possible to release the read lock on the file.
func (q *RangeAPI) Query(prefix string) (io.ReadCloser, error) {
reader, err := q.storage.LoadData(prefix)
if err != nil {
143 changes: 143 additions & 0 deletions options.go
@@ -0,0 +1,143 @@
package hibpsync

import (
"context"
"io"
)

type commonConfig struct {
dataDir string
noCompression bool
}

type syncConfig struct {
commonConfig
ctx context.Context
endpoint string
minWorkers int
progressFn ProgressFunc
stateFile io.ReadWriteSeeker
lastRange int64
}

// SyncOption represents a type of function that can be used to customize the behavior of the Sync function.
type SyncOption func(config *syncConfig)

// SyncWithContext sets the context for the sync operation.
func SyncWithContext(ctx context.Context) SyncOption {
return func(c *syncConfig) {
c.ctx = ctx
}
}

// SyncWithDataDir sets the data directory for the sync operation.
// The directory will be created if it does not exist.
// Default: "./.hibp-data"
func SyncWithDataDir(dataDir string) SyncOption {
return func(c *syncConfig) {
c.dataDir = dataDir
}
}

// SyncWithEndpoint sets a custom endpoint instead of the default HIBP API endpoint.
// Default: "https://api.pwnedpasswords.com/range/"
func SyncWithEndpoint(endpoint string) SyncOption {
return func(c *syncConfig) {
c.endpoint = endpoint
}
}

// SyncWithMinWorkers sets the minimum number of worker goroutines that will be used to process the ranges.
// Default: 50
func SyncWithMinWorkers(workers int) SyncOption {
return func(c *syncConfig) {
c.minWorkers = workers
}
}

// SyncWithStateFile sets the state file to be used for tracking progress.
// This can either be an os.File or any other implementation of io.ReadWriteSeeker.
// Seeking is only used to jump back to the start of the "virtual file".
// It should be easy enough to decorate a bytes.Buffer with the necessary methods to make it work.
// Default: nil, i.e., no state will be tracked.
func SyncWithStateFile(stateFile io.ReadWriteSeeker) SyncOption {
return func(c *syncConfig) {
c.stateFile = stateFile
}
}

// SyncWithProgressFn sets a custom progress function that will be called regularly.
// The function should return an error if the operation should be aborted.
// Note, there is no guarantee that the function will be called for every prefix.
// Default: no-op function
func SyncWithProgressFn(progressFn ProgressFunc) SyncOption {
return func(c *syncConfig) {
c.progressFn = progressFn
}
}

// SyncWithNoCompression disables compression for the sync operation.
// This seriously increases the amount of storage required.
// Default: false
func SyncWithNoCompression() SyncOption {
return func(c *syncConfig) {
c.noCompression = true
}
}

// SyncWithLastRange sets the last range to be processed.
// Aside from tests, this is rarely useful.
// Default: 0xFFFFF
func SyncWithLastRange(to int64) SyncOption {
return func(c *syncConfig) {
c.lastRange = to
}
}

type exportConfig struct {
commonConfig
}

// ExportOption represents a type of function that can be used to customize the behavior of the Export function.
type ExportOption func(*exportConfig)

// ExportWithDataDir sets the data directory for the export operation.
// Default: "./.hibp-data"
func ExportWithDataDir(dataDir string) ExportOption {
return func(c *exportConfig) {
c.dataDir = dataDir
}
}

// ExportWithNoCompression instructs the export operation to assume the local data is not compressed.
// This should be in sync with the configuration of the call to Sync.
// Default: false
func ExportWithNoCompression() ExportOption {
return func(c *exportConfig) {
c.noCompression = true
}
}

type queryConfig struct {
commonConfig
}

// RangeAPIOption represents a type of function that can be used to customize the behavior of the RangeAPI constructor.
type RangeAPIOption func(*queryConfig)

// QueryWithDataDir sets the data directory for the RangeAPI.
// Default: "./.hibp-data"
func QueryWithDataDir(dataDir string) RangeAPIOption {
return func(c *queryConfig) {
c.dataDir = dataDir
}
}

// QueryWithNoCompression instructs the RangeAPI to assume the local data is not compressed.
// This should be in sync with the configuration of the call to Sync.
// Default: false
func QueryWithNoCompression() RangeAPIOption {
return func(c *queryConfig) {
c.noCompression = true
}
}
