
PQAI Storage

This service stores the original copies of documents, which may be retrieved during the search pipeline.

These documents need not just be patents. They may also be research papers, invention disclosures, etc. Storage of images (patent drawings) is also supported.

Most patent retrieval systems need to store document (e.g. patent) data in some form. Depending on the use case, the system may store this data in machine-readable form (e.g., JSON or XML) or in human-readable form (e.g., PDF files). This service provides a way to store and retrieve arbitrary documents in any form.

The actual storage medium can take many forms. The current implementation supports three types:

  1. Local storage (on a hard disk)
  2. Cloud storage (on S3)
  3. Database (on MongoDB)

The service can be configured to use any one of the above storage mediums or to combine them flexibly. For instance, it's possible to store the bibliographic data of documents in a database, their full text as plain JSON files on the local disk, and their PDFs in the cloud.

Users of this service are not exposed to the storage details, although latency may differ from one storage medium to another (a local database would typically be fastest, cloud storage slowest).
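For illustration, such a hybrid setup might combine the three wrappers documented below roughly as follows. This is a sketch only, assuming the environment variables described later on this page; all identifiers and paths are hypothetical:

import os
import boto3
from pymongo import MongoClient
from core.storage import LocalStorage, S3Bucket, MongoDB

# Bibliographic records in MongoDB, full text on local disk, PDFs on S3
mongo_client = MongoClient(os.environ["MONGO_HOST"], int(os.environ["MONGO_PORT"]))
biblio = MongoDB(mongo_client, "pqai", "bibliography", "publicationNumber")
fulltext = LocalStorage("/home/ubuntu/documents")
s3 = boto3.client("s3")  # credentials are read from the environment
pdfs = S3Bucket(s3, os.environ["AWS_S3_BUCKET_NAME"])

record = biblio.get("US7654321B2")              # bibliographic data (bytes)
document = fulltext.get("US7654321B2.json")     # full text (bytes)
drawing = pdfs.get("patents/US7654321B2.pdf")   # PDF copy (bytes)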

Code structure

root
  |-- core
        |-- storage.py				// defines wrappers on storage types
  |-- tests
        |-- test_server.py			// Tests for the REST API
        |-- test_storage.py			// Tests for storage module
  |-- main.py						// Defines the REST API
  |
  |-- requirements.txt				// List of Python dependencies
  |
  |-- Dockerfile					// Docker files
  |-- docker-compose.yml
  |
  |-- env							// .env file template
  |-- deploy.sh						// Script for setting up on local system

Core modules

Storage

The storage module defines wrappers over different types (e.g. local, cloud, or database) of physical storage mediums.

These wrappers inherit their interface from an abstract class named Storage, which provides the following methods (sketched below):

  1. get: to fetch an item by its identifier
  2. ls: to list all items with a given prefix (which can be a path)
  3. exists: to check if an item with the given identifier exists in the storage or not
  4. remove: to delete an item
  5. put: to add a new item

Note that not all of the above functionality is exposed by the service via its public-facing REST API in the current implementation. Specifically, the ability to add and remove items is not exposed.
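
Put together, the Storage base class might look like the following minimal sketch (the exact signatures in storage.py may differ):

from abc import ABC, abstractmethod

class Storage(ABC):
    """Abstract interface implemented by all storage wrappers"""

    @abstractmethod
    def get(self, identifier):
        """Fetch an item (as bytes) by its identifier"""

    @abstractmethod
    def ls(self, prefix=""):
        """List identifiers of all items with the given prefix"""

    @abstractmethod
    def exists(self, identifier):
        """Check whether an item with the given identifier exists"""

    @abstractmethod
    def remove(self, identifier):
        """Delete an item"""

    @abstractmethod
    def put(self, identifier, contents):
        """Add a new item"""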

The actual wrappers are as follows:

LocalStorage

LocalStorage is a wrapper over the local file system. It is initialized with the absolute path to a local directory, given by the root argument. The root folder can contain further subfolders. Files are identified by their paths relative to the root folder.

Typical usage:

import json
from core.storage import LocalStorage

root = "/home/ubuntu/documents"
storage = LocalStorage(root)
contents = storage.get("US7654321B2.json") # bytes
patent_data = json.loads(contents) # dict
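
The remaining interface methods work the same way. For example (the put signature, like the file names here, is an assumption):

# Check for an item and list items sharing a prefix
if storage.exists("US7654321B2.json"):
    identifiers = storage.ls("US")  # identifiers starting with "US"

# Add a new item and delete it again (signature assumed)
storage.put("US1234567B2.json", b"{}")
storage.remove("US1234567B2.json")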

S3Bucket

This is a wrapper over an AWS S3 bucket. It is initialized with a boto3 S3 client and the name of the S3 bucket where the data is stored.

To use this wrapper, you need the appropriate environment variables listed in the .env file. The following three variables are required:

  1. AWS_ACCESS_KEY_ID
  2. AWS_SECRET_ACCESS_KEY
  3. AWS_S3_BUCKET_NAME

Typical usage:

import json
import os
import boto3
import botocore.config
from core.storage import S3Bucket

config = botocore.config.Config(
    read_timeout=400,
    connect_timeout=400,
    retries={"max_attempts": 0}
)
credentials = {
    "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
    "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"]
}
botoclient = boto3.client("s3", **credentials, config=config)
bucket_name = os.environ["AWS_S3_BUCKET_NAME"]
storage = S3Bucket(botoclient, bucket_name)
obj = storage.get("patents/US7654321B2.json") # bytes
patent_data = json.loads(obj) # dict

MongoDB

This is a wrapper over a MongoDB collection. It is initialized with an instance of pymongo.MongoClient, a database name, a collection name, and the name of the attribute that stores the document identifier (e.g. publicationNumber).

To use this wrapper, you need the appropriate variables listed in the .env file. The following variables are required:

  1. MONGO_HOST
  2. MONGO_PORT
  3. MONGO_USER
  4. MONGO_PASS
  5. MONGO_DB
  6. MONGO_COLL

If you are using a MongoDB installation on your local system, MONGO_HOST can be set to localhost.

Unless you have changed it, MongoDB listens on port 27017, which is the default value for MONGO_PORT.

Unless your MongoDB is protected with a username and password, you can leave the MONGO_USER and MONGO_PASS fields blank in the .env file.

If you are using the database dump supplied by PQAI, then MONGO_DB should be set to pqai and MONGO_COLL should be set to bibliography.

Typical usage:

import os
import json
from pymongo import MongoClient
from core.storage import MongoDB

# Connection parameters come from the .env file
host = os.environ["MONGO_HOST"]
port = int(os.environ["MONGO_PORT"])
client = MongoClient(host, port)

db = os.environ["MONGO_DB"]
coll = os.environ["MONGO_COLL"]
field = "publicationNumber"
storage = MongoDB(client, db, coll, field)
obj = storage.get("US7654321B2") # bytes
patent_data = json.loads(obj) # dict

Assets

The service does not require any assets as such.

For any realistic testing or experimentation, however, you will need to connect it to a data source. A good starting point is the US patent bibliography data, which you can download from the PQAI S3 bucket free of cost:

https://s3.amazonaws.com/pqai.s3/public/pqai-mongo-dump.tar.gz

The above data is in the form of a MongoDB database dump, which can be restored on your local system.
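
Assuming the tarball unpacks into a standard dump/ directory (the path is an assumption; check the archive contents), the restore looks like:

    tar -xzf pqai-mongo-dump.tar.gz
    mongorestore dump/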

Make sure you have at least 30 GB of free space on your system before downloading and restoring this database dump.

It contains bibliographic details (title, abstract, CPC classes, inventor and assignee names, citations, etc.) of about 13 million US patents and published applications. Once restored into the database, individual datapoints can be retrieved as JSON documents.
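
Once restored, a datapoint retrieved through the MongoDB wrapper might look like this (field names other than publicationNumber are illustrative; inspect the collection for the actual schema):

{
    "publicationNumber": "US7654321B2",
    "title": "...",
    "abstract": "...",
    "cpcs": ["..."],
    "inventors": ["..."],
    "assignees": ["..."],
    "citations": ["..."]
}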

Deployment

Prerequisites

The following instructions assume that you are running a Linux distribution and have Git and MongoDB installed on your system.

Setup

The easiest way to get this service up and running on your local system is to follow these steps:

  1. Clone the repository

    git clone https://github.com/pqaidevteam/pqai-db.git
    
  2. Using the env template in the repository, create a .env file and set the environment variables.

    cd pqai-db
    cp env .env
    nano .env
    
  3. Run the deploy.sh script.

    chmod +x deploy.sh
    bash ./deploy.sh
    

This will build a Docker image and run it as a container on the port number you specified in the .env file.

Alternatively, after following steps (1) and (2) above, you can run the service directly in a terminal with the command python main.py.
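
For reference, the variables mentioned on this page would make the .env file look roughly like the following (the name of the variable holding the service port is an assumption; consult the env template in the repository for the authoritative list):

    AWS_ACCESS_KEY_ID=<your-access-key-id>
    AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
    AWS_S3_BUCKET_NAME=<your-bucket-name>
    MONGO_HOST=localhost
    MONGO_PORT=27017
    MONGO_USER=
    MONGO_PASS=
    MONGO_DB=pqai
    MONGO_COLL=bibliography
    PORT=<service-port>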

Service dependency

This service is not dependent on any other PQAI service for its operation.

Dependent services

The following services depend on this service:

  • pqai-gateway