Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: SITES-23133 [Importer] Report on the progress of a import job via a new jobs endpoint #508

Merged
merged 13 commits into from
Sep 24, 2024
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2,862 changes: 1,441 additions & 1,421 deletions docs/index.html

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/openapi/api.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,8 @@ paths:
$ref: './tools-api.yaml#/import'
/tools/import/jobs/{jobId}:
$ref: './tools-api.yaml#/import-job-status'
/tools/import/jobs/{jobId}/progress:
$ref: './tools-api.yaml#/import-job-progress'
/tools/import/jobs/{jobId}/result:
$ref: './tools-api.yaml#/import-job-result'

Expand Down
37 changes: 32 additions & 5 deletions docs/openapi/schemas.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1311,19 +1311,22 @@ ImportJobStatus:
- 'COMPLETE'
- 'FAILED'
startTime:
description: The date and time when the import job started
description: The date and time when the import job started.
$ref: '#/DateTime'
endTime:
description: The date and time when the import job ended, if finished
description: The date and time when the import job ended, if finished.
$ref: '#/DateTime'
urlCount:
description: The total number of URLs in the job
description: The total number of URLs in the job.
type: integer
successCount:
description: The number of URLs successfully imported
description: The number of URLs successfully imported.
type: integer
failedCount:
description: The number of URLs that failed to import
description: The number of URLs that failed to import.
type: integer
redirectCount:
description: The number of URLs that were redirected.
type: integer
example:
id: '12d6ac8e-d5e4-4788-90eb-b69e10e745fd'
Expand All @@ -1332,6 +1335,30 @@ ImportJobStatus:
urlCount: 25
successCount: 15
failedCount: 0
ImportJobProgress:
type: object
properties:
pending:
type: integer
description: The number of pending url jobs still to process.
redirect:
type: integer
description: The number of urls that resulted in a redirect.
running:
type: integer
description: The number of url jobs that are running.
completed:
type: integer
description: The number of url jobs that have completed.
failed:
type: integer
description: The number of url jobs that have failed.
example:
pending: 0
redirect: 0
running: 0
completed: 1
failed: 0
ImportJobArchive:
type: object
properties:
Expand Down
33 changes: 33 additions & 0 deletions docs/openapi/tools-api.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,39 @@ import-job-status:
$ref: './responses.yaml#/404'
'500':
$ref: './responses.yaml#/500'
import-job-progress:
get:
tags:
- import
summary: Query the progress of an import job
description: |
This endpoint can be used to query the progress of an import job, given its jobId.
operationId: getImportJobProgress
security:
- api_key: [ ]
parameters:
- $ref: './parameters.yaml#/jobId'
- in: header
name: x-api-key
schema:
type: string
description: |
Client-specific import API key. Must match the key used to start the import job.
responses:
'200':
description: The current progress of the import job.
content:
application/json:
schema:
$ref: './schemas.yaml#/ImportJobProgress'
'400':
$ref: './responses.yaml#/400'
'401':
$ref: './responses.yaml#/401'
'404':
$ref: './responses.yaml#/404'
'500':
$ref: './responses.yaml#/500'
import-job-result:
post:
tags:
Expand Down
9 changes: 4 additions & 5 deletions package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion package.json
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@
"@adobe/helix-shared-wrap": "2.0.2",
"@adobe/helix-status": "10.1.3",
"@adobe/helix-universal-logger": "3.0.20",
"@adobe/spacecat-shared-data-access": "1.44.4",
"@adobe/spacecat-shared-data-access": "1.44.5",
"@adobe/spacecat-shared-http-utils": "1.6.10",
"@adobe/spacecat-shared-ims-client": "1.3.16",
"@adobe/spacecat-shared-rum-api-client": "2.9.4",
Expand Down
63 changes: 57 additions & 6 deletions src/controllers/import.js
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,25 @@ import { ImportJobDto } from '../dto/import-job.js';
* @constructor
*/
function ImportController(context) {
/**
* The import controller has a number of scopes that are required to access different parts of the
* import functionality. These scopes are used to validate the authenticated user has the required
* level of access.
* The scopes are:
* READ: 'imports.read' - allows users to view import jobs
bhellema marked this conversation as resolved.
Show resolved Hide resolved
bhellema marked this conversation as resolved.
Show resolved Hide resolved
* WRITE: 'imports.write' - allows users to create new import jobs
* READ_ALL: 'imports.read_all' - allows users to view all import jobs
* WRITE_ALL_DOMAINS: 'imports.write_all_domains' - allows users to import across any domain
* @type {{READ: 'imports.read', WRITE_ALL_DOMAINS: 'imports.write_all_domains',
* READ_ALL: 'imports.read_all', WRITE: 'imports.write'}}
*/
const SCOPE = {
READ: 'imports.read',
WRITE: 'imports.write',
READ_ALL: 'imports.read_all',
WRITE_ALL_DOMAINS: 'imports.write_all_domains',
};

const {
dataAccess, sqs, s3, log, env, auth, attributes,
} = context;
Expand Down Expand Up @@ -85,7 +104,11 @@ function ImportController(context) {
}
}

function validateImportApiKey(importApiKey, scopes) {
/**
* Verify that the authenticated user has the required level of access scope.
* @param scopes a list of scopes to validate the user has access to.
*/
function validateAccessScopes(scopes) {
log.debug(`validating scopes: ${scopes}`);

try {
Expand Down Expand Up @@ -120,7 +143,7 @@ function ImportController(context) {

try {
// The API scope imports.write is required to create a new import job
validateImportApiKey(importApiKey, ['imports.write']);
validateAccessScopes([SCOPE.WRITE]);
validateRequestData(multipartFormData);

const { authInfo: { profile } } = attributes;
Expand All @@ -137,10 +160,14 @@ function ImportController(context) {
const {
urls, options, importScript, customHeaders,
} = multipartFormData;

// Merge the import configuration options with the request options allowing the user options
// to override the defaults
const mergedOptions = {
...importConfiguration.options,
...options,
};

const job = await importSupervisor.startNewJob(
urls,
importApiKey,
Expand Down Expand Up @@ -174,11 +201,11 @@ function ImportController(context) {
* @returns {Promise<Response>} 200 OK with a JSON representation of the import jobs.
*/
async function getImportJobsByDateRange(requestContext) {
const { startDate, endDate, importApiKey } = parseRequestContext(requestContext);
const { startDate, endDate } = parseRequestContext(requestContext);
log.debug(`Fetching import jobs between startDate: ${startDate} and endDate: ${endDate}.`);

try {
validateImportApiKey(importApiKey, ['imports.read_all']);
validateAccessScopes([SCOPE.READ_ALL]);
validateIsoDates(startDate, endDate);
const jobs = await importSupervisor.getImportJobsByDateRange(startDate, endDate);
return ok(jobs.map((job) => ImportJobDto.toJSON(job)));
Expand All @@ -200,7 +227,7 @@ function ImportController(context) {

try {
// The API scope imports.read is required to get the import job status
validateImportApiKey(importApiKey, ['imports.read']);
validateAccessScopes([SCOPE.READ]);
const job = await importSupervisor.getImportJob(jobId, importApiKey);
return ok(ImportJobDto.toJSON(job));
} catch (error) {
Expand All @@ -221,7 +248,7 @@ function ImportController(context) {

try {
// The API scope imports.read is required to get the import job status
validateImportApiKey(importApiKey, ['imports.read']);
validateAccessScopes([SCOPE.READ]);
const job = await importSupervisor.getImportJob(jobId, importApiKey);
const downloadUrl = await importSupervisor.getJobArchiveSignedUrl(job);
return ok({
Expand All @@ -234,10 +261,34 @@ function ImportController(context) {
}
}

/**
* Get the progress of an import job. Results are broken down into the following:
* - complete: URLs that have been successfully imported.
* - failed: URLs that have failed to import.
* - pending: URLs that are still being processed.
* - redirected: URLs that have been redirected.
* - running - URLs that are currently being imported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will running ever be non-zero? Since we do not have access to the data layer from the content-scraper, we do not update the DB when we begin processing a URL.

Copy link
Contributor Author

@bhellema bhellema Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is interesting. We could do two things I guess with this.

  1. remove running as it will never be non zero.
  2. implement a new feature that will allow the content-scraper to message back / use a lamda to update the status of the import url job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in constantly updating this field?

  • Currently, it will always be 0 or 1 since we have configured only a single Lambda invocation for the import queues, so effectively sequential processing
  • The cost of additional writes to the DB each time a new URL begins processing will add up for larger jobs

Copy link
Contributor Author

@bhellema bhellema Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is value at this point updating the field, at this point. I think at this point I should just eliminate the concept of running from the api-service. However if we start running jobs in parallel, then I think it might be interesting to revisit this.

* @param requestContext - Context of the request.
* @return {Promise<Response|*>} 200 OK with a JSON representation of the import job progress.
*/
async function getImportJobProgress(requestContext) {
const { jobId, importApiKey } = parseRequestContext(requestContext);

try {
validateAccessScopes([SCOPE.READ]);
const progress = await importSupervisor.getImportJobProgress(jobId, importApiKey);
return ok(progress);
} catch (error) {
log.error(`Failed to fetch the import job progress: ${error.message}`);
return createErrorResponse(error);
}
}

return {
createImportJob,
getImportJobStatus,
getImportJobResult,
getImportJobProgress,
getImportJobsByDateRange,
};
}
Expand Down
4 changes: 0 additions & 4 deletions src/dto/import-job.js
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,6 @@ export const ImportJobDto = {
endTime: importJob.getEndTime(),
duration: importJob.getDuration(),
status: importJob.getStatus(),
urlCount: importJob.getUrlCount(),
successCount: importJob.getSuccessCount(),
failedCount: importJob.getFailedCount(),
importQueueId: importJob.getImportQueueId(),
initiatedBy: importJob.getInitiatedBy(),
}),
};
1 change: 1 addition & 0 deletions src/routes/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ export default function getRouteHandlers(
'GET /trigger': triggerHandler,
'POST /tools/import/jobs': importController.createImportJob,
'GET /tools/import/jobs/:jobId': importController.getImportJobStatus,
'GET /tools/import/jobs/:jobId/progress': importController.getImportJobProgress,
'POST /tools/import/jobs/:jobId/result': importController.getImportJobResult,
'GET /tools/import/jobs/by-date-range/:startDate/:endDate/all-jobs': importController.getImportJobsByDateRange,
};
Expand Down
Loading
Loading