
Refactor ojd_daps_skills #223

Merged: 49 commits merged into dev on Aug 13, 2024
Conversation

@india-kerle (Collaborator) commented May 8, 2024

This PR is a major refactor of the current ojd_daps_skills library. It does a number of things:

  1. Remove unnecessary files: it removes a number of files that the library doesn't actually need. The plan here is to make a copy of the current dev branch so we don't lose anything from the previous iteration.
  2. Use poetry for dependency management: hopefully this will sort out some of the really tricky dependency issues that users are currently having.
  3. Use Pydantic for more enforceable type hints.
  4. Move model training out of this inference repo and into a separate repo, ojd_daps_language_models. This way, we're just pulling the latest models from the Hugging Face hub.
  5. Create configuration managers and use Pathlib: I've created two Config classes that help with downloading data from S3 and models from the Hugging Face hub, to hopefully avoid the "JobNer does not exist" drama.
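The Pydantic + Pathlib config idea (points 3 and 5) could look roughly like the sketch below. The class name, field names, and property are hypothetical illustrations, not the library's actual API; only the Hugging Face wheel URL pattern comes from the code later in this thread.

```python
from pathlib import Path

from pydantic import BaseModel


class ModelConfig(BaseModel):
    """Hypothetical config manager sketch; names are illustrative only."""

    # Hugging Face model identifier (namespace/model name)
    ner_model_name: str = "nestauk/en_skillner"
    # Pathlib-based local cache location
    cache_dir: Path = Path.home() / ".cache" / "ojd_daps_skills"

    @property
    def wheel_url(self) -> str:
        # Build the URL of the packaged spaCy model wheel on the hub
        namespace, name = self.ner_model_name.split("/")
        return (
            f"https://huggingface.co/{namespace}/{name}"
            f"/resolve/main/{name}-any-py3-none-any.whl"
        )


cfg = ModelConfig()
print(cfg.wheel_url)
```

Because Pydantic validates field types at construction time, passing e.g. a non-string model name fails loudly instead of surfacing later as a confusing "model does not exist" error.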

In addition to checking that the results are similar to last time's, a few things are still outstanding:

  • we need to add a LICENSE file
  • the documentation needs to be updated: the docs directory has been removed, so it needs to be re-added with updated information on how to use the library and updated model performance (which can be found on Hugging Face)
  • the pytest GitHub Actions need to be updated to also work on Windows; it is not as simple as adding windows-latest to the action
  • we will need to cut a release and publish to PyPI once merged, and we should also write an issue or otherwise communicate the changes to the library

@india-kerle india-kerle marked this pull request as draft May 8, 2024 16:30
@india-kerle india-kerle requested a review from lizgzil May 8, 2024 16:30
India Kerle added 2 commits May 10, 2024 10:14
@lizgzil (Collaborator) commented Jun 27, 2024

@Jack-Vines I've had a look at this PR and I think I've spotted the code issues that needed changing and changed them. I trained a new NER model and Multiskill model (and uploaded them to huggingface) after realising they were trained on an old (and smaller) version of the training data.

Still to do:

  • try to investigate quality changes?
  • Documentation/github actions that India mentions in this PR's description

Deep dive into the results pre and post refactor

I wanted to know whether the results changed before and after the refactor. So, using the same sample of 1000 OJO job adverts:

  • I applied the old (current dev) code, with the data downloaded from ojd_daps_skills_data_new (i.e. the 20230808 model);
  • I applied this new code.

TL;DR

  • The original (20230808 on S3) and refactored (the one on huggingface) models are different - they have the same random seeds and training data, but stochasticity/changes may have crept in. This isn't necessarily a bad thing.
  • Mapping extracted skills to ESCO seems to be exactly the same before and after the refactor 😄 .
  • There are more skills extracted in the refactored method than the original. This is likely explained by the following point.
  • Previously we removed duplicate skills in the list of mapped skills for each job advert. In the refactor I considered removing duplicates too, but decided to leave them in so the user can choose whether to remove them. This is important to remember.
  • I haven't got a sense of 'quality' - is the quality of the new results better/worse/the same as before?
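Since duplicates are now left in, downstream users can drop them themselves. A minimal sketch, assuming the output is a per-advert list of (extracted span, mapped ESCO skill) pairs; the real output format and the skill labels below are made up for illustration:

```python
# Hypothetical per-advert output: (extracted skill span, mapped ESCO skill)
# pairs, with duplicates kept, as the refactored library now does.
extracted = [
    ("Excellent Communication Skills", "communicate with others"),
    ("Mailchimp", "use e-mail software"),
    ("Excellent Communication Skills", "communicate with others"),
]

# Drop exact duplicates while preserving first-seen order
# (dict keys are insertion-ordered in Python 3.7+).
deduped = list(dict.fromkeys(extracted))
print(deduped)
```

Leaving deduplication to the user keeps frequency information (how often a skill is mentioned in one advert) available for those who want it.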

Full results

The results are not the same. Generally, more skills are extracted by the new method.
[screenshot]

Original number of skills: [5, 10, 16] (25%, 50%, 75% percentiles)
Refactored number of skills: [7, 12, 20]
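Percentile summaries like these can be computed with the standard library alone; the counts below are toy values for illustration, not the real per-advert skill counts:

```python
from statistics import quantiles

# Toy skill counts per job advert (illustrative only)
counts = [5, 7, 9, 10, 12, 14, 16, 18, 20, 22]

# quantiles(..., n=4) returns the 25%/50%/75% cut points,
# the same summary reported above
q25, q50, q75 = quantiles(counts, n=4)
print(q25, q50, q75)
```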

[screenshot]

The "top ESCO skills" extracted are similar, but not the same.

[screenshot]

[screenshot]

The number of occurrences of each unique ESCO skill is correlated across the two methods:
[screenshot]

📏 The lengths of the extracted skill spans are basically the same. In the original, the mean length was 29.4 characters with quartiles of [14, 23, 38]; in the new version, the mean length is 30.8 with quartiles of [14, 25, 40].

🎉 The matching is the same though. If the same skill entity is extracted from both methods, then they are always matched to the same ESCO skill.
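That consistency check can be sketched as a join on the extracted entity text. The entity-to-ESCO dictionaries below are made up for illustration, not real output:

```python
# Hypothetical entity -> mapped ESCO skill dictionaries from the two runs
original_map = {
    "Mailchimp": "use e-mail software",
    "Wordpress": "manage websites",
}
refactored_map = {
    "Mailchimp": "use e-mail software",
    "SendInBlue": "use e-mail software",
}

# For entities extracted by BOTH methods, check the ESCO mapping agrees
shared = set(original_map) & set(refactored_map)
all_agree = all(original_map[e] == refactored_map[e] for e in shared)
print(shared, all_agree)
```

Entities only one method extracted (like "Wordpress" or "SendInBlue" here) are excluded from the comparison, which matches the claim above: *if* the same entity is extracted by both, the mapping is identical.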

🔬 The models are different. I tested this with the following code:

```python
import os
import spacy

job_ad = "[ Marketing Communications Executive Full Time (Hybrid) Location: Crewe Base Salary: Up to £32k per annum DOE Are you an experienced Marketing Communications Executive looking for a new role with a fantastic company? If so, we have the perfect new role for you! The focus of the role is marketing automations, set up and execution of campaigns via automations, focus  on digital but some direct mail and some social (Facebook). The Role:  Campaign ManagementWrite Briefing DocumentsCollaborate with different teamsAssist with Creating a Brand Marketing CalendarManager Social Media ChannelsSupport in Raising PO'SProduce Monthly NewsletterExecute Email Marketing CampaignsReporting of Campaign Effectiveness You:  Educated to a higher level (Ideally in Marketing)Excellent Communication SkillsExperience Working with StakeholdersPrevious Experience of Email Software (Active Campaigns, Mailchimp, SendInBlue, Capterra etc)Knowledge of Wordpress is an advantage. ]"

## Use the model without any infrastructure around it
ner_model_name = "nestauk/en_skillner"
namespace, ner_name = ner_model_name.split("/")

# Install the packaged spaCy model wheel from the Hugging Face hub
os.system(f"pip install https://huggingface.co/{namespace}/{ner_name}/resolve/main/{ner_name}-any-py3-none-any.whl")
hf_nlp = spacy.load(ner_name)

doc = hf_nlp(job_ad)
[(ent.text, ent.label_) for ent in doc.ents]
# >>> [("Raising PO'SProduce Monthly NewsletterExecute Email", 'SKILL'), ('Educated to a higher level (Ideally in Marketing)Excellent Communication SkillsExperience Working with StakeholdersPrevious Experience of Email Software', 'EXPERIENCE'), ('Active Campaigns', 'SKILL'), ('Mailchimp', 'SKILL'), ('SendInBlue', 'SKILL'), ('Wordpress', 'SKILL')]

## The old way
# After downloading the `ojd_daps_skills_data_new` folder from the public
# S3 bucket (confusingly called 'new')
old_nlp = spacy.load("ojd_daps_skills_data_new/outputs/models/ner_model/20230808/")
old_doc = old_nlp(job_ad)
[(ent.text, ent.label_) for ent in old_doc.ents]
# >>> [('marketing automations', 'SKILL'), ('direct mail and some social (Facebook)', 'SKILL'), ('NewsletterExecute', 'SKILL'), ('Educated to a higher level', 'EXPERIENCE'), ('Email Software', 'SKILL'), ('Mailchimp', 'SKILL'), ('Capterra etc)Knowledge', 'SKILL')]
```

@Jack-Vines Jack-Vines marked this pull request as ready for review August 13, 2024 18:39
@Jack-Vines Jack-Vines merged commit b88392e into dev Aug 13, 2024
1 check passed