
Refactor ojd_daps_skills #223

Merged: 49 commits merged into dev on Aug 13, 2024
Conversation

@india-kerle (Collaborator) commented May 8, 2024

This PR is a major refactor of the current ojd_daps_skills library. It does a number of things:

  1. Remove unnecessary files: it removes a number of files that the library doesn't actually need. The plan here is to make a copy of the current dev branch so we don't lose anything from the previous iteration.
  2. Use poetry for dependency management: hopefully this will sort out some of the really tricky dependency issues that users are currently having.
  3. Use Pydantic for more enforceable type hints.
  4. Move model training out of this inference repo and into a separate repo, ojd_daps_language_models. This way, we're just pulling the latest models from the Hugging Face hub.
  5. Create configuration managers and use Pathlib: I've created two Config classes that help with downloading data from S3 and models from the Hugging Face hub, to hopefully avoid the "JobNer does not exist" drama.
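The Pydantic + Pathlib config idea (points 3 and 5) could look roughly like the sketch below. The class name, field names, and property are hypothetical illustrations, not the library's actual API; only the Hugging Face wheel URL pattern comes from the code later in this thread.

```python
from pathlib import Path

from pydantic import BaseModel


class ModelConfig(BaseModel):
    """Hypothetical config manager sketch; names are illustrative only."""

    # Hugging Face model identifier (namespace/model name)
    ner_model_name: str = "nestauk/en_skillner"
    # Pathlib-based local cache location
    cache_dir: Path = Path.home() / ".cache" / "ojd_daps_skills"

    @property
    def wheel_url(self) -> str:
        # Build the URL of the packaged spaCy model wheel on the hub
        namespace, name = self.ner_model_name.split("/")
        return (
            f"https://huggingface.co/{namespace}/{name}"
            f"/resolve/main/{name}-any-py3-none-any.whl"
        )


cfg = ModelConfig()
print(cfg.wheel_url)
```

Because Pydantic validates field types at construction time, passing e.g. a non-string model name fails loudly instead of surfacing later as a confusing "model does not exist" error.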

In addition to checking that the results are similar to last time's, a few things are still outstanding:

  • we need to add a LICENSE file
  • the documentation needs to be updated: the docs directory has been removed, so it needs to be re-added with updated information on how to use the library and updated model performance (which can be found on Hugging Face)
  • the pytest GitHub Actions need to be updated to also work on Windows; it is not as simple as adding windows-latest to the action
  • we will need to cut a release and publish to PyPI once merged, and we should also write an issue or otherwise communicate the changes to the library

@india-kerle india-kerle marked this pull request as draft May 8, 2024 16:30
@india-kerle india-kerle requested a review from lizgzil May 8, 2024 16:30
India Kerle added 2 commits May 10, 2024 10:14
@lizgzil (Collaborator) commented Jun 27, 2024

@Jack-Vines I've had a look at this PR and I think I've spotted the code issues that needed changing and changed them. I trained a new NER model and Multiskill model (and uploaded them to huggingface) after realising they were trained on an old (and smaller) version of the training data.

Still to do:

  • try to investigate quality changes?
  • Documentation/github actions that India mentions in this PR's description

Deep dive into the results pre and post refactor

I wanted to know whether the results changed before and after the refactor. So, using the same sample of 1000 OJO job adverts:

  • I applied the old (current dev) code, with the data downloaded from ojd_daps_skills_data_new (i.e. the 20230808 model);
  • I applied this new code.

TL;DR

  • The original (20230808 on S3) and refactored (the one on huggingface) models are different - they have the same random seeds and training data, but stochasticity/changes may have crept in. This isn't necessarily a bad thing.
  • Mapping extracted skills to ESCO seems to be exactly the same before and after the refactor 😄 .
  • There are more skills extracted in the refactored method than the original. This is likely explained by the following point.
  • Previously we removed duplicate skills in the list of mapped skills for each job advert. In the refactor I considered removing duplicates too, but decided to leave them in so the user can choose whether to remove them. This is important to remember.
  • I haven't got a sense of 'quality' - is the quality of the new results better/worse/the same as before?
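Since duplicates are now left in, downstream users can drop them themselves. A minimal sketch, assuming the output is a per-advert list of (extracted span, mapped ESCO skill) pairs; the real output format and the skill labels below are made up for illustration:

```python
# Hypothetical per-advert output: (extracted skill span, mapped ESCO skill)
# pairs, with duplicates kept, as the refactored library now does.
extracted = [
    ("Excellent Communication Skills", "communicate with others"),
    ("Mailchimp", "use e-mail software"),
    ("Excellent Communication Skills", "communicate with others"),
]

# Drop exact duplicates while preserving first-seen order
# (dict keys are insertion-ordered in Python 3.7+).
deduped = list(dict.fromkeys(extracted))
print(deduped)
```

Leaving deduplication to the user keeps frequency information (how often a skill is mentioned in one advert) available for those who want it.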

Full results

The results are not the same. Generally, more skills are extracted by the new method.
[screenshot]

Original number of skills: [5, 10, 16] (25%, 50%, 75% percentiles)
Refactored number of skills: [7, 12, 20]
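Percentile summaries like these can be computed with the standard library alone; the counts below are toy values for illustration, not the real per-advert skill counts:

```python
from statistics import quantiles

# Toy skill counts per job advert (illustrative only)
counts = [5, 7, 9, 10, 12, 14, 16, 18, 20, 22]

# quantiles(..., n=4) returns the 25%/50%/75% cut points,
# the same summary reported above
q25, q50, q75 = quantiles(counts, n=4)
print(q25, q50, q75)
```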

[screenshot]

The "top ESCO skills" extracted are similar, but not the same.

[screenshot]

[screenshot]

The number of occurrences of each unique ESCO skill is correlated across the two methods:
[screenshot]

📏 The lengths of the extracted skill spans are basically the same. In the original, the mean length was 29.4 characters with quartiles of [14, 23, 38]; in the new version, the mean length is 30.8 with quartiles of [14, 25, 40].

🎉 The matching is the same though. If the same skill entity is extracted from both methods, then they are always matched to the same ESCO skill.
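That consistency check can be sketched as a join on the extracted entity text. The entity-to-ESCO dictionaries below are made up for illustration, not real output:

```python
# Hypothetical entity -> mapped ESCO skill dictionaries from the two runs
original_map = {
    "Mailchimp": "use e-mail software",
    "Wordpress": "manage websites",
}
refactored_map = {
    "Mailchimp": "use e-mail software",
    "SendInBlue": "use e-mail software",
}

# For entities extracted by BOTH methods, check the ESCO mapping agrees
shared = set(original_map) & set(refactored_map)
all_agree = all(original_map[e] == refactored_map[e] for e in shared)
print(shared, all_agree)
```

Entities only one method extracted (like "Wordpress" or "SendInBlue" here) are excluded from the comparison, which matches the claim above: *if* the same entity is extracted by both, the mapping is identical.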

🔬 The models are different. I tested this with the following code:

```python
import os
import spacy

job_ad = "[ Marketing Communications Executive Full Time (Hybrid) Location: Crewe Base Salary: Up to £32k per annum DOE Are you an experienced Marketing Communications Executive looking for a new role with a fantastic company? If so, we have the perfect new role for you! The focus of the role is marketing automations, set up and execution of campaigns via automations, focus  on digital but some direct mail and some social (Facebook). The Role:  Campaign ManagementWrite Briefing DocumentsCollaborate with different teamsAssist with Creating a Brand Marketing CalendarManager Social Media ChannelsSupport in Raising PO'SProduce Monthly NewsletterExecute Email Marketing CampaignsReporting of Campaign Effectiveness You:  Educated to a higher level (Ideally in Marketing)Excellent Communication SkillsExperience Working with StakeholdersPrevious Experience of Email Software (Active Campaigns, Mailchimp, SendInBlue, Capterra etc)Knowledge of Wordpress is an advantage. ]"

## Use the model without any infrastructure around it
ner_model_name = "nestauk/en_skillner"
namespace, ner_name = ner_model_name.split("/")

# Install the packaged spaCy model wheel from the Hugging Face hub
os.system(f"pip install https://huggingface.co/{namespace}/{ner_name}/resolve/main/{ner_name}-any-py3-none-any.whl")
hf_nlp = spacy.load(ner_name)

doc = hf_nlp(job_ad)
[(ent.text, ent.label_) for ent in doc.ents]
# >>> [("Raising PO'SProduce Monthly NewsletterExecute Email", 'SKILL'), ('Educated to a higher level (Ideally in Marketing)Excellent Communication SkillsExperience Working with StakeholdersPrevious Experience of Email Software', 'EXPERIENCE'), ('Active Campaigns', 'SKILL'), ('Mailchimp', 'SKILL'), ('SendInBlue', 'SKILL'), ('Wordpress', 'SKILL')]

## The old way
# After downloading the `ojd_daps_skills_data_new` folder from the public
# S3 bucket (confusingly called 'new')
old_nlp = spacy.load("ojd_daps_skills_data_new/outputs/models/ner_model/20230808/")
old_doc = old_nlp(job_ad)
[(ent.text, ent.label_) for ent in old_doc.ents]
# >>> [('marketing automations', 'SKILL'), ('direct mail and some social (Facebook)', 'SKILL'), ('NewsletterExecute', 'SKILL'), ('Educated to a higher level', 'EXPERIENCE'), ('Email Software', 'SKILL'), ('Mailchimp', 'SKILL'), ('Capterra etc)Knowledge', 'SKILL')]
```

@Jack-Vines Jack-Vines marked this pull request as ready for review August 13, 2024 18:39
@Jack-Vines Jack-Vines merged commit b88392e into dev Aug 13, 2024
1 check passed