Hrflow indeed

Made by HADJI KHALIL (aka H-ADJI)

Objective

Scripts that collect job postings from the Indeed job board. The collected jobs are then indexed in Hrflow's internal databases using Python and the Hrflow APIs.

High level architecture

Setup

Add a .env file in the root of the project with the following variables.

API_KEY="YOUR_API_KEY"
USER_EMAIL="YOUR_EMAIL"
BOARD_KEY="YOUR_BOARD_KEY"
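
As a minimal sketch of how these variables could be read and validated at startup, the snippet below parses a .env file with the standard library only. The helper names (load_env_file, check_required) are hypothetical and are not taken from this project, which may well rely on a dedicated library instead.

```python
from pathlib import Path

# The three variables listed above; everything else in this sketch is illustrative.
REQUIRED_KEYS = ("API_KEY", "USER_EMAIL", "BOARD_KEY")


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY="VALUE" lines from a .env file into a dictionary."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and strip surrounding quotes.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def check_required(values: dict[str, str]) -> list[str]:
    """Return the names of required variables that are missing or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling check_required(load_env_file()) before starting the scraper gives an early, readable error instead of a failed API call later on.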

--> Using your local python environment

IMPORTANT: Python >= 3.9 is required to support all asyncio features.
Install the packages and dependencies listed in requirements.txt:

pip install -r requirements.txt

Install a browser for Playwright (Chromium here, or any other browser Playwright supports):

playwright install chromium

Run the code.

python main.py

--> Using docker runtime environment

IMPORTANT: Docker with the new CLI (>= 1.13) is required.
On Unix-based systems (where make is available), run the following command:

make

For Windows-based systems (you should rethink your career as a programmer), use the Docker Desktop GUI to build the image and run the container.

Tools used

Playwright : Browser automation tool made by Microsoft, chosen for its performance, modern API and asyncio support.
Playwright-stealth : Library that patches the browser properties exposed to page scripts, so the automated browser looks like a regular user's browser and is harder to fingerprint.
Parsel : HTML parsing library used by Scrapy. It is built on lxml, whose C backend (libxml2) makes parsing very fast.
ChompJS : Library that parses JavaScript objects into Python dictionaries. It accepts object literals that are not valid JSON (for example unquoted keys or single-quoted strings), which the standard-library json module rejects.
Asyncio : Coroutine and event-loop based concurrency from the Python standard library, well suited to IO-heavy applications.
Hrflow : Hrflow SDK used to communicate with the parsing and indexing APIs.
Loguru : Simple and beautiful logging.
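
To illustrate why asyncio suits an IO-heavy scraper, here is a minimal sketch of bounded concurrency with a semaphore. The function names and the simulated delay are assumptions made for illustration; in the real project the body of fetch_job would be a Playwright page visit.

```python
import asyncio


async def fetch_job(url: str, limiter: asyncio.Semaphore) -> str:
    """Hypothetical stand-in for one IO-bound task, such as visiting a job page."""
    async with limiter:
        # Simulated network latency; a real implementation would await the browser here.
        await asyncio.sleep(0.01)
        return f"fetched {url}"


async def fetch_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    """Run all fetches concurrently, with at most max_concurrency in flight."""
    limiter = asyncio.Semaphore(max_concurrency)
    # asyncio.gather returns results in the same order as the input awaitables.
    return await asyncio.gather(*(fetch_job(url, limiter) for url in urls))
```

The semaphore keeps the number of simultaneous requests low, which matters when the target site rate-limits or blocks aggressive clients.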

Deployment

The program was bundled using Docker, and the resulting image was deployed on a Docker-optimized VM on GCP.

Scraping flow

Click the location field to get suggestions
Get the search parameter from the home page
Choose a location
Navigate to the job list
Paginate until there is no next page
Handle the popup that appears when navigating between pages
Extract each page's HTML
Parse the HTML to get the job URLs
Visit each job page
Extract the job data from the JSON in a script tag
Merge the data from the feed with the data from the job page
Done
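
The "paginate until no next page" step of the flow above can be sketched as a small generator. This is an illustration under stated assumptions, not the project's code: fetch_page is a hypothetical callable that returns a page's HTML together with the URL of the next page (or None), and the max_pages cap and the loop guard are additions for safety.

```python
from typing import Callable, Iterator, Optional

# (html, next_page_url or None), as returned by the hypothetical fetch_page callable.
Page = tuple[str, Optional[str]]


def paginate(
    start_url: str,
    fetch_page: Callable[[str], Page],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield the HTML of each result page until there is no next page."""
    url: Optional[str] = start_url
    seen: set[str] = set()
    for _ in range(max_pages):
        # Stop when there is no next page, or when the site links back to a visited page.
        if url is None or url in seen:
            break
        seen.add(url)
        html, url = fetch_page(url)
        yield html
```

Keeping the fetching logic behind a callable makes the loop easy to test with canned pages, without launching a browser.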

Data fields

The job feed

The following fields will be extracted from indeed job feed :

in platform job id
title
url
company name
company rating
company_location
salary (raw format)
job_type / employment_type
shift
work model : remote / in-person / hybrid (computed from location)
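
The work model is described above as computed from the location. A minimal sketch of such a rule is shown below; the keyword matching is an assumption about how the location text might look, not the project's actual logic.

```python
def work_model(location: str) -> str:
    """Classify a raw location string as 'remote', 'hybrid' or 'in-person'.

    Assumption: the location text contains the words "hybrid" or "remote"
    when they apply. The real rule used by the scraper may differ.
    """
    text = location.lower()
    # Check "hybrid" first, because a string like "Hybrid remote in Austin, TX" contains both words.
    if "hybrid" in text:
        return "hybrid"
    if "remote" in text:
        return "remote"
    return "in-person"
```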

The job page

The following fields will be extracted from a job page :

in platform job id
title
description
job location
company_name
date posted
valid until
job_type / employment_type (full list)
salary (detailed infos)
job benefits
company description
company logo url
company name
company indeed profile
company indeed reviews
company average rating / same as from feed
company rating count
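
As a rough sketch of how job page data could be pulled from JSON embedded in a script tag and then merged with the feed data, the example below uses only the standard library. Several things here are assumptions: the script type application/ld+json, the class and function names, and the merge rule (non-empty job page values override feed values). The project itself lists Parsel and ChompJS for this work.

```python
import json
from html.parser import HTMLParser


class ScriptJSONParser(HTMLParser):
    """Collect JSON payloads found in <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_target = False
        self._chunks: list[str] = []
        self.payloads: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_target = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_target:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_target:
            self._in_target = False
            try:
                self.payloads.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Ignore script tags whose content is not valid JSON.


def extract_job_json(html: str) -> list[dict]:
    """Return every JSON payload found in the matching script tags."""
    parser = ScriptJSONParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


def merge_job(feed_item: dict, page_item: dict) -> dict:
    """Merge feed and job page data; non-empty job page values take precedence."""
    merged = dict(feed_item)
    for key, value in page_item.items():
        if value not in (None, "", [], {}):
            merged[key] = value
    return merged
```

Letting the job page win on conflicts reflects that it carries the more detailed version of shared fields such as salary and job type, while feed-only fields such as shift are preserved.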

Repository: H-ADJI/hrflow-indeed-connector

Folders and files

Latest commit

History

Repository files navigation

Hrflow indeed

Objectif

High level architecture

Setup

--> Using your local python environment

--> Using docker runtime environment

Tools used

Deployement

Scraping flow

Data fields

The job feed

The job page

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages