Scraping and Analysing Indeed Job Data
Have you ever looked at the job market and wanted better information? Wanted to know more about the number of jobs being advertised and the companies advertising them, without having to gather all the data manually? Being a data analyst gives me the skills and tools to obtain the data and answer such questions, so today I am going to show how it can be done using Python.
I selected indeed.com as it is the largest job website in the world, boasting over 250 million monthly users, with nearly 10 new job listings added every second. A quick look at the site showed that it has an API, but as I was after a relatively small amount of data I decided to scrape it instead.
The process was broken down into two parts: first I wrote a Python script for scraping indeed.com, then I produced a notebook for analysing the data.
My go-to language is Python, thanks to its fast development time, large selection of libraries and excellent community.
Scraping indeed.com
First I needed to decide what to scrape. My personal focus is on data, so I searched for 'data analysis' and 'data science' jobs, limited to those posted in the last week. With this in hand I created a base URL, as shown below, and concatenated the other search strings for job, location and timeframe.
```python
job = "data+science+analyst"
location = "London"
last_x_days = "7"
base_url = f"https://uk.indeed.com/jobs?q={job}&l={location}&fromage={last_x_days}"
```
This allows me to cycle through pages using a while loop. I used Requests to fetch the pages and BeautifulSoup to parse the HTML; it provides a user-friendly API for grabbing the data. I then wrote a number of functions, one for each of the data points I wanted to grab from Indeed. These included:
- job title
- company name
- company location
- salary
- date added to indeed
- page number
- total count of jobs
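The paging logic can be sketched as below. The `start` query parameter and the 10-results-per-page step are assumptions about how Indeed's result URLs work, and the loop body is only outlined, so treat this as a sketch rather than the exact script:

```python
job = "data+science+analyst"
location = "London"
last_x_days = "7"
base_url = f"https://uk.indeed.com/jobs?q={job}&l={location}&fromage={last_x_days}"

def page_url(page_number, results_per_page=10):
    """Build the URL for a given results page; Indeed offsets pages via 'start'."""
    return f"{base_url}&start={page_number * results_per_page}"

# the scraping loop itself (network calls omitted here):
# page = 0
# while True:
#     soup = BeautifulSoup(requests.get(page_url(page)).text, "html.parser")
#     ...run the extraction functions, stop when no results come back...
#     page += 1
```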
Here is the function to get the company location. You can see that I had to remove the last item from the list, as it consistently ended with a blank item.
```python
def company_location(soup):
    """Func to get the company location"""
    data_str = ""
    for item in soup.find_all(class_="companyLocation"):
        # build a string adding the new data each loop
        data_str = data_str + item.get_text() + "\n"
    # split the string and make a list
    result = data_str.split("\n")
    # remove unwanted last item
    result.pop()
    return result
```
Job id
I had hoped to be able to pull a unique job id from Indeed, but the id seems to be dynamic and changes on each page load. So I will have to deal with potential duplicates during analysis until I find a better solution.
Reshape data
I wanted to push the data to a pandas DataFrame to make it easy to output, ready for analysis. So the data needed to be transformed from its current structure, a list of lists, into a DataFrame.
From this:-
```python
[['Data Analyst', 'Data Analyst', 'Data Analyst', 'Genome Analyst / Data Scientist', 'Graduate Data Science and Econometric Analyst'],
 ['Burberry', 'NonStop Consulting', 'Ladder.io', 'Genomics England', 'Graduate Recruitment Bureau'],
 ['London', 'Remote in London', 'London', 'Hybrid remote in London', 'United Kingdom'],
 ['No Salary provided', 'No Salary provided', 'No Salary provided', 'No Salary provided', '£42,000 - £45,000 a year'],
 [datetime.date(2022, 4, 8), datetime.date(2022, 4, 6), datetime.date(2022, 4, 12), datetime.date(2022, 4, 12), datetime.date(2022, 4, 6)]]
```
To this:-
# | job_title | company_name | company_location | salary | date_added |
---|---|---|---|---|---|
0 | Data Analyst | Burberry | London | No Salary provided | 2022-04-08 |
1 | Data Analyst | NonStop Consulting | Remote in London | No Salary provided | 2022-04-06 |
2 | Data Analyst | Ladder.io | London | No Salary provided | 2022-04-12 |
3 | Genome Analyst / Data Scientist | Genomics England | Hybrid remote in London | No Salary provided | 2022-04-12 |
4 | Graduate Data Science and Econometric Analyst | Graduate Recruitment Bureau | United Kingdom | £42,000 - £45,000 a year | 2022-04-06 |
Each inner list holds one column of data rather than one row, so the structure needs to be transposed, which I achieved using map and zip. zip takes iterables and combines their elements into tuples; map executes a function on each item of an iterable. The resulting list can then be used to produce a DataFrame, a matrix-style data structure similar to a spreadsheet with rows and columns.
```python
# transpose list of lists so it fits into a df
l = list(map(list, zip(*l)))
df = pd.DataFrame(l, columns=['job_title', 'company_name', 'company_location', 'salary', 'date_added'])
```
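To see what the transpose does, here is a toy example with made-up values:

```python
# three columns of two rows each
cols = [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2']]
# zip(*cols) pairs up the i-th element of every column; map turns each tuple back into a list
rows = list(map(list, zip(*cols)))
print(rows)  # [['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
```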
Enriching the data (feature engineering)
Before outputting the data, I wanted two extra fields added. The first was job category, which would help at the analysis stage by grouping the jobs under a small number of familiar titles. I wrote the function below and used pandas apply to run it on the job_title field. I will be adding extra categories in the future, but for now there are three: Data Analyst, Data Scientist, and Data Engineer.
```python
def map_to_job_category(x):
    """Func to map job Category if job title contains string"""
    cat_map = {'Analyst': 'Data Analyst',
               'Scientist': 'Data Scientist',
               'Data Science': 'Data Scientist',
               'Engineer': 'Data Engineer'}
    group = "unknown"
    for key in cat_map:
        if key in x:
            group = cat_map[key]
            break
    return group
```
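Wiring the function up with apply is a one-liner; here it is sketched on a toy frame (the function body is repeated so the snippet runs on its own):

```python
import pandas as pd

def map_to_job_category(x):
    """Map a job title to a category if it contains a known keyword."""
    cat_map = {'Analyst': 'Data Analyst',
               'Scientist': 'Data Scientist',
               'Data Science': 'Data Scientist',
               'Engineer': 'Data Engineer'}
    for key in cat_map:
        if key in x:
            return cat_map[key]
    return "unknown"

df = pd.DataFrame({'job_title': ['Senior Data Analyst', 'Data Engineer', 'Head of BI']})
df['job_category'] = df['job_title'].apply(map_to_job_category)
print(df['job_category'].tolist())  # ['Data Analyst', 'Data Engineer', 'unknown']
```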
Saving data
I pushed the data into a pandas DataFrame and originally output it to a CSV file. But a database is better long term, so I took advantage of the fact that Python ships with sqlite3, a self-contained, full-featured SQL database. It is very easy to set up an SQLite DB: just import the module, create a connection object and execute the SQL.
```python
import sqlite3

conn = sqlite3.connect(r'filename.db')
conn.execute('''CREATE TABLE IF NOT EXISTS JOBS
                (job_title TEXT,
                 company_name TEXT,
                 company_location TEXT,
                 salary TEXT,
                 date_added DATE,
                 date_scraped DATETIME,
                 job_category TEXT
                )''')
```
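The insert step isn't shown in the article; one convenient route (an assumption on my part, not necessarily the original code) is pandas' DataFrame.to_sql, which appends straight into the SQLite table:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # in-memory DB just for illustration
df = pd.DataFrame({'job_title': ['Data Analyst'],
                   'company_name': ['Example Ltd'],
                   'company_location': ['London'],
                   'salary': ['No Salary provided']})
# append the scraped rows; the table is created if it does not already exist
df.to_sql('JOBS', conn, if_exists='append', index=False)
count = conn.execute('SELECT COUNT(*) FROM JOBS').fetchone()[0]
print(count)  # 1
```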
The full code for this script can be found here: job_data_analysis.
Analysis of jobs data
The second stage of this task involved using a notebook (.ipynb file). Notebooks help create reproducible, shareable, collaborative computational narratives. If you have not worked with Jupyter notebooks, read here about their history. Notebooks have become an important tool for data science, and while they have issues (e.g. the lack of a debugger), they are still the best way to analyse data with Python. They are also no longer restricted to Python: you can write R and a number of other languages in notebooks.
After importing the data from the DB, the first task was to clean and enrich it. I extracted the year/hour part from the salary column and produced a salary_time field, which distinguishes between yearly and hourly pay. Note the use of the flags parameter; this is because Pandas has no case parameter on str.extract.
```python
df_DB['salary_time'] = df_DB['salary'].str.extract('(year|hour)', flags=re.IGNORECASE)
```
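On a few made-up salary strings the extract behaves like this:

```python
import re
import pandas as pd

salaries = pd.Series(['£42,000 - £45,000 a year', '£12.50 an hour', 'No Salary provided'])
# pull out the first case-insensitive match of 'year' or 'hour'; no match gives NaN
salary_time = salaries.str.extract('(year|hour)', flags=re.IGNORECASE)[0]
print(salary_time.tolist())  # ['year', 'hour', nan]
```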
I then made min and max salary fields from the salary range. A new dataframe called sal_df was made using the following code.
```python
sal_df = (df_DB['salary'].str.findall(r'(?:£)(\d+[,.]\d+)')
          .apply(pd.Series)
          .replace('[,.]', '', regex=True)
          .rename({0: 'sal_min', 1: 'sal_max'}, axis=1))
```
I then made an average salary field, using the logic that if there was no max salary, rather than averaging, I just use the minimum salary. This is because some jobs advertise a range, e.g. 40-50k, while others give a single figure, e.g. 40k.
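The article doesn't show the exact code for this step, but pandas' row-wise mean skips NaN by default, which gives precisely the "fall back to the minimum" behaviour described; a sketch with invented numbers:

```python
import pandas as pd

sal_df = pd.DataFrame({'sal_min': [40000.0, 40000.0],
                       'sal_max': [50000.0, None]})
# mean(axis=1) ignores NaN, so a single-figure salary is used as-is
sal_df['ave_salary'] = sal_df[['sal_min', 'sal_max']].mean(axis=1)
print(sal_df['ave_salary'].tolist())  # [45000.0, 40000.0]
```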
Merging the two dataframes
After checking the data, I merged the frames with a simple concat using axis=1.
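In code, that merge is a single line (the column names follow the article; the sample values are invented):

```python
import pandas as pd

df_DB = pd.DataFrame({'salary': ['£40,000 - £50,000 a year']})
sal_df = pd.DataFrame({'sal_min': [40000.0], 'sal_max': [50000.0]})
# axis=1 joins column-wise, aligning rows on the shared index
df_final = pd.concat([df_DB, sal_df], axis=1)
print(df_final.shape)  # (1, 3)
```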
Time to analyse the data
I then put together a range of visualisations using the handy plot method that comes with Pandas.
Descriptive statistics for the average salary.
It is always good to have a look at the descriptive statistics. When looking at wages, many believe it is best practice to use the 50th percentile (median) rather than the mean, because the median is robust to outliers. But for this analysis I have decided to just use the mean.
statistic | ave_salary |
---|---|
count | 143.0 |
mean | 51781.0 |
std | 23618.0 |
min | 16260.0 |
25% | 36403.0 |
50% | 47450.0 |
75% | 62000.0 |
max | 155000.0 |
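A table like the one above comes straight out of pandas' describe; for instance (with invented figures, not the scraped data):

```python
import pandas as pd

ave_salary = pd.Series([30000.0, 47000.0, 62000.0, 155000.0], name='ave_salary')
# count, mean, std, min, quartiles and max in one call
stats = ave_salary.describe().round()
print(stats['max'])  # 155000.0
```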
Highest salaries
I pulled the highest salaries.
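The query mirrors the one used for the lowest salaries further down, just sorted descending; a self-contained sketch with dummy rows:

```python
import pandas as pd

df_final = pd.DataFrame({'company_name': ['A Ltd', 'B Ltd', 'C Ltd'],
                         'job_title': ['Analyst', 'Consultant', 'Engineer'],
                         'ave_salary': [0.0, 155000.0, 47000.0]})
# drop rows with no salary, keep the three columns of interest, top 10 by salary
top = (df_final.query("ave_salary > 0")
       .filter(['company_name', 'job_title', 'ave_salary'])
       .sort_values(by='ave_salary', ascending=False)
       .head(10))
print(top['ave_salary'].tolist())  # [155000.0, 47000.0]
```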
# | company_name | job_title | ave_salary |
---|---|---|---|
569 | Digital Waffle | Principle Data Analytics Consultant | 155000.0 |
438 | Digital Waffle | Principle Data Analytics Consultant | 155000.0 |
256 | Oliver James Associates | Senior Data Solution Architect | 125000.0 |
251 | BCT Resourcing | Senior Data Solution Architect | 125000.0 |
727 | Client Server | Software Developer C# .Net Core Data Cloud | 110000.0 |
247 | La Fosse Associates | Lead Data Engineer | 105000.0 |
721 | Technical Resources Ltd | Senior/Principal Data Engineer GCP | 100000.0 |
208 | Intellectual Property Office | Director of the Intellectual Property Framework | 94400.0 |
695 | Taylor Root Global Legal Recruitment | Model Validation Manager | 85000.0 |
120 | LexisNexis Risk Solutions Group | Software Engineering Lead ICIS Sutton | 82500.0 |
Lowest salaries
```python
df_final.query("ave_salary > 0").filter(['company_name', 'job_title', 'ave_salary']).sort_values(by='ave_salary', ascending=True).head(10)
```
# | company_name | job_title | ave_salary |
---|---|---|---|
353 | The Institute of Cancer Research | Student Placement - Clinical Trial Assistant (… | 16260.0 |
89 | Cathcart Associates Graduate Database | Support Analyst | 21000.0 |
127 | HeadStart Speech & Behaviour Therapies | ABA Tutor - Autism | 21710.0 |
711 | Client Server | IT Support / Operations Analyst SQL - E-commerce | 22000.0 |
28 | The Monarch Partnership Limited | Junior Pricing Analyst | 23000.0 |
408 | TalentPool | Business Insight Analyst | 23000.0 |
673 | Graduate Recruitment Bureau | Graduate German Speaking Digital Marketing Ana… | 24000.0 |
565 | I.T. Associates Ltd | Junior Procim Support/Business Analyst – 23 to… | 25500.0 |
159 | Department for Transport | Analysis Officer | 26126.0 |
504 | Natural England | Earth Observation Analyst | 26224.0 |
Job title
I looked at the most common job titles. One problem when trying to compare jobs is that companies do not always use standard job titles; they prefer very specific ones, which makes it harder to compare like for like. In future I might look at using ML to cluster jobs based on the wording of the advert, but for now we will stick with the job_title and job_category fields.
```
Data Analyst              29
Senior Data Analyst       15
Data Scientist            14
Data Engineer             12
Business Analyst           9
Senior Analyst             6
Product Analyst            6
Senior Insight Analyst     5
Quantitative Analyst       5
Analyst                    5
Name: job_title, dtype: int64
```
Looking at data using the job_category field
Looking at the data captured so far, there is a lack of salaries given; only about 25% of jobs advertise a salary, which seems low to me. Of the data scientist jobs scraped so far, none has a salary advertised.
Here are the salaries by job category.
job_category | ave_salary (mean) | ave_salary (count) | job_title (count) |
---|---|---|---|
Data Analyst | 41204.0 | 40 | 192 |
Data Engineer | 65602.0 | 3 | 21 |
Data Scientist | NaN | 0 | 7 |
unknown | 72191.8 | 20 | 94 |
I will develop this analysis further in the future if there is demand; I am thinking of pulling technical requirements and years-of-experience data. I hope you have enjoyed the read.