Data Science Capstone Project

Understanding the Data Science job market in Atlanta using Cosine Similarity and GMaps


Executive Summary
There are websites that provide a match percentage between one's resume and a job description. There are websites that provide a geographic representation of jobs in a given area. There are services that provide a real-time visual representation of traffic in a given area. However – there is no single service that provides all three of these data points simultaneously. I believe a service that does could help job seekers make more informed decisions about their careers.

Example: A job seeker who moves to a new city can better compare different opportunities based on commute distance and typical traffic patterns.

Problem Statement
Can a platform be created where job seekers are provided a better visual and analytical overview of jobs they are considering?

Anticipated Results:
A model that computes the cosine similarity between two or more documents. These results are then used to provide a visual overview of the jobs that best match one's profile.
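
For context, here is a minimal sketch of that idea using scikit-learn's CountVectorizer and cosine_similarity on placeholder strings (not the project's actual resume or job descriptions; the full comparison code appears later in this write-up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents, for illustration only
resume_text = 'python machine learning sql data visualization atlanta'
job_text = 'seeking a data scientist with python and sql experience in atlanta'

vectors = CountVectorizer().fit_transform([resume_text, job_text])

# cosine_similarity returns a 2x2 matrix; entry [0, 1] compares resume vs. job
match_percentage = cosine_similarity(vectors)[0, 1] * 100
print('Match: {:.1f}%'.format(match_percentage))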


Scope Statement
As with any project – a precise and succinct scope statement is needed to ensure that all the objectives are met.
1) Provide a visual overview of the data science jobs available in the Atlanta area.
2) Provide a percentage match between a resume and a job description.
3) Provide a visual representation of typical traffic patterns for the opportunities.

Summary
To accomplish the scope above, the following procedures were utilized.
1. Examine different models to determine which would provide the best results for document comparisons.
2. Examine different platforms to determine which would provide the best map layout overview.
3. Examine which platforms can accommodate a traffic overview.

Anticipated Risks
1. Have not used Gmaps previously. I do not know at this time how complicated this could be.
2. Contingency Plan: Use Tableau Map function
3. Converting scraped job descriptions into individual CSV files.
4. Contingency Plan: May need to just do each on a manual basis. I could probably manage at least 50 individual jobs. This should be enough to conduct an analysis for the Atlanta area.
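
Below is a hedged sketch of that CSV-conversion contingency: writing each scraped job description to its own file so it can be vectorized individually. The company keys, sample text, and column name are placeholders; only the Job_Descriptions_csv folder name appears elsewhere in this project.

import pandas as pd

# Placeholder descriptions; in the project these would come from the scraped postings
job_descriptions = {
    'TruckIT-Desc': 'We are looking for a data scientist with Python and SQL...',
    'Cotiviti': 'The analytics team is hiring a machine learning practitioner...',
}

for company, description in job_descriptions.items():
    # One single-column CSV per job description
    pd.DataFrame({'description': [description]}).to_csv(
        'Job_Descriptions_csv/' + company + '.csv', index=False)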

Project Schedule
I realized that I would need to balance the capstone against the rest of the work required to fulfill graduation requirements. A project schedule was created to make sure all the phases were completed on time and other competing work was addressed appropriately.

import pandas as pd
pd.read_excel('Project Schedule.xlsx')
Project Phase Start Date End Date Summary Notes
0 Step01-Scraping from Indeed Pages to build ini… 2018-06-30 2018-06-30 Completed
1 Step02-Get Full JD 2018-07-01 2018-07-01 Completed
2 Step03-Countvectorizer_Regional-Technical_Requ… 2018-07-02 2018-07-02 Completed
3 Step04-StopWords_Regional_Cultural_Requirements 2018-07-03 2018-07-04 Completed
4 Step05-Stop Words-Regional_TechnicalRequirements 2018-07-04 2018-07-04 Completed
5 Step06-Cleaning-BaselineResume 2018-07-05 2018-07-05 Completed
6 Step07-Countvectorize_Resume 2018-07-06 2018-07-06 Completed
7 Step08-Cleaning-Updated_Resume 2018-07-07 2018-07-07 Completed
8 Step09-Countvectorize_Updated_Resume 2018-07-08 2018-07-08 Completed
9 Step10-Capstone-Getting Individual JDs 2018-07-09 2018-07-09 Completed
10 Step11-Cleaning-Individual_JD 2018-07-10 2018-07-12 Completed
11 Step12-Using Cosine Similarity to compare docu… 2018-07-10 2018-07-11 Completed
12 Step13-UsingGMaps 2018-07-11 2018-07-12 Completed
13 Step14-Report Write-up + Technical Analysis 2018-07-12 2018-07-13 Completed
14 Step15-Create Presentations 2018-07-13 2018-07-13 Completed

Code Snippets

Please note – the following are code snippets from my overall project.
They are not the entirety of the code, but examples of the work involved in the overall project.

Code Snippet: Initial Scraping Work using BeautifulSoup

from time import sleep

import requests
from bs4 import BeautifulSoup

indeed_cities = ['Atlanta']   # the project focuses on the Atlanta area
max_results_per_city = 1000
results = []

for city in indeed_cities:
    for start in range(0, max_results_per_city, 100):
        # Advanced-search URL for "data scientist python" postings, 100 results per page
        url = ('https://www.indeed.com/jobs?as_and=data+scientist+python'
               '&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary='
               '&radius=25&l=' + city + '&fromage=any&limit=100'
               '&start=' + str(start) + '&sort=&psf=advsrch')
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        # Each posting on the results page sits in a div with class "row result"
        for result in soup.find_all('div', {'class': ' row result'}):
            results.append(result)
        sleep(1)   # pause between page requests
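
The next snippet loops over a list named slugs, which is not built in the excerpt above. Here is a hedged sketch of how it might be derived from the scraped results, assuming each result div contains an anchor whose href is the relative URL of the full posting:

slugs = []
for result in results:
    link = result.find('a', href=True)   # first link inside the result card
    if link:
        slugs.append(link['href'])       # relative path to the full job posting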

Code Snippet: Getting Full Job Descriptions

extended_descriptions = []

for r in slugs[:30]:
    new_url = 'http://www.indeed.com' + r
    print('Requesting content from ' + new_url)
    res = requests.get(new_url)
    # print('Converting content from the res object.')
    soup = BeautifulSoup(res.content, 'lxml')
    print('Appending soup...')
    extended_descriptions.append(soup)
    sleep(3)   # pause between requests

Code Snippet: Customizing StopWords

custom_stopwords = ['000', '01', '06', '08','10254', '12', '15',
'19', '2018', '22', '25', '28', '45', '500',
'cox', 'norfolk', 'apply', 'com', 'www', 'applications', 'application',
'applicants', 'southern', 'https', 'ia', 'var', 'indeedapply', 'env',
'atlanta', 'opportunity', 'iip', 'gender', 'location', 'new', 'employer',
'midtown', 'manheim', 'ml', 'including', 'llc', 'truck', 'automotive', 'nationality',
'nation', 'iot', 'kelley', 'hopea', 'date', 'incadea', 'honeywell', '100', '1372', '27', '300',
'30308', '30309', '59', '60', '666', '715', '800', '850', '89', '90', 'ga', 'geo', 'genetic',
'mercedes', 'marta', 'lunch', 'familimarity', 'fitting', 'floors', 'furthermore', 'living',
'make', 'members', 'family', 'req149533', 'requisition', 'freshman', 'sophomore', 'et', 'etc',
'etl', 'job', 'invest', 'member', 'eye', 'relocation', 'Unnamed', 'wework', 'yarn', 'yrs',
'test', 'intent', 'intermediete', 'key', 'inflection', 'informatica', 'way', 'recent', 'fewer',
'iteratively', 'joining', 'd3', 'bi', 'bs', 'alteryx', 'benz', 'ai', 'arcgis', 'talend', 'al',
'bus', 'cassandra', 'growing', 'growth', 'guidance', 'bigdata', 'bigquery', 'cotiviti',
'councils', 'like', 'located', 'devops', 'usa', 'winning', 'ex', 'awesome', 'address',
'assurance', 'pig', 'needed', 'id', 'integral', 'impeccable', 'arts', 'auditing', 'community',
'commuter', 'jobs', 'help', 'js', 'human', 'variety', 'stipend', 'rewards', 'sharting',
'daimler', 'degreepreferred', 'advisors', 'characteristics', 'draw', 'donor', 'creek', 'dental',
'medical', 'survival', '0064382_p0223181', '10', '1553', '2016', '24', '30327', '401',
'experiencepredictive', 'emory', 'caffe2', 'caffe', 'workingmother',]
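
Below is a minimal sketch of how this list might be used, assuming it is merged with scikit-learn's built-in English stop words and passed to CountVectorizer; the sample document and exact vectorizer settings are illustrative, as the project's actual call is not shown in this excerpt.

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Combine the standard English stop words with the project-specific noise terms
all_stopwords = list(ENGLISH_STOP_WORDS) + custom_stopwords

sample_docs = ['Apply now in Atlanta for a data scientist role using Python and SQL']
cv = CountVectorizer(stop_words=all_stopwords)
counts = cv.fit_transform(sample_docs)
print(sorted(cv.vocabulary_))   # noise terms such as 'apply' and 'atlanta' are filtered out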

Code Snippet: Using Cosine_Similarity

Credit: https://blogs.oracle.com/meena/finding-similarity-between-text-documents

import numpy as np

# `process` (which returns a word-count dictionary) and `cos_sim` are project
# helpers defined elsewhere; a sketch of cos_sim follows this snippet.
def get_similarity(dict1, dict2):
    # Combined vocabulary from both documents (shared words appear twice,
    # as in the credited reference implementation)
    all_words_list = []
    for key in dict1:
        all_words_list.append(key)
    for key in dict2:
        all_words_list.append(key)
    all_words_list_size = len(all_words_list)

    # Aligned count vectors over the combined vocabulary
    v1 = np.zeros(all_words_list_size, dtype=int)
    v2 = np.zeros(all_words_list_size, dtype=int)
    for i, key in enumerate(all_words_list):
        v1[i] = dict1.get(key, 0)
        v2[i] = dict2.get(key, 0)
    return cos_sim(v1, v2) * 100


if __name__ == '__main__':
    dict1 = process('Bullets')
    dict2 = process('Resumes/Updated_Resume_Vectorized.csv')
    dict3 = process('Resumes/Resume_Vectorized.csv')
    dict4 = process('Paragraphs')
    dict5 = process('Job_Descriptions_csv/TruckIT-Desc.csv')
    dict6 = process('Job_Descriptions_csv/Cotiviti.csv')

    print("Similarity between the two documents you are comparing is",
          get_similarity(dict1, dict2))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict1, dict3))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict2, dict3))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict2, dict4))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict3, dict4))
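
The cos_sim helper called above is not included in this excerpt. Here is a minimal sketch of a standard implementation, assuming it follows the usual cosine formula (dot product divided by the product of the vector norms):

import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between the two count vectors; values near 1 mean very similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))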


Code Snippet: Summary Table as a DataFrame

import pandas as pd
pd.read_excel('Job Table.xlsx')
Documents being analyzed for similarity     Resume 1 – Updated     Resume 2 – Previous     Address     Latitude     Longitude
0 Aggregated Regional Job Descriptions – Bullets 0.806 0.807 NaN NaN NaN
1 Similarity between 2 Resumes 0.998 0.998 NaN NaN NaN
2 Aggregated Regional Job Descriptions – Culture… 0.825 0.825 NaN NaN NaN
3 Truck IT 0.805 0.807 1380 West Paces Ferry Rd NW, Atlanta, GA 30327 33.847387 -84.431586
4 Cotiviti 0.656 0.661 One Glenlake Parkway #1400, Atlanta, GA 30328 33.934144 -84.359620
5 Home Depot 0.686 0.688 2455 Paces Ferry Rd SE, Atlanta, GA 30339 33.865490 -84.481408
6 Honeywell 0.814 0.816 715 Peachtree St NE, Atlanta, GA 30308 33.773495 -84.387307
7 AnswerRocket 0.821 0.823 50 Glenlake Pkwy NE #200, Sandy Springs, GA 30328 33.940736 -84.361513
8 Cox Automotive 0.817 0.821 3003 Summit Blvd NE #200, Atlanta, GA 30319 33.913781 -84.342588
9 Travelport 0.832 0.841 760 Doug Davis Dr # B, Atlanta, GA 30354 33.657159 -84.415204
10 Coca Cola 0.928 0.928 1 Coca-Cola Plaza, Atlanta, GA 30313 33.754155 -84.381390
11 SalesLoft 0.892 0.896 1180 W Peachtree St NW #600, Atlanta, GA 30309 33.786986 -84.388118
12 SoftVision 0.645 0.656 1349 W Peachtree St NE Suite 1375, Atlanta, GA… 33.791482 -84.386529
13 Aarons 0.843 0.846 500 Chastain Center Blvd, Kennesaw, GA 30144 34.033083 -84.564686
14 Inspire Brands 0.819 0.823 1155 Perimeter Center W Ste 1200, Atlanta, GA … 33.930034 -84.350866
15 Hiscox 0.867 0.871 5 Concourse Pkwy Suite 2150, Atlanta, GA 30328 33.917080 -84.354513
Code Snippet: Using GMaps
import gmaps

# gmaps requires a Google Maps API key, configured once per session:
# gmaps.configure(api_key='...')

data_science = [
    {'name': 'TruckIT', 'location': (33.847387, -84.431586), 'Match Percentage': 0.807, 'Days Posted': 5},
    {'name': 'Cotiviti', 'location': (33.934144, -84.35962), 'Match Percentage': 0.665, 'Days Posted': 5},
    {'name': 'Home Depot', 'location': (33.86549, -84.481408), 'Match Percentage': 0.688, 'Days Posted': 5},
    {'name': 'AnswerRocket', 'location': (33.940736, -84.361513), 'Match Percentage': 0.823, 'Days Posted': 2},
    {'name': 'Cox Automotive', 'location': (33.913781, -84.342588), 'Match Percentage': 0.821, 'Days Posted': 5},
    {'name': 'TravelPort', 'location': (33.657159, -84.415204), 'Match Percentage': 0.841, 'Days Posted': 5},
    {'name': 'Coca Cola', 'location': (33.754155, -84.38139), 'Match Percentage': 0.928, 'Days Posted': 5},
    {'name': 'SalesLoft', 'location': (33.786986, -84.388118), 'Match Percentage': 0.896, 'Days Posted': 4},
    {'name': 'SoftVision', 'location': (33.791482, -84.386529), 'Match Percentage': 0.645, 'Days Posted': 5},
    {'name': 'Aarons', 'location': (34.033083, -84.564686), 'Match Percentage': 0.846, 'Days Posted': 5},
    {'name': 'HOME', 'location': (33.968690, -84.785872), 'Match Percentage': 0, 'Days Posted': 0},
    {'name': 'Inspire Brands', 'location': (33.930034, -84.350866), 'Match Percentage': 0.823, 'Days Posted': 5},
    {'name': 'Hiscox Insurance', 'location': (33.91708, -84.354513), 'Match Percentage': 0.871, 'Days Posted': 6},
]

job_locations = [job['location'] for job in data_science]

# HTML shown in the info box when a marker is clicked
info_box_template = """<dl><dt>Name</dt><dd>{name}</dd><dt>Match %</dt><dd>{Match Percentage}</dd><dt>Days Posted</dt><dd>{Days Posted}</dd></dl>"""
job_info = [info_box_template.format(**job) for job in data_science]

marker_layer = gmaps.marker_layer(job_locations, info_box_content=job_info)

# Center the map on downtown Atlanta and overlay the live traffic layer
fig = gmaps.figure(center=(33.753746, -84.386330), zoom_level=11)
fig.add_layer(marker_layer)
fig.add_layer(gmaps.traffic_layer())
fig

[Screenshot: GMaps output showing the job markers and traffic layer over the Atlanta area]

For a more thorough breakdown of this project, please visit my Capstone GitHub page:
https://github.com/gitliong/Capstone_DSI