Data Science Capstone Project

Understanding the Data Science job market in Atlanta using Cosine Similarity and GMaps


Executive Summary
There are websites that provide a match percentage between one's resume and a job description. There are websites that provide a geographic representation of jobs in a given area. There are services that provide a real-time visual representation of traffic in a given area. However – there is no single service that provides all three of these data points simultaneously. I believe a service that does could help job seekers make more informed decisions about their careers.

Example: A job seeker who moves to a new city can better compare different opportunities based on commute distance and typical traffic patterns.

Problem Statement
Can a platform be created where job seekers are provided a better visual and analytical overview of jobs they are considering?

Anticipated Results:
A model that computes the cosine similarity between two or more documents. These results are then used to provide a visual overview of the jobs that best match one's profile.
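
For context, here is a minimal sketch of that idea using scikit-learn's CountVectorizer and cosine_similarity on placeholder strings (not the project's actual resume or job descriptions; the full comparison code appears later in this write-up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents, for illustration only
resume_text = 'python machine learning sql data visualization atlanta'
job_text = 'seeking a data scientist with python and sql experience in atlanta'

vectors = CountVectorizer().fit_transform([resume_text, job_text])

# cosine_similarity returns a 2x2 matrix; entry [0, 1] compares resume vs. job
match_percentage = cosine_similarity(vectors)[0, 1] * 100
print('Match: {:.1f}%'.format(match_percentage))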


Scope Statement
As with any project – a precise and succinct scope statement is needed to ensure that all the objectives are met.
1) Provide a visual overview of the data science jobs available in the Atlanta area.
2) Provide a percentage match between a resume and a job description.
3) Provide a visual representation of typical traffic patterns for the opportunities.

Summary
To accomplish the scope above, the following procedures were utilized.
1. Examine different models to determine which would provide the best results for document comparisons.
2. Examine different platforms to determine which would provide the best map layout overview.
3. Examine which platforms can accommodate a traffic overview.

Anticipated Risks
1. Have not used Gmaps previously. I do not know at this time how complicated this could be.
2. Contingency Plan: Use Tableau Map function
3. Converting scraped job descriptions into individual CSV files.
4. Contingency Plan: May need to just do each on a manual basis. I could probably manage at least 50 individual jobs. This should be enough to conduct an analysis for the Atlanta area.
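
Below is a hedged sketch of that CSV-conversion contingency: writing each scraped job description to its own file so it can be vectorized individually. The company keys, sample text, and column name are placeholders; only the Job_Descriptions_csv folder name appears elsewhere in this project.

import pandas as pd

# Placeholder descriptions; in the project these would come from the scraped postings
job_descriptions = {
    'TruckIT-Desc': 'We are looking for a data scientist with Python and SQL...',
    'Cotiviti': 'The analytics team is hiring a machine learning practitioner...',
}

for company, description in job_descriptions.items():
    # One single-column CSV per job description
    pd.DataFrame({'description': [description]}).to_csv(
        'Job_Descriptions_csv/' + company + '.csv', index=False)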

Project Schedule
I realized that I would need to balance the capstone against the rest of the work required to fulfill graduation requirements. A project schedule was created to make sure all the phases were completed on time and other competing work was addressed appropriately.

import pandas as pd
pd.read_excel('Project Schedule.xlsx')
Project Phase Start Date End Date Summary Notes
0 Step01-Scraping from Indeed Pages to build ini… 2018-06-30 2018-06-30 Completed
1 Step02-Get Full JD 2018-07-01 2018-07-01 Completed
2 Step03-Countvectorizer_Regional-Technical_Requ… 2018-07-02 2018-07-02 Completed
3 Step04-StopWords_Regional_Cultural_Requirements 2018-07-03 2018-07-04 Completed
4 Step05-Stop Words-Regional_TechnicalRequirements 2018-07-04 2018-07-04 Completed
5 Step06-Cleaning-BaselineResume 2018-07-05 2018-07-05 Completed
6 Step07-Countvectorize_Resume 2018-07-06 2018-07-06 Completed
7 Step08-Cleaning-Updated_Resume 2018-07-07 2018-07-07 Completed
8 Step09-Countvectorize_Updated_Resume 2018-07-08 2018-07-08 Completed
9 Step10-Capstone-Getting Individual JDs 2018-07-09 2018-07-09 Completed
10 Step11-Cleaning-Individual_JD 2018-07-10 2018-07-12 Completed
11 Step12-Using Cosine Similarity to compare docu… 2018-07-10 2018-07-11 Completed
12 Step13-UsingGMaps 2018-07-11 2018-07-12 Completed
13 Step14-Report Write-up + Technical Analysis 2018-07-12 2018-07-13 Completed
14 Step15-Create Presentations 2018-07-13 2018-07-13 Completed

Code Snippets

Please note – the following are code snippets from my overall project.
They are not the entirety of the code, but examples of the work involved in the overall project.

Code Snippet: Initial Scraping Work using BeautifulSoup

from time import sleep

import requests
from bs4 import BeautifulSoup

indeed_cities = ['Atlanta']   # the project focuses on the Atlanta area
max_results_per_city = 1000
results = []

for city in indeed_cities:
    for start in range(0, max_results_per_city, 100):
        # Advanced-search URL for "data scientist python" postings, 100 results per page
        url = ('https://www.indeed.com/jobs?as_and=data+scientist+python'
               '&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary='
               '&radius=25&l=' + city + '&fromage=any&limit=100'
               '&start=' + str(start) + '&sort=&psf=advsrch')
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        # Each posting on the results page sits in a div with class "row result"
        for result in soup.find_all('div', {'class': ' row result'}):
            results.append(result)
        sleep(1)   # pause between page requests
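
The next snippet loops over a list named slugs, which is not built in the excerpt above. Here is a hedged sketch of how it might be derived from the scraped results, assuming each result div contains an anchor whose href is the relative URL of the full posting:

slugs = []
for result in results:
    link = result.find('a', href=True)   # first link inside the result card
    if link:
        slugs.append(link['href'])       # relative path to the full job posting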

Code Snippet: Getting Full Job Descriptions

extended_descriptions = []

for r in slugs[:30]:
    new_url = 'http://www.indeed.com' + r
    print('Requesting content from ' + new_url)
    res = requests.get(new_url)
    # print('Converting content from the res object.')
    soup = BeautifulSoup(res.content, 'lxml')
    print('Appending soup...')
    extended_descriptions.append(soup)
    sleep(3)   # pause between requests

Code Snippet: Customizing StopWords

custom_stopwords = ['000', '01', '06', '08','10254', '12', '15',
'19', '2018', '22', '25', '28', '45', '500',
'cox', 'norfolk', 'apply', 'com', 'www', 'applications', 'application',
'applicants', 'southern', 'https', 'ia', 'var', 'indeedapply', 'env',
'atlanta', 'opportunity', 'iip', 'gender', 'location', 'new', 'employer',
'midtown', 'manheim', 'ml', 'including', 'llc', 'truck', 'automotive', 'nationality',
'nation', 'iot', 'kelley', 'hopea', 'date', 'incadea', 'honeywell', '100', '1372', '27', '300',
'30308', '30309', '59', '60', '666', '715', '800', '850', '89', '90', 'ga', 'geo', 'genetic',
'mercedes', 'marta', 'lunch', 'familimarity', 'fitting', 'floors', 'furthermore', 'living',
'make', 'members', 'family', 'req149533', 'requisition', 'freshman', 'sophomore', 'et', 'etc',
'etl', 'job', 'invest', 'member', 'eye', 'relocation', 'Unnamed', 'wework', 'yarn', 'yrs',
'test', 'intent', 'intermediete', 'key', 'inflection', 'informatica', 'way', 'recent', 'fewer',
'iteratively', 'joining', 'd3', 'bi', 'bs', 'alteryx', 'benz', 'ai', 'arcgis', 'talend', 'al',
'bus', 'cassandra', 'growing', 'growth', 'guidance', 'bigdata', 'bigquery', 'cotiviti',
'councils', 'like', 'located', 'devops', 'usa', 'winning', 'ex', 'awesome', 'address',
'assurance', 'pig', 'needed', 'id', 'integral', 'impeccable', 'arts', 'auditing', 'community',
'commuter', 'jobs', 'help', 'js', 'human', 'variety', 'stipend', 'rewards', 'sharting',
'daimler', 'degreepreferred', 'advisors', 'characteristics', 'draw', 'donor', 'creek', 'dental',
'medical', 'survival', '0064382_p0223181', '10', '1553', '2016', '24', '30327', '401',
'experiencepredictive', 'emory', 'caffe2', 'caffe', 'workingmother',]
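
Below is a minimal sketch of how this list might be used, assuming it is merged with scikit-learn's built-in English stop words and passed to CountVectorizer; the sample document and exact vectorizer settings are illustrative, as the project's actual call is not shown in this excerpt.

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Combine the standard English stop words with the project-specific noise terms
all_stopwords = list(ENGLISH_STOP_WORDS) + custom_stopwords

sample_docs = ['Apply now in Atlanta for a data scientist role using Python and SQL']
cv = CountVectorizer(stop_words=all_stopwords)
counts = cv.fit_transform(sample_docs)
print(sorted(cv.vocabulary_))   # noise terms such as 'apply' and 'atlanta' are filtered out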

Code Snippet: Using Cosine_Similarity

Credit: https://blogs.oracle.com/meena/finding-similarity-between-text-documents

import numpy as np

# `process` (which returns a word-count dictionary) and `cos_sim` are project
# helpers defined elsewhere; a sketch of cos_sim follows this snippet.
def get_similarity(dict1, dict2):
    # Combined vocabulary from both documents (shared words appear twice,
    # as in the credited reference implementation)
    all_words_list = []
    for key in dict1:
        all_words_list.append(key)
    for key in dict2:
        all_words_list.append(key)
    all_words_list_size = len(all_words_list)

    # Aligned count vectors over the combined vocabulary
    v1 = np.zeros(all_words_list_size, dtype=int)
    v2 = np.zeros(all_words_list_size, dtype=int)
    for i, key in enumerate(all_words_list):
        v1[i] = dict1.get(key, 0)
        v2[i] = dict2.get(key, 0)
    return cos_sim(v1, v2) * 100


if __name__ == '__main__':
    dict1 = process('Bullets')
    dict2 = process('Resumes/Updated_Resume_Vectorized.csv')
    dict3 = process('Resumes/Resume_Vectorized.csv')
    dict4 = process('Paragraphs')
    dict5 = process('Job_Descriptions_csv/TruckIT-Desc.csv')
    dict6 = process('Job_Descriptions_csv/Cotiviti.csv')

    print("Similarity between the two documents you are comparing is",
          get_similarity(dict1, dict2))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict1, dict3))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict2, dict3))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict2, dict4))
    print("Similarity between the two documents you are comparing is",
          get_similarity(dict3, dict4))
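
The cos_sim helper called above is not included in this excerpt. Here is a minimal sketch of a standard implementation, assuming it follows the usual cosine formula (dot product divided by the product of the vector norms):

import numpy as np

def cos_sim(a, b):
    # Cosine of the angle between the two count vectors; values near 1 mean very similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))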


Code Snippet: Summary Table as a DataFrame

import pandas as pd
pd.read_excel('Job Table.xlsx')
Documents being analyzed for similarity     Resume 1 – Updated     Resume 2 – Previous     Address     Latitude     Longitude
0 Aggregated Regional Job Descriptions – Bullets 0.806 0.807 NaN NaN NaN
1 Similarity between 2 Resumes 0.998 0.998 NaN NaN NaN
2 Aggregated Regional Job Descriptions – Culture… 0.825 0.825 NaN NaN NaN
3 Truck IT 0.805 0.807 1380 West Paces Ferry Rd NW, Atlanta, GA 30327 33.847387 -84.431586
4 Cotiviti 0.656 0.661 One Glenlake Parkway #1400, Atlanta, GA 30328 33.934144 -84.359620
5 Home Depot 0.686 0.688 2455 Paces Ferry Rd SE, Atlanta, GA 30339 33.865490 -84.481408
6 Honeywell 0.814 0.816 715 Peachtree St NE, Atlanta, GA 30308 33.773495 -84.387307
7 AnswerRocket 0.821 0.823 50 Glenlake Pkwy NE #200, Sandy Springs, GA 30328 33.940736 -84.361513
8 Cox Automotive 0.817 0.821 3003 Summit Blvd NE #200, Atlanta, GA 30319 33.913781 -84.342588
9 Travelport 0.832 0.841 760 Doug Davis Dr # B, Atlanta, GA 30354 33.657159 -84.415204
10 Coca Cola 0.928 0.928 1 Coca-Cola Plaza, Atlanta, GA 30313 33.754155 -84.381390
11 SalesLoft 0.892 0.896 1180 W Peachtree St NW #600, Atlanta, GA 30309 33.786986 -84.388118
12 SoftVision 0.645 0.656 1349 W Peachtree St NE Suite 1375, Atlanta, GA… 33.791482 -84.386529
13 Aarons 0.843 0.846 500 Chastain Center Blvd, Kennesaw, GA 30144 34.033083 -84.564686
14 Inspire Brands 0.819 0.823 1155 Perimeter Center W Ste 1200, Atlanta, GA … 33.930034 -84.350866
15 Hiscox 0.867 0.871 5 Concourse Pkwy Suite 2150, Atlanta, GA 30328 33.917080 -84.354513
Code Snippet: Using GMaps
import gmaps

# gmaps requires a Google Maps API key, configured once per session:
# gmaps.configure(api_key='...')

data_science = [
    {'name': 'TruckIT', 'location': (33.847387, -84.431586), 'Match Percentage': 0.807, 'Days Posted': 5},
    {'name': 'Cotiviti', 'location': (33.934144, -84.35962), 'Match Percentage': 0.665, 'Days Posted': 5},
    {'name': 'Home Depot', 'location': (33.86549, -84.481408), 'Match Percentage': 0.688, 'Days Posted': 5},
    {'name': 'AnswerRocket', 'location': (33.940736, -84.361513), 'Match Percentage': 0.823, 'Days Posted': 2},
    {'name': 'Cox Automotive', 'location': (33.913781, -84.342588), 'Match Percentage': 0.821, 'Days Posted': 5},
    {'name': 'TravelPort', 'location': (33.657159, -84.415204), 'Match Percentage': 0.841, 'Days Posted': 5},
    {'name': 'Coca Cola', 'location': (33.754155, -84.38139), 'Match Percentage': 0.928, 'Days Posted': 5},
    {'name': 'SalesLoft', 'location': (33.786986, -84.388118), 'Match Percentage': 0.896, 'Days Posted': 4},
    {'name': 'SoftVision', 'location': (33.791482, -84.386529), 'Match Percentage': 0.645, 'Days Posted': 5},
    {'name': 'Aarons', 'location': (34.033083, -84.564686), 'Match Percentage': 0.846, 'Days Posted': 5},
    {'name': 'HOME', 'location': (33.968690, -84.785872), 'Match Percentage': 0, 'Days Posted': 0},
    {'name': 'Inspire Brands', 'location': (33.930034, -84.350866), 'Match Percentage': 0.823, 'Days Posted': 5},
    {'name': 'Hiscox Insurance', 'location': (33.91708, -84.354513), 'Match Percentage': 0.871, 'Days Posted': 6},
]

job_locations = [job['location'] for job in data_science]

# HTML shown in the info box when a marker is clicked
info_box_template = """<dl><dt>Name</dt><dd>{name}</dd><dt>Match %</dt><dd>{Match Percentage}</dd><dt>Days Posted</dt><dd>{Days Posted}</dd></dl>"""
job_info = [info_box_template.format(**job) for job in data_science]

marker_layer = gmaps.marker_layer(job_locations, info_box_content=job_info)

# Center the map on downtown Atlanta and overlay the live traffic layer
fig = gmaps.figure(center=(33.753746, -84.386330), zoom_level=11)
fig.add_layer(marker_layer)
fig.add_layer(gmaps.traffic_layer())
fig

[Screenshot: GMaps output showing the job markers and traffic layer over the Atlanta area]

For a more thorough breakdown of this project, please visit my Capstone GitHub page:
https://github.com/gitliong/Capstone_DSI