# Script for downloading the dataset
!pip install --user kaggle  # Kaggle API, used to download the dataset
!pip install --user pandas
!pip install --user numpy
Collecting kaggle
  Using cached kaggle-1.5.12.tar.gz (58 kB)
Requirement already satisfied: certifi in /usr/lib/python3/dist-packages (from kaggle) (2019.11.28)
Requirement already satisfied: python-dateutil in /usr/lib/python3/dist-packages (from kaggle) (2.7.3)
Collecting python-slugify
  Using cached python_slugify-6.1.1-py2.py3-none-any.whl (9.1 kB)
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (from kaggle) (2.22.0)
Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.14.0)
Collecting tqdm
  Downloading tqdm-4.63.0-py2.py3-none-any.whl (76 kB)
Requirement already satisfied: urllib3 in /usr/lib/python3/dist-packages (from kaggle) (1.25.8)
Collecting text-unidecode>=1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73052 sha256=02172fba1c0ec42884ad8bcbb3c3b99749f529299444b00aaa946e78b9dfcb1f
  Stored in directory: /home/students/s444463/.cache/pip/wheels/29/da/11/144cc25aebdaeb4931b231e25fd34b394e6a5725cbb2f50106
Successfully built kaggle
Installing collected packages: text-unidecode, tqdm, python-slugify, kaggle
  WARNING: The script tqdm is installed in '/home/students/s444463/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  WARNING: The script kaggle is installed in '/home/students/s444463/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed kaggle-1.5.12 python-slugify-6.1.1 text-unidecode-1.3 tqdm-4.63.0
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Requirement already satisfied: pandas in /usr/lib/python3/dist-packages (0.25.3)
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Requirement already satisfied: numpy in /usr/lib/python3/dist-packages (1.17.4)
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.
# Instructions: https://www.kaggle.com/docs/api
!kaggle datasets download -d shivamb/real-or-fake-fake-jobposting-prediction
/bin/bash: kaggle: command not found
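The `command not found` error is consistent with the pip warning shown earlier: the `kaggle` script was installed into `~/.local/bin`, which is not on PATH. The following is a minimal sketch of one possible workaround, assuming the default `--user` script location on Linux; `ensure_on_path` is a hypothetical helper name, not part of any library. Shell commands started from the notebook should inherit the modified environment.

```python
import os
import shutil
from pathlib import Path


def ensure_on_path(directory):
    """Prepend `directory` to PATH for this process and its children, if missing."""
    directory = str(Path(directory).expanduser())
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)
    return directory


# Assumption: pip's --user scripts live in ~/.local/bin, as the pip warning reports.
user_bin = ensure_on_path("~/.local/bin")

# shutil.which returns the full path of the executable, or None if it is still not found.
print(shutil.which("kaggle"))
```

If the last line prints a path, rerunning the download cell should find the command; if it prints `None`, the script was installed somewhere else.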
!unzip -o real-or-fake-fake-jobposting-prediction.zip
unzip: cannot find or open real-or-fake-fake-jobposting-prediction.zip, real-or-fake-fake-jobposting-prediction.zip.zip or real-or-fake-fake-jobposting-prediction.zip.ZIP.
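The `unzip` failure follows from the failed download: the archive was never created. As a hedged alternative, the sketch below checks for the archive before extracting it with the standard-library `zipfile` module; `extract_if_present` is a hypothetical helper introduced only for illustration.

```python
import zipfile
from pathlib import Path


def extract_if_present(archive, target_dir="."):
    """Extract `archive` into `target_dir`.

    Returns the list of extracted member names, or None if the archive is missing.
    """
    archive = Path(archive)
    if not archive.is_file():
        print(f"{archive} not found; run the download step first")
        return None
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)  # overwrites existing files, like `unzip -o`
        return zf.namelist()


extract_if_present("real-or-fake-fake-jobposting-prediction.zip")
```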
!pip install --user seaborn
Collecting seaborn
  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)
Requirement already satisfied: scipy>=1.0 in /usr/lib/python3/dist-packages (from seaborn) (1.3.3)
Requirement already satisfied: pandas>=0.23 in /usr/lib/python3/dist-packages (from seaborn) (0.25.3)
Requirement already satisfied: numpy>=1.15 in /usr/lib/python3/dist-packages (from seaborn) (1.17.4)
Requirement already satisfied: matplotlib>=2.2 in /home/students/s444463/.local/lib/python3.8/site-packages (from seaborn) (3.4.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/students/s444463/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (1.3.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.8/dist-packages (from matplotlib>=2.2->seaborn) (8.3.2)
Requirement already satisfied: cycler>=0.10 in /home/students/s444463/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in /usr/lib/python3/dist-packages (from matplotlib>=2.2->seaborn) (2.7.3)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.14.0)
Installing collected packages: seaborn
Successfully installed seaborn-0.11.2
WARNING: You are using pip version 21.2.4; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
import pandas as pd
data=pd.read_csv('fake_job_postings.csv')
data
| | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Marketing Intern | US, NY, New York | Marketing | NaN | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | 0 | 1 | 0 | Other | Internship | NaN | NaN | Marketing | 0 |
1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | NaN | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicable | NaN | Marketing and Advertising | Customer Service | 0 |
2 | 3 | Commissioning Machinery Assistant (CMA) | US, IA, Wever | NaN | NaN | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... | NaN | 0 | 1 | 0 | NaN | NaN | NaN | NaN | NaN | 0 |
3 | 4 | Account Executive - Washington DC | US, DC, Washington | Sales | NaN | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | 0 | 1 | 0 | Full-time | Mid-Senior level | Bachelor's Degree | Computer Software | Sales | 0 |
4 | 5 | Bill Review Manager | US, FL, Fort Worth | NaN | NaN | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Hospital & Health Care | Health Care Provider | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17875 | 17876 | Account Director - Distribution | CA, ON, Toronto | Sales | NaN | Vend is looking for some awesome new talent to... | Just in case this is the first time you’ve vis... | To ace this role you:Will eat comprehensive St... | What can you expect from us?We have an open cu... | 0 | 1 | 1 | Full-time | Mid-Senior level | NaN | Computer Software | Sales | 0 |
17876 | 17877 | Payroll Accountant | US, PA, Philadelphia | Accounting | NaN | WebLinc is the e-commerce platform and service... | The Payroll Accountant will focus primarily on... | - B.A. or B.S. in Accounting- Desire to have f... | Health & WellnessMedical planPrescription ... | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Internet | Accounting/Auditing | 0 |
17877 | 17878 | Project Cost Control Staff Engineer - Cost Con... | US, TX, Houston | NaN | NaN | We Provide Full Time Permanent Positions for m... | Experienced Project Cost Control Staff Enginee... | At least 12 years professional experience.Abil... | NaN | 0 | 0 | 0 | Full-time | NaN | NaN | NaN | NaN | 0 |
17878 | 17879 | Graphic Designer | NG, LA, Lagos | NaN | NaN | NaN | Nemsia Studios is looking for an experienced v... | 1. Must be fluent in the latest versions of Co... | Competitive salary (compensation will be based... | 0 | 0 | 1 | Contract | Not Applicable | Professional | Graphic Design | Design | 0 |
17879 | 17880 | Web Application Developers | NZ, N, Wellington | Engineering | NaN | Vend is looking for some awesome new talent to... | Who are we?Vend is an award winning web based ... | We want to hear from you if:You have an in-dep... | NaN | 0 | 1 | 1 | Full-time | Mid-Senior level | NaN | Computer Software | Engineering | 0 |
17880 rows × 18 columns
# Dataset size
!wc -l fake_job_postings.csv
17880 fake_job_postings.csv
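In general, `wc -l` counts newline characters, so the header row and any line breaks inside quoted fields are included, and the figure need not equal the number of records. The sketch below illustrates the difference on a small made-up CSV string using the standard-library `csv` module; `count_lines_and_records` and the sample text are hypothetical and not taken from the dataset.

```python
import csv
import io


def count_lines_and_records(text):
    """Return (physical lines, CSV data records excluding the header)."""
    physical_lines = text.count("\n")  # what `wc -l` counts
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader, None)  # skip the header row
    records = sum(1 for _ in reader)
    return physical_lines, records


# Made-up sample: two records, one of which contains a newline inside a quoted field.
sample = 'job_id,description\n1,"line one\nline two"\n2,single line\n'
print(count_lines_and_records(sample))  # prints (4, 2)
```

For the loaded DataFrame, `len(data)` or `data.shape` reports the record count directly.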
data = data.replace(np.nan, '', regex=True)
data
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_8616/866736318.py in <module>
----> 1 data = data.replace(np.nan, '', regex=True)
      2 data

NameError: name 'np' is not defined
data["department"].value_counts()
import numpy as np
data = data.replace(np.nan, '', regex=True)
data.describe(include='all')
| | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17880.000000 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880.000000 | 17880.000000 | 17880.000000 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880.000000 |
unique | NaN | 11231 | 3106 | 1338 | 875 | 1710 | 14802 | 11969 | 6206 | NaN | NaN | NaN | 6 | 8 | 14 | 132 | 38 | NaN |
top | NaN | English Teacher Abroad | GB, LND, London | | | | Play with kids, get paid for it Love travel? J... | | | NaN | NaN | NaN | Full-time | | | | | NaN |
freq | NaN | 311 | 718 | 11547 | 15012 | 3308 | 379 | 2695 | 7210 | NaN | NaN | NaN | 11620 | 7050 | 8105 | 4903 | 6455 | NaN |
mean | 8940.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.042897 | 0.795302 | 0.491723 | NaN | NaN | NaN | NaN | NaN | 0.048434 |
std | 5161.655742 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.202631 | 0.403492 | 0.499945 | NaN | NaN | NaN | NaN | NaN | 0.214688 |
min | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
25% | 4470.750000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
50% | 8940.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
75% | 13410.250000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 | 1.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
max | 17880.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 1.000000 | 1.000000 | NaN | NaN | NaN | NaN | NaN | 1.000000 |
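Replacing every NaN with an empty string makes `''` count as an ordinary value, so it can appear as the most frequent entry (`top`) of a text column in the summary. The sketch below shows one possible alternative: record the number of missing values first, then fill only the non-numeric columns. The toy frame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "department": ["Sales", np.nan, np.nan],
    "fraudulent": [0, 1, 0],
})

# Count missing values per column before they are overwritten.
missing_per_column = toy.isna().sum()

# Fill only the non-numeric (text) columns; numeric columns are left untouched.
text_columns = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna("")

print(missing_per_column)
print(filled["department"].value_counts())
```

Keeping `missing_per_column` around preserves the information about how sparse each column was, which is lost once the gaps are filled.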
data.median(numeric_only=True)
job_id              8940.5
telecommuting          0.0
has_company_logo       1.0
has_questions          0.0
fraudulent             0.0
dtype: float64
!pip install -U scikit-learn
from sklearn.model_selection import train_test_split
import sklearn
data_train, data_test = train_test_split(data, test_size=5000, random_state=1)
data_dev, data_test = train_test_split(data_test, test_size=2500, random_state=1)
data_train["title"].value_counts()
English Teacher Abroad                                230
Customer Service Associate                            106
Graduates: English Teacher Abroad (Conversational)     96
English Teacher Abroad                                 71
Software Engineer                                      67
                                                     ...
Physician - MD, CMO                                     1
Financial News Editor                                   1
Senior Client Services Engineer                         1
Online Marketing Manager Italy                          1
Infrastructure Project Manager                          1
Name: title, Length: 8461, dtype: int64
print(len(data_train))
print(len(data_dev))
print(len(data_test))
12880
2500
2500
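The two `train_test_split` calls first hold out 5000 rows and then divide them evenly into dev and test sets, leaving 12880 rows for training. Since the mean of `fraudulent` is about 0.048, the classes are imbalanced, and a stratified split is one option for keeping the label ratio similar across the three parts. The sketch below demonstrates this on a toy frame using the `stratify` parameter of `train_test_split`; the toy data and variable names are assumptions for illustration, and the notebook itself uses an unstratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data: 100 rows, 10% positive labels.
toy = pd.DataFrame({
    "job_id": range(1, 101),
    "fraudulent": [1] * 10 + [0] * 90,
})

# Hold out 40 rows, keeping the label ratio the same in both parts.
toy_train, toy_rest = train_test_split(
    toy, test_size=40, random_state=1, stratify=toy["fraudulent"]
)
# Split the held-out rows evenly into dev and test, again stratified.
toy_dev, toy_test = train_test_split(
    toy_rest, test_size=20, random_state=1, stratify=toy_rest["fraudulent"]
)

for name, part in [("train", toy_train), ("dev", toy_dev), ("test", toy_test)]:
    print(name, len(part), part["fraudulent"].mean())
```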
data_train["title"].value_counts()
English Teacher Abroad                                        235
Customer Service Associate                                    110
Graduates: English Teacher Abroad (Conversational)            104
English Teacher Abroad                                         72
Software Engineer                                              68
                                                             ...
Manager-Plastics Mfg Engineering - Full Time Permanent Job      1
Ruby on Rails Developer/Programmer                              1
Appliance Technician                                            1
Need Oracle Fusion HCM Resource                                 1
Recruitment specialist                                          1
Name: title, Length: 8761, dtype: int64