# Script for downloading the dataset
!pip install --user kaggle  # Kaggle API, used to download the dataset
!pip install --user pandas
!pip install --user numpy
Requirement already satisfied: kaggle in /home/mikolaj/.local/lib/python3.8/site-packages (1.5.12)
Requirement already satisfied: urllib3 in /usr/lib/python3/dist-packages (from kaggle) (1.22)
Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.11.0)
Requirement already satisfied: requests in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (2.25.1)
Requirement already satisfied: certifi in /usr/lib/python3/dist-packages (from kaggle) (2018.1.18)
Requirement already satisfied: python-slugify in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (6.1.1)
Requirement already satisfied: python-dateutil in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (2.8.1)
Requirement already satisfied: tqdm in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (4.59.0)
Requirement already satisfied: text-unidecode>=1.3 in /home/mikolaj/.local/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/lib/python3/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->kaggle) (2.10)
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Requirement already satisfied: pandas in /home/mikolaj/.local/lib/python3.8/site-packages (1.1.5)
Requirement already satisfied: numpy>=1.15.4 in /home/mikolaj/.local/lib/python3.8/site-packages (from pandas) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas) (2018.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/mikolaj/.local/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas) (1.11.0)
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.
# Instructions: https://www.kaggle.com/docs/api
!kaggle datasets download -d shivamb/real-or-fake-fake-jobposting-prediction
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/mikolaj/.kaggle/kaggle.json'
real-or-fake-fake-jobposting-prediction.zip: Skipping, found more recently modified local copy (use --force to force download)
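The warning above says the token file is readable by other users. A minimal sketch of a Python equivalent of the suggested `chmod 600`, assuming a POSIX system; `restrict_to_owner` is a hypothetical helper name and not part of the Kaggle API.

```python
import os
import stat


def restrict_to_owner(path):
    """Set the file mode to 0o600 (owner read/write only) and return the resulting mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Usage (path taken from the comment above the download command):
# restrict_to_owner(os.path.expanduser("~/.kaggle/kaggle.json"))
```

Running the helper once should be enough for the warning to disappear on the next `kaggle` call.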
!unzip -o real-or-fake-fake-jobposting-prediction.zip
Archive: real-or-fake-fake-jobposting-prediction.zip
inflating: fake_job_postings.csv
!pip install --user seaborn
Requirement already satisfied: seaborn in /home/mikolaj/.local/lib/python3.8/site-packages (0.11.2)
Requirement already satisfied: scipy>=1.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (1.7.2)
Requirement already satisfied: matplotlib>=2.2 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (3.4.2)
Requirement already satisfied: numpy>=1.15 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (1.19.5)
Requirement already satisfied: pandas>=0.23 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (1.1.5)
Requirement already satisfied: python-dateutil>=2.7 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)
Requirement already satisfied: pillow>=6.2.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)
Requirement already satisfied: cycler>=0.10 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas>=0.23->seaborn) (2018.3)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.11.0)
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
import pandas as pd
data=pd.read_csv('fake_job_postings.csv')
data
 | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Marketing Intern | US, NY, New York | Marketing | NaN | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | 0 | 1 | 0 | Other | Internship | NaN | NaN | Marketing | 0 |
1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | NaN | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicable | NaN | Marketing and Advertising | Customer Service | 0 |
2 | 3 | Commissioning Machinery Assistant (CMA) | US, IA, Wever | NaN | NaN | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... | NaN | 0 | 1 | 0 | NaN | NaN | NaN | NaN | NaN | 0 |
3 | 4 | Account Executive - Washington DC | US, DC, Washington | Sales | NaN | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | 0 | 1 | 0 | Full-time | Mid-Senior level | Bachelor's Degree | Computer Software | Sales | 0 |
4 | 5 | Bill Review Manager | US, FL, Fort Worth | NaN | NaN | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Hospital & Health Care | Health Care Provider | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17875 | 17876 | Account Director - Distribution | CA, ON, Toronto | Sales | NaN | Vend is looking for some awesome new talent to... | Just in case this is the first time you’ve vis... | To ace this role you:Will eat comprehensive St... | What can you expect from us?We have an open cu... | 0 | 1 | 1 | Full-time | Mid-Senior level | NaN | Computer Software | Sales | 0 |
17876 | 17877 | Payroll Accountant | US, PA, Philadelphia | Accounting | NaN | WebLinc is the e-commerce platform and service... | The Payroll Accountant will focus primarily on... | - B.A. or B.S. in Accounting- Desire to have f... | Health & WellnessMedical planPrescription ... | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Internet | Accounting/Auditing | 0 |
17877 | 17878 | Project Cost Control Staff Engineer - Cost Con... | US, TX, Houston | NaN | NaN | We Provide Full Time Permanent Positions for m... | Experienced Project Cost Control Staff Enginee... | At least 12 years professional experience.Abil... | NaN | 0 | 0 | 0 | Full-time | NaN | NaN | NaN | NaN | 0 |
17878 | 17879 | Graphic Designer | NG, LA, Lagos | NaN | NaN | NaN | Nemsia Studios is looking for an experienced v... | 1. Must be fluent in the latest versions of Co... | Competitive salary (compensation will be based... | 0 | 0 | 1 | Contract | Not Applicable | Professional | Graphic Design | Design | 0 |
17879 | 17880 | Web Application Developers | NZ, N, Wellington | Engineering | NaN | Vend is looking for some awesome new talent to... | Who are we?Vend is an award winning web based ... | We want to hear from you if:You have an in-dep... | NaN | 0 | 1 | 1 | Full-time | Mid-Senior level | NaN | Computer Software | Engineering | 0 |
17880 rows × 18 columns
# Dataset size (number of lines in the CSV file)
!wc -l fake_job_postings.csv
17880 fake_job_postings.csv
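`wc -l` counts newline characters, while the DataFrame counts parsed records, so the two numbers measure different things: the header adds a line, and a quoted field may span several lines. A small sketch with the standard `csv` module illustrates the difference; the sample text and the `count_csv_records` helper are made up for illustration.

```python
import csv
import io


def count_csv_records(text_stream):
    """Count data records in a CSV stream, skipping the header row."""
    reader = csv.reader(text_stream)
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


# Hypothetical sample: the first record has a line break inside a quoted field.
sample = 'job_id,description\n1,"line one\nline two"\n2,"single line"\n'

physical_lines = sample.count("\n")  # what `wc -l` would report: 4
records = count_csv_records(io.StringIO(sample, newline=""))  # parsed records: 2
print(physical_lines, records)
```

To apply the same check to the real file, the stream could be opened with `open('fake_job_postings.csv', newline='', encoding='utf-8')`; the encoding is an assumption here.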
import numpy as np  # needed for np.nan below

# Replace missing values with empty strings
data = data.replace(np.nan, '', regex=True)
data
 | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Marketing Intern | US, NY, New York | Marketing |  | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... |  | 0 | 1 | 0 | Other | Internship |  |  | Marketing | 0 |
1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success |  | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicable |  | Marketing and Advertising | Customer Service | 0 |
2 | 3 | Commissioning Machinery Assistant (CMA) | US, IA, Wever |  |  | Valor Services provides Workforce Solutions th... | Our client, located in Houston, is actively se... | Implement pre-commissioning and commissioning ... |  | 0 | 1 | 0 |  |  |  |  |  | 0 |
3 | 4 | Account Executive - Washington DC | US, DC, Washington | Sales |  | Our passion for improving quality of life thro... | THE COMPANY: ESRI – Environmental Systems Rese... | EDUCATION: Bachelor’s or Master’s in GIS, busi... | Our culture is anything but corporate—we have ... | 0 | 1 | 0 | Full-time | Mid-Senior level | Bachelor's Degree | Computer Software | Sales | 0 |
4 | 5 | Bill Review Manager | US, FL, Fort Worth |  |  | SpotSource Solutions LLC is a Global Human Cap... | JOB TITLE: Itemization Review ManagerLOCATION:... | QUALIFICATIONS:RN license in the State of Texa... | Full Benefits Offered | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Hospital & Health Care | Health Care Provider | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17875 | 17876 | Account Director - Distribution | CA, ON, Toronto | Sales |  | Vend is looking for some awesome new talent to... | Just in case this is the first time you’ve vis... | To ace this role you:Will eat comprehensive St... | What can you expect from us?We have an open cu... | 0 | 1 | 1 | Full-time | Mid-Senior level |  | Computer Software | Sales | 0 |
17876 | 17877 | Payroll Accountant | US, PA, Philadelphia | Accounting |  | WebLinc is the e-commerce platform and service... | The Payroll Accountant will focus primarily on... | - B.A. or B.S. in Accounting- Desire to have f... | Health & WellnessMedical planPrescription ... | 0 | 1 | 1 | Full-time | Mid-Senior level | Bachelor's Degree | Internet | Accounting/Auditing | 0 |
17877 | 17878 | Project Cost Control Staff Engineer - Cost Con... | US, TX, Houston |  |  | We Provide Full Time Permanent Positions for m... | Experienced Project Cost Control Staff Enginee... | At least 12 years professional experience.Abil... |  | 0 | 0 | 0 | Full-time |  |  |  |  | 0 |
17878 | 17879 | Graphic Designer | NG, LA, Lagos |  |  |  | Nemsia Studios is looking for an experienced v... | 1. Must be fluent in the latest versions of Co... | Competitive salary (compensation will be based... | 0 | 0 | 1 | Contract | Not Applicable | Professional | Graphic Design | Design | 0 |
17879 | 17880 | Web Application Developers | NZ, N, Wellington | Engineering |  | Vend is looking for some awesome new talent to... | Who are we?Vend is an award winning web based ... | We want to hear from you if:You have an in-dep... |  | 0 | 1 | 1 | Full-time | Mid-Senior level |  | Computer Software | Engineering | 0 |
17880 rows × 18 columns
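The replacement above blanks out every missing value in the frame. A possible alternative, shown as a sketch on a toy frame, is to fill only the non-numeric columns so that numeric columns keep their dtype and any future NaN in them stays visible. `blank_missing_text` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np
import pandas as pd


def blank_missing_text(df):
    """Return a copy in which NaN in non-numeric columns becomes ''; numeric columns are untouched."""
    out = df.copy()
    text_cols = out.select_dtypes(exclude="number").columns
    out[text_cols] = out[text_cols].fillna("")
    return out


# Toy data with the same column names as the dataset (values are made up).
toy = pd.DataFrame({"department": ["Sales", np.nan], "telecommuting": [0, 1]})
cleaned = blank_missing_text(toy)
print(cleaned)
```

For this dataset the result should match the `replace` call, since the numeric columns shown in the preview contain no missing values.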
data["department"].value_counts()
                            11547
Sales                         551
Engineering                   487
Marketing                     401
Operations                    270
                            ...
Pricing                         1
Mobility                        1
Housekeeping                    1
An Impact Engine Company        1
Trainee                         1
Name: department, Length: 1338, dtype: int64
data = data.replace(np.nan, '', regex=True)
data.describe(include='all')
 | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 17880.000000 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880.000000 | 17880.000000 | 17880.000000 | 17880 | 17880 | 17880 | 17880 | 17880 | 17880.000000 |
unique | NaN | 11231 | 3106 | 1338 | 875 | 1710 | 14802 | 11969 | 6206 | NaN | NaN | NaN | 6 | 8 | 14 | 132 | 38 | NaN |
top | NaN | English Teacher Abroad | GB, LND, London |  |  |  | Play with kids, get paid for it Love travel? J... |  |  | NaN | NaN | NaN | Full-time |  |  |  |  | NaN |
freq | NaN | 311 | 718 | 11547 | 15012 | 3308 | 379 | 2695 | 7210 | NaN | NaN | NaN | 11620 | 7050 | 8105 | 4903 | 6455 | NaN |
mean | 8940.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.042897 | 0.795302 | 0.491723 | NaN | NaN | NaN | NaN | NaN | 0.048434 |
std | 5161.655742 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.202631 | 0.403492 | 0.499945 | NaN | NaN | NaN | NaN | NaN | 0.214688 |
min | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
25% | 4470.750000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
50% | 8940.500000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
75% | 13410.250000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 1.000000 | 1.000000 | NaN | NaN | NaN | NaN | NaN | 0.000000 |
max | 17880.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 1.000000 | 1.000000 | NaN | NaN | NaN | NaN | NaN | 1.000000 |
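The `mean` of the binary `fraudulent` column in the summary above is the share of fraudulent postings, so the class counts can be recovered from it. A short worked calculation, using only the numbers printed in the table:

```python
n_rows = 17880          # count from describe()
fraud_share = 0.048434  # mean of `fraudulent` from describe(), rounded to 6 decimals

n_fraud = round(n_rows * fraud_share)  # 866 fraudulent postings
n_legit = n_rows - n_fraud             # 17014 legitimate postings

# Accuracy of a trivial classifier that always predicts "not fraudulent".
majority_baseline = n_legit / n_rows

print(n_fraud, n_legit, round(majority_baseline, 4))
```

With roughly 95% of postings in the majority class, plain accuracy is a weak yardstick for any model trained on this data.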
data.median(numeric_only=True)  # text columns have no median, so restrict to numeric ones
job_id              8940.5
telecommuting          0.0
has_company_logo       1.0
has_questions          0.0
fraudulent             0.0
dtype: float64
pip install -U scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
Downloading scikit_learn-1.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Requirement already satisfied: numpy>=1.14.6 in /home/mikolaj/.local/lib/python3.8/site-packages (from scikit-learn) (1.19.5)
Requirement already satisfied: scipy>=1.1.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from scikit-learn) (1.7.2)
Collecting threadpoolctl>=2.0.0
Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Requirement already satisfied: joblib>=0.11 in /home/mikolaj/.local/lib/python3.8/site-packages (from scikit-learn) (1.1.0)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.0.2 threadpoolctl-3.1.0
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
from sklearn.model_selection import train_test_split
import sklearn
data_train, data_test = train_test_split(data, random_state=1)
data_train["title"].value_counts()
English Teacher Abroad                                        235
Customer Service Associate                                    110
Graduates: English Teacher Abroad (Conversational)            104
English Teacher Abroad                                         72
Software Engineer                                              68
                                                             ...
Manager-Plastics Mfg Engineering - Full Time Permanent Job      1
Ruby on Rails Developer/Programmer                              1
Appliance Technician                                            1
Need Oracle Fusion HCM Resource                                 1
Recruitment specialist                                          1
Name: title, Length: 8761, dtype: int64
data_test.size/data_train.size
0.3333333333333333
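The ratio of one third follows from the default split: `train_test_split` was called without `test_size`, which falls back to 0.25, so test/train is 0.25/0.75. Since `.size` counts cells (rows times 18 columns), the column factor cancels and the ratio equals the row ratio. A minimal sketch of the arithmetic, assuming the test size is rounded up as scikit-learn does to my understanding; `split_sizes` is a hypothetical helper:

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(17880)
ratio = n_test / n_train
print(n_train, n_test, ratio)  # 13410 training rows, 4470 test rows
```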