ium_444463/download_data.ipynb
2022-03-20 22:09:51 +01:00

64 KiB

# Script for downloading the dataset
!pip install --user kaggle  # Kaggle API, used to download the dataset
!pip install --user pandas
!pip install --user numpy
Requirement already satisfied: kaggle in /home/mikolaj/.local/lib/python3.8/site-packages (1.5.12)
Requirement already satisfied: urllib3 in /usr/lib/python3/dist-packages (from kaggle) (1.22)
Requirement already satisfied: six>=1.10 in /usr/lib/python3/dist-packages (from kaggle) (1.11.0)
Requirement already satisfied: requests in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (2.25.1)
Requirement already satisfied: certifi in /usr/lib/python3/dist-packages (from kaggle) (2018.1.18)
Requirement already satisfied: python-slugify in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (6.1.1)
Requirement already satisfied: python-dateutil in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (2.8.1)
Requirement already satisfied: tqdm in /home/mikolaj/.local/lib/python3.8/site-packages (from kaggle) (4.59.0)
Requirement already satisfied: text-unidecode>=1.3 in /home/mikolaj/.local/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/lib/python3/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /home/mikolaj/.local/lib/python3.8/site-packages (from requests->kaggle) (2.10)
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Requirement already satisfied: pandas in /home/mikolaj/.local/lib/python3.8/site-packages (1.1.5)
Requirement already satisfied: numpy>=1.15.4 in /home/mikolaj/.local/lib/python3.8/site-packages (from pandas) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas) (2018.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /home/mikolaj/.local/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.7.3->pandas) (1.11.0)
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
# For the command below to work, you need a ~/.kaggle/kaggle.json file containing your Kaggle API token.
# Instructions: https://www.kaggle.com/docs/api
!kaggle datasets download -d shivamb/real-or-fake-fake-jobposting-prediction
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/mikolaj/.kaggle/kaggle.json'
real-or-fake-fake-jobposting-prediction.zip: Skipping, found more recently modified local copy (use --force to force download)
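The warning above suggests running `chmod 600` on the token file. A minimal sketch of the same check and fix from Python, using only the standard library, might look like this. The helper names `is_private` and `make_private` are hypothetical, and the demonstration runs on a throwaway file rather than the real `~/.kaggle/kaggle.json`.

```python
import os
import stat
import tempfile


def is_private(path):
    """Return True if neither group nor others have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def make_private(path):
    """Equivalent of `chmod 600`: read/write for the owner only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demonstration on a temporary stand-in for the token file (POSIX systems assumed).
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "kaggle.json")
    with open(demo, "w") as fh:
        fh.write("{}")
    os.chmod(demo, 0o644)        # world-readable, like the file in the warning
    before = is_private(demo)
    make_private(demo)
    after = is_private(demo)
```

To apply it to the real token, one would call `make_private(os.path.expanduser("~/.kaggle/kaggle.json"))`.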
!unzip -o real-or-fake-fake-jobposting-prediction.zip
Archive:  real-or-fake-fake-jobposting-prediction.zip
  inflating: fake_job_postings.csv   
!pip install --user seaborn
Requirement already satisfied: seaborn in /home/mikolaj/.local/lib/python3.8/site-packages (0.11.2)
Requirement already satisfied: scipy>=1.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (1.7.2)
Requirement already satisfied: matplotlib>=2.2 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (3.4.2)
Requirement already satisfied: numpy>=1.15 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (1.19.5)
Requirement already satisfied: pandas>=0.23 in /home/mikolaj/.local/lib/python3.8/site-packages (from seaborn) (1.1.5)
Requirement already satisfied: python-dateutil>=2.7 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)
Requirement already satisfied: pillow>=6.2.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)
Requirement already satisfied: cycler>=0.10 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in /home/mikolaj/.local/lib/python3.8/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas>=0.23->seaborn) (2018.3)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.11.0)
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
import pandas as pd
data=pd.read_csv('fake_job_postings.csv')
data
job_id title location department salary_range company_profile description requirements benefits telecommuting has_company_logo has_questions employment_type required_experience required_education industry function fraudulent
0 1 Marketing Intern US, NY, New York Marketing NaN We're Food52, and we've created a groundbreaki... Food52, a fast-growing, James Beard Award-winn... Experience with content management systems a m... NaN 0 1 0 Other Internship NaN NaN Marketing 0
1 2 Customer Service - Cloud Video Production NZ, , Auckland Success NaN 90 Seconds, the worlds Cloud Video Production ... Organised - Focused - Vibrant - Awesome!Do you... What we expect from you:Your key responsibilit... What you will get from usThrough being part of... 0 1 0 Full-time Not Applicable NaN Marketing and Advertising Customer Service 0
2 3 Commissioning Machinery Assistant (CMA) US, IA, Wever NaN NaN Valor Services provides Workforce Solutions th... Our client, located in Houston, is actively se... Implement pre-commissioning and commissioning ... NaN 0 1 0 NaN NaN NaN NaN NaN 0
3 4 Account Executive - Washington DC US, DC, Washington Sales NaN Our passion for improving quality of life thro... THE COMPANY: ESRI Environmental Systems Rese... EDUCATION: Bachelors or Masters in GIS, busi... Our culture is anything but corporate—we have ... 0 1 0 Full-time Mid-Senior level Bachelor's Degree Computer Software Sales 0
4 5 Bill Review Manager US, FL, Fort Worth NaN NaN SpotSource Solutions LLC is a Global Human Cap... JOB TITLE: Itemization Review ManagerLOCATION:... QUALIFICATIONS:RN license in the State of Texa... Full Benefits Offered 0 1 1 Full-time Mid-Senior level Bachelor's Degree Hospital & Health Care Health Care Provider 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17875 17876 Account Director - Distribution CA, ON, Toronto Sales NaN Vend is looking for some awesome new talent to... Just in case this is the first time youve vis... To ace this role you:Will eat comprehensive St... What can you expect from us?We have an open cu... 0 1 1 Full-time Mid-Senior level NaN Computer Software Sales 0
17876 17877 Payroll Accountant US, PA, Philadelphia Accounting NaN WebLinc is the e-commerce platform and service... The Payroll Accountant will focus primarily on... - B.A. or B.S. in Accounting- Desire to have f... Health &amp; WellnessMedical planPrescription ... 0 1 1 Full-time Mid-Senior level Bachelor's Degree Internet Accounting/Auditing 0
17877 17878 Project Cost Control Staff Engineer - Cost Con... US, TX, Houston NaN NaN We Provide Full Time Permanent Positions for m... Experienced Project Cost Control Staff Enginee... At least 12 years professional experience.Abil... NaN 0 0 0 Full-time NaN NaN NaN NaN 0
17878 17879 Graphic Designer NG, LA, Lagos NaN NaN NaN Nemsia Studios is looking for an experienced v... 1. Must be fluent in the latest versions of Co... Competitive salary (compensation will be based... 0 0 1 Contract Not Applicable Professional Graphic Design Design 0
17879 17880 Web Application Developers NZ, N, Wellington Engineering NaN Vend is looking for some awesome new talent to... Who are we?Vend is an award winning web based ... We want to hear from you if:You have an in-dep... NaN 0 1 1 Full-time Mid-Senior level NaN Computer Software Engineering 0

17880 rows × 18 columns

# Dataset size
!wc -l fake_job_postings.csv
17880 fake_job_postings.csv
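Note that `wc -l` counts newline characters, not CSV records, so in general its result need not equal the number of rows pandas reports: the header adds a line, and quoted text fields may contain line breaks of their own. The sketch below illustrates the difference on a small made-up CSV string (the `raw` content is hypothetical, not taken from the dataset).

```python
import csv
import io

# A made-up CSV with a header and one quoted field that spans two lines.
raw = 'job_id,description\n1,"first line\nsecond line"\n2,plain text\n'

# What `wc -l` would report: the number of newline characters.
line_count = raw.count("\n")

# What a CSV parser sees: one header plus the data records.
records = list(csv.reader(io.StringIO(raw)))
row_count = len(records) - 1  # subtract the header row
```

Here `line_count` is 4 while `row_count` is 2, which is why `len(data)` is the more reliable way to get the dataset size.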
import numpy as np  # np.nan is used below, but numpy was never imported
data = data.replace(np.nan, '', regex=True)
data
job_id title location department salary_range company_profile description requirements benefits telecommuting has_company_logo has_questions employment_type required_experience required_education industry function fraudulent
0 1 Marketing Intern US, NY, New York Marketing We're Food52, and we've created a groundbreaki... Food52, a fast-growing, James Beard Award-winn... Experience with content management systems a m... 0 1 0 Other Internship Marketing 0
1 2 Customer Service - Cloud Video Production NZ, , Auckland Success 90 Seconds, the worlds Cloud Video Production ... Organised - Focused - Vibrant - Awesome!Do you... What we expect from you:Your key responsibilit... What you will get from usThrough being part of... 0 1 0 Full-time Not Applicable Marketing and Advertising Customer Service 0
2 3 Commissioning Machinery Assistant (CMA) US, IA, Wever Valor Services provides Workforce Solutions th... Our client, located in Houston, is actively se... Implement pre-commissioning and commissioning ... 0 1 0 0
3 4 Account Executive - Washington DC US, DC, Washington Sales Our passion for improving quality of life thro... THE COMPANY: ESRI Environmental Systems Rese... EDUCATION: Bachelors or Masters in GIS, busi... Our culture is anything but corporate—we have ... 0 1 0 Full-time Mid-Senior level Bachelor's Degree Computer Software Sales 0
4 5 Bill Review Manager US, FL, Fort Worth SpotSource Solutions LLC is a Global Human Cap... JOB TITLE: Itemization Review ManagerLOCATION:... QUALIFICATIONS:RN license in the State of Texa... Full Benefits Offered 0 1 1 Full-time Mid-Senior level Bachelor's Degree Hospital & Health Care Health Care Provider 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17875 17876 Account Director - Distribution CA, ON, Toronto Sales Vend is looking for some awesome new talent to... Just in case this is the first time youve vis... To ace this role you:Will eat comprehensive St... What can you expect from us?We have an open cu... 0 1 1 Full-time Mid-Senior level Computer Software Sales 0
17876 17877 Payroll Accountant US, PA, Philadelphia Accounting WebLinc is the e-commerce platform and service... The Payroll Accountant will focus primarily on... - B.A. or B.S. in Accounting- Desire to have f... Health &amp; WellnessMedical planPrescription ... 0 1 1 Full-time Mid-Senior level Bachelor's Degree Internet Accounting/Auditing 0
17877 17878 Project Cost Control Staff Engineer - Cost Con... US, TX, Houston We Provide Full Time Permanent Positions for m... Experienced Project Cost Control Staff Enginee... At least 12 years professional experience.Abil... 0 0 0 Full-time 0
17878 17879 Graphic Designer NG, LA, Lagos Nemsia Studios is looking for an experienced v... 1. Must be fluent in the latest versions of Co... Competitive salary (compensation will be based... 0 0 1 Contract Not Applicable Professional Graphic Design Design 0
17879 17880 Web Application Developers NZ, N, Wellington Engineering Vend is looking for some awesome new talent to... Who are we?Vend is an award winning web based ... We want to hear from you if:You have an in-dep... 0 1 1 Full-time Mid-Senior level Computer Software Engineering 0

17880 rows × 18 columns

data["department"].value_counts()
                            11547
Sales                         551
Engineering                   487
Marketing                     401
Operations                    270
                            ...  
Pricing                         1
Mobility                        1
Housekeeping                    1
An Impact Engine Company        1
Trainee                         1
Name: department, Length: 1338, dtype: int64
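The first entry in the output above, with a blank label and a count of 11547, is the empty string that replaced `NaN` earlier. After that replacement, `isna()` no longer finds missing values, so they have to be counted either before the replacement or by comparing against `''`. A small sketch on a toy frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the job postings table.
toy = pd.DataFrame({
    "department": ["Sales", np.nan, np.nan, "Engineering"],
    "fraudulent": [0, 0, 1, 0],
})

# Count missing values while they are still NaN.
missing_before = int(toy["department"].isna().sum())

# Same effect on this frame as replacing np.nan with ''.
filled = toy.fillna("")

# isna() no longer detects the gaps; they are now empty strings.
missing_after = int(filled["department"].isna().sum())
empty_after = int((filled["department"] == "").sum())
```

An alternative that keeps the `NaN` values visible is `value_counts(dropna=False)` on the unmodified column.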
data = data.replace(np.nan, '', regex=True)
data.describe(include='all')
job_id title location department salary_range company_profile description requirements benefits telecommuting has_company_logo has_questions employment_type required_experience required_education industry function fraudulent
count 17880.000000 17880 17880 17880 17880 17880 17880 17880 17880 17880.000000 17880.000000 17880.000000 17880 17880 17880 17880 17880 17880.000000
unique NaN 11231 3106 1338 875 1710 14802 11969 6206 NaN NaN NaN 6 8 14 132 38 NaN
top NaN English Teacher Abroad GB, LND, London Play with kids, get paid for it Love travel? J... NaN NaN NaN Full-time NaN
freq NaN 311 718 11547 15012 3308 379 2695 7210 NaN NaN NaN 11620 7050 8105 4903 6455 NaN
mean 8940.500000 NaN NaN NaN NaN NaN NaN NaN NaN 0.042897 0.795302 0.491723 NaN NaN NaN NaN NaN 0.048434
std 5161.655742 NaN NaN NaN NaN NaN NaN NaN NaN 0.202631 0.403492 0.499945 NaN NaN NaN NaN NaN 0.214688
min 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN 0.000000
25% 4470.750000 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 1.000000 0.000000 NaN NaN NaN NaN NaN 0.000000
50% 8940.500000 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 1.000000 0.000000 NaN NaN NaN NaN NaN 0.000000
75% 13410.250000 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 1.000000 1.000000 NaN NaN NaN NaN NaN 0.000000
max 17880.000000 NaN NaN NaN NaN NaN NaN NaN NaN 1.000000 1.000000 1.000000 NaN NaN NaN NaN NaN 1.000000
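data.median(numeric_only=True)  # restrict to numeric columns; newer pandas versions raise on text columns otherwise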
job_id              8940.5
telecommuting          0.0
has_company_logo       1.0
has_questions          0.0
fraudulent             0.0
dtype: float64
%pip install -U scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
     |████████████████████████████████| 26.7 MB 8.8 MB/s
Requirement already satisfied: numpy>=1.14.6 in /home/mikolaj/.local/lib/python3.8/site-packages (from scikit-learn) (1.19.5)
Requirement already satisfied: scipy>=1.1.0 in /home/mikolaj/.local/lib/python3.8/site-packages (from scikit-learn) (1.7.2)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Requirement already satisfied: joblib>=0.11 in /home/mikolaj/.local/lib/python3.8/site-packages (from scikit-learn) (1.1.0)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.0.2 threadpoolctl-3.1.0
WARNING: You are using pip version 21.3.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
from sklearn.model_selection import train_test_split
import sklearn
data_train, data_test = train_test_split(data, random_state=1)
data_train["title"].value_counts()
English Teacher Abroad                                        235
Customer Service Associate                                    110
Graduates: English Teacher Abroad (Conversational)            104
English Teacher Abroad                                         72
Software Engineer                                              68
                                                             ... 
Manager-Plastics Mfg Engineering - Full Time Permanent Job      1
Ruby on Rails Developer/Programmer                              1
Appliance Technician                                            1
Need Oracle Fusion HCM Resource                                 1
Recruitment specialist                                          1
Name: title, Length: 8761, dtype: int64
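"English Teacher Abroad" appears twice in the counts above (235 and 72), which means the two labels are not identical strings. One possible explanation, not verified against the data, is an invisible difference such as trailing whitespace. The sketch below shows how `str.strip()` would merge such variants; the `titles` series is made up for illustration.

```python
import pandas as pd

# Made-up titles; the second entry has a trailing space.
titles = pd.Series([
    "English Teacher Abroad",
    "English Teacher Abroad ",
    "Software Engineer",
    "English Teacher Abroad",
])

# Raw counts treat the padded title as a separate category.
raw_counts = titles.value_counts()

# Stripping surrounding whitespace merges the variants.
clean_counts = titles.str.strip().value_counts()
```

If whitespace turns out not to be the cause, comparing `repr()` of the two titles would show where they actually differ.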
data_test.size/data_train.size
0.3333333333333333
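The ratio of one third follows from the default split: when neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows. The arithmetic can be sketched without scikit-learn, assuming the test set size is rounded up and the remainder goes to training:

```python
import math
from fractions import Fraction

n_samples = 17880   # rows in the DataFrame above
test_size = 0.25    # scikit-learn's default when no sizes are given

# Assumed rounding: test set rounded up, training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# DataFrame.size is rows * columns; both parts have the same columns,
# so the ratio of sizes equals the ratio of row counts.
ratio = Fraction(n_test, n_train)
```

This gives 4470 test rows, 13410 training rows, and a ratio of exactly 1/3, matching the output above.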