projekt-UMA/UMA12_ANTONIO_ProjektFakeNews.ipynb
2022-06-20 22:40:40 +02:00

{
"cells": [
{
"cell_type": "markdown",
"id": "statewide-crown",
"metadata": {},
"source": [
"**12. Project**\n",
"================="
]
},
{
"cell_type": "markdown",
"id": "explicit-bunch",
"metadata": {},
"source": [
"## **1. Project goal**"
]
},
{
"cell_type": "markdown",
"id": "encouraging-officer",
"metadata": {},
"source": [
"#### The goal of the project is to predict which news items in the dataset are fake news. Algorithms used:\n",
"* TfidfVectorizer\n",
"* PassiveAggressiveClassifier\n",
"\n",
"Description of the algorithms.\n",
"\n",
"**TF (Term Frequency):** The number of times a word occurs in a document is its term frequency. A higher value means that the term appears more often than others, so the document is a good match when that term is among the search terms.\n",
"\n",
"**IDF (Inverse Document Frequency):** Words that occur in many documents of the collection carry little information, so IDF gives them a lower weight than words that occur in only a few documents.\n",
"\n",
"The TfidfVectorizer converts a collection of documents into a matrix of TF-IDF features.\n",
"\n",
"**Passive-aggressive algorithms** are online learning algorithms. Such an algorithm stays passive when a classification is correct and becomes aggressive when it is wrong, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector.\n",
"\n",
"\n",
"\n"
]
},
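{
"cell_type": "markdown",
"id": "sketch-pa-update-note",
"metadata": {},
"source": [
"The passive and aggressive behaviour described above can be illustrated with a minimal NumPy sketch of a single update step. It assumes the PA-I variant with hinge loss and labels encoded as -1/+1. The helper `pa_update` and the toy vectors are hypothetical; they are not part of this project's pipeline and do not reproduce scikit-learn's internal implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-pa-update-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical helper: one PA-I step for a label y in {-1, +1}.\n",
"def pa_update(w, x, y, C=1.0):\n",
"    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this example\n",
"    if loss == 0.0:\n",
"        return w  # passive: the margin is satisfied, weights stay unchanged\n",
"    tau = min(C, loss / np.dot(x, x))  # aggressive: step size capped by C\n",
"    return w + tau * y * x\n",
"\n",
"w0 = np.zeros(2)\n",
"w1 = pa_update(w0, np.array([1.0, 2.0]), +1)  # loss is 1, so the weights move\n",
"w2 = pa_update(w1, np.array([2.0, 4.0]), +1)  # margin exceeds 1, so nothing changes\n",
"print(w1, np.array_equal(w1, w2))"
]
},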
{
"cell_type": "markdown",
"id": "informal-filename",
"metadata": {},
"source": [
"#### The news.csv data used for training comes from https://paperswithcode.com/datasets?task=fake-news-detection"
]
},
{
"cell_type": "markdown",
"id": "beginning-minute",
"metadata": {},
"source": [
"## **2. Importing the required libraries**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "effective-democracy",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import itertools\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import PassiveAggressiveClassifier\n",
"from sklearn.metrics import accuracy_score, confusion_matrix"
]
},
{
"cell_type": "markdown",
"id": "alternative-knock",
"metadata": {},
"source": [
"## **3. Loading the data**"
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "worldwide-blake",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>title</th>\n",
" <th>text</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>8476</td>\n",
" <td>You Can Smell Hillarys Fear</td>\n",
" <td>Daniel Greenfield, a Shillman Journalism Fello...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10294</td>\n",
" <td>Watch The Exact Moment Paul Ryan Committed Pol...</td>\n",
" <td>Google Pinterest Digg Linkedin Reddit Stumbleu...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3608</td>\n",
" <td>Kerry to go to Paris in gesture of sympathy</td>\n",
" <td>U.S. Secretary of State John F. Kerry said Mon...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>10142</td>\n",
" <td>Bernie supporters on Twitter erupt in anger ag...</td>\n",
" <td>— Kaydee King (@KaydeeKing) November 9, 2016 T...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>875</td>\n",
" <td>The Battle of New York: Why This Primary Matters</td>\n",
" <td>It's primary day in New York and front-runners...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6903</td>\n",
" <td>Tehran, USA</td>\n",
" <td>\\nIm not an immigrant, but my grandparents ...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7341</td>\n",
" <td>Girl Horrified At What She Watches Boyfriend D...</td>\n",
" <td>Share This Baylee Luciani (left), Screenshot o...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>95</td>\n",
" <td>Britains Schindler Dies at 106</td>\n",
" <td>A Czech stockbroker who saved more than 650 Je...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>4869</td>\n",
" <td>Fact check: Trump and Clinton at the 'commande...</td>\n",
" <td>Hillary Clinton and Donald Trump made some ina...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>2909</td>\n",
" <td>Iran reportedly makes new push for uranium con...</td>\n",
" <td>Iranian negotiators reportedly have made a las...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1357</td>\n",
" <td>With all three Clintons in Iowa, a glimpse at ...</td>\n",
" <td>CEDAR RAPIDS, Iowa — “I had one of the most wo...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>988</td>\n",
" <td>Donald Trumps Shockingly Weak Delegate Game S...</td>\n",
" <td>Donald Trumps organizational problems have go...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>7041</td>\n",
" <td>Strong Solar Storm, Tech Risks Today | S0 News...</td>\n",
" <td>Click Here To Learn More About Alexandra's Per...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>7623</td>\n",
" <td>10 Ways America Is Preparing for World War 3</td>\n",
" <td>October 31, 2016 at 4:52 am \\nPretty factual e...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>1571</td>\n",
" <td>Trump takes on Cruz, but lightly</td>\n",
" <td>Killing Obama administration rules, dismantlin...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>4739</td>\n",
" <td>How women lead differently</td>\n",
" <td>As more women move into high offices, they oft...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>7737</td>\n",
" <td>Shocking! Michele Obama &amp; Hillary Caught Glamo...</td>\n",
" <td>Shocking! Michele Obama &amp; Hillary Caught Glamo...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>8716</td>\n",
" <td>Hillary Clinton in HUGE Trouble After America ...</td>\n",
" <td>0 \\nHillary Clinton has barely just lost the p...</td>\n",
" <td>FAKE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>3304</td>\n",
" <td>What's in that Iran bill that Obama doesn't like?</td>\n",
" <td>Washington (CNN) For months, the White House a...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>3078</td>\n",
" <td>The 1 chart that explains everything you need ...</td>\n",
" <td>While paging through Pew's best data visualiza...</td>\n",
" <td>REAL</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 title \\\n",
"0 8476 You Can Smell Hillarys Fear \n",
"1 10294 Watch The Exact Moment Paul Ryan Committed Pol... \n",
"2 3608 Kerry to go to Paris in gesture of sympathy \n",
"3 10142 Bernie supporters on Twitter erupt in anger ag... \n",
"4 875 The Battle of New York: Why This Primary Matters \n",
"5 6903 Tehran, USA \n",
"6 7341 Girl Horrified At What She Watches Boyfriend D... \n",
"7 95 Britains Schindler Dies at 106 \n",
"8 4869 Fact check: Trump and Clinton at the 'commande... \n",
"9 2909 Iran reportedly makes new push for uranium con... \n",
"10 1357 With all three Clintons in Iowa, a glimpse at ... \n",
"11 988 Donald Trumps Shockingly Weak Delegate Game S... \n",
"12 7041 Strong Solar Storm, Tech Risks Today | S0 News... \n",
"13 7623 10 Ways America Is Preparing for World War 3 \n",
"14 1571 Trump takes on Cruz, but lightly \n",
"15 4739 How women lead differently \n",
"16 7737 Shocking! Michele Obama & Hillary Caught Glamo... \n",
"17 8716 Hillary Clinton in HUGE Trouble After America ... \n",
"18 3304 What's in that Iran bill that Obama doesn't like? \n",
"19 3078 The 1 chart that explains everything you need ... \n",
"\n",
" text label \n",
"0 Daniel Greenfield, a Shillman Journalism Fello... FAKE \n",
"1 Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE \n",
"2 U.S. Secretary of State John F. Kerry said Mon... REAL \n",
"3 — Kaydee King (@KaydeeKing) November 9, 2016 T... FAKE \n",
"4 It's primary day in New York and front-runners... REAL \n",
"5 \\nIm not an immigrant, but my grandparents ... FAKE \n",
"6 Share This Baylee Luciani (left), Screenshot o... FAKE \n",
"7 A Czech stockbroker who saved more than 650 Je... REAL \n",
"8 Hillary Clinton and Donald Trump made some ina... REAL \n",
"9 Iranian negotiators reportedly have made a las... REAL \n",
"10 CEDAR RAPIDS, Iowa — “I had one of the most wo... REAL \n",
"11 Donald Trumps organizational problems have go... REAL \n",
"12 Click Here To Learn More About Alexandra's Per... FAKE \n",
"13 October 31, 2016 at 4:52 am \\nPretty factual e... FAKE \n",
"14 Killing Obama administration rules, dismantlin... REAL \n",
"15 As more women move into high offices, they oft... REAL \n",
"16 Shocking! Michele Obama & Hillary Caught Glamo... FAKE \n",
"17 0 \\nHillary Clinton has barely just lost the p... FAKE \n",
"18 Washington (CNN) For months, the White House a... REAL \n",
"19 While paging through Pew's best data visualiza... REAL "
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df=pd.read_csv('news.csv')\n",
"df.shape\n",
"df.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"id": "major-section",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 FAKE\n",
"1 FAKE\n",
"2 REAL\n",
"3 FAKE\n",
"4 REAL\n",
"Name: label, dtype: object"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels=df.label\n",
"labels.head()"
]
},
{
"cell_type": "markdown",
"id": "surprised-desperate",
"metadata": {},
"source": [
"## **4. Visualizing the features as histograms**"
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "literary-correlation",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f51d81ab470>]],\n",
" dtype=object)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x1440 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"df.hist(figsize=(20,20), xrot=-45)"
]
},
{
"cell_type": "markdown",
"id": "hungry-costa",
"metadata": {},
"source": [
"## **5. Splitting into training and test sets**"
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "optical-wales",
"metadata": {},
"outputs": [],
"source": [
"x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)"
]
},
{
"cell_type": "markdown",
"id": "continued-system",
"metadata": {},
"source": [
"## **6. Training**"
]
},
{
"cell_type": "markdown",
"id": "oriental-trinity",
"metadata": {},
"source": [
"### 6.1. Using TF-IDF\n",
"\n",
"We initialize a TfidfVectorizer with English stop words and a maximum document frequency of 0.7 (terms that appear in more than 70% of the documents are discarded). Stop words are the most common words in a language, which should be filtered out before processing natural-language data. The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.\n",
"\n"
]
},
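{
"cell_type": "markdown",
"id": "sketch-tfidf-toy-note",
"metadata": {},
"source": [
"To see what the `stop_words` and `max_df` settings do, here is a small sketch on a made-up three-document corpus (the documents and the `toy_` names are illustrative assumptions, not project data). The word *the* is removed as an English stop word, and *budget* is removed because it occurs in all three documents, which is above the 0.7 document-frequency limit."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-tfidf-toy-code",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"# Made-up documents, used only to illustrate the vectorizer settings.\n",
"toy_docs = [\n",
"    'the senate passed the budget vote',\n",
"    'the budget shocked the senate',\n",
"    'aliens passed the budget',\n",
"]\n",
"\n",
"toy_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)\n",
"toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x terms\n",
"\n",
"# Terms that survived stop-word removal and the max_df filter\n",
"print(sorted(toy_vectorizer.vocabulary_))\n",
"print(toy_matrix.shape)"
]
},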
{
"cell_type": "code",
"execution_count": 58,
"id": "south-liability",
"metadata": {},
"outputs": [],
"source": [
"tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)\n",
"\n",
"tfidf_train=tfidf_vectorizer.fit_transform(x_train) \n",
"tfidf_test=tfidf_vectorizer.transform(x_test)"
]
},
{
"cell_type": "markdown",
"id": "linear-chest",
"metadata": {},
"source": [
"### 6.2. PassiveAggressiveClassifier"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "flying-gabriel",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 92.82%\n"
]
}
],
"source": [
"pac=PassiveAggressiveClassifier(max_iter=50)\n",
"pac.fit(tfidf_train,y_train)\n",
"\n",
"# Predict on the test set and compute the accuracy\n",
"y_pred=pac.predict(tfidf_test)\n",
"score=accuracy_score(y_test,y_pred)\n",
"print(f'Accuracy: {round(score*100,2)}%')"
]
},
{
"cell_type": "markdown",
"id": "balanced-security",
"metadata": {},
"source": [
"## **7. Summary of results**"
]
},
{
"cell_type": "markdown",
"id": "military-radar",
"metadata": {},
"source": [
"With this model we obtained an accuracy of 92.82%. Finally, let's print the confusion matrix to see the number of true and false positives and negatives."
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "fifty-melbourne",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[590, 48],\n",
" [ 43, 586]])"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])"
]
},
{
"cell_type": "markdown",
"id": "collectible-ireland",
"metadata": {},
"source": [
"With FAKE treated as the positive class, this model gives 590 true positives, 586 true negatives, 43 false positives, and 48 false negatives."
]
},
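{
"cell_type": "markdown",
"id": "sketch-cm-metrics-note",
"metadata": {},
"source": [
"As a quick cross-check, the counts in the confusion matrix can be turned back into metrics. The sketch below hard-codes the matrix printed above and assumes FAKE is the positive class, with rows as true labels and columns as predicted labels in the order given by labels=['FAKE','REAL']. The variable names are illustrative. The accuracy computed this way, (590 + 586) / 1267, agrees with the 92.82% reported earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-cm-metrics-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Confusion matrix copied from the output above (rows: true, columns: predicted).\n",
"cm_counts = np.array([[590, 48],\n",
"                      [43, 586]])\n",
"tp, fn = cm_counts[0]  # true FAKE: predicted FAKE / predicted REAL\n",
"fp, tn = cm_counts[1]  # true REAL: predicted FAKE / predicted REAL\n",
"\n",
"accuracy_from_cm = (tp + tn) / cm_counts.sum()\n",
"precision_fake = tp / (tp + fp)  # share of predicted FAKE that is truly FAKE\n",
"recall_fake = tp / (tp + fn)     # share of true FAKE that was detected\n",
"\n",
"print(f'accuracy={accuracy_from_cm:.4f}')\n",
"print(f'precision (FAKE)={precision_fake:.4f}, recall (FAKE)={recall_fake:.4f}')"
]
},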
{
"cell_type": "markdown",
"id": "natural-premiere",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "crucial-geneva",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}