{ "cells": [ { "cell_type": "markdown", "id": "statewide-crown", "metadata": {}, "source": [ "**12. Project**\n", "=================" ] }, { "cell_type": "markdown", "id": "explicit-bunch", "metadata": {}, "source": [ "## **1. Project goal**" ] }, { "cell_type": "markdown", "id": "encouraging-officer", "metadata": {}, "source": [ "#### The goal of the project is to predict which news items in a dataset are fake news. Tools used:\n", "* TfidfVectorizer\n", "* PassiveAggressiveClassifier\n", "\n", "Description of the algorithms.\n", "\n", "**TF (Term Frequency):** The number of times a word occurs in a document is its term frequency. A higher value means the term appears more often than others, so the document is a good match when the term is part of the search terms.\n", "\n", "**IDF (Inverse Document Frequency):** Words that occur in many documents of the corpus carry little information. IDF measures how rare a term is across the whole corpus and down-weights the common ones.\n", "\n", "TfidfVectorizer converts a collection of documents into a matrix of TF-IDF features.\n", "\n", "**Passive-aggressive algorithms** are online learning algorithms. Such an algorithm stays passive when a classification is correct and becomes aggressive when it is wrong, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector.\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "informal-filename", "metadata": {}, "source": [ "#### The news.csv data used for training comes from" ] }, { "cell_type": "markdown", "id": "beginning-minute", "metadata": {}, "source": [ "## **2. 
Importing the required libraries**" ] }, { "cell_type": "code", "execution_count": null, "id": "effective-democracy", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import itertools\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import PassiveAggressiveClassifier\n", "from sklearn.metrics import accuracy_score, confusion_matrix" ] }, { "cell_type": "markdown", "id": "alternative-knock", "metadata": {}, "source": [ "## **3. Loading the data**" ] }, { "cell_type": "code", "execution_count": 56, "id": "worldwide-blake", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Unnamed: 0 title \\\n", "0 8476 You Can Smell Hillary’s Fear \n", "1 10294 Watch The Exact Moment Paul Ryan Committed Pol... \n", "2 3608 Kerry to go to Paris in gesture of sympathy \n", "3 10142 Bernie supporters on Twitter erupt in anger ag... \n", "4 875 The Battle of New York: Why This Primary Matters \n", "5 6903 Tehran, USA \n", "6 7341 Girl Horrified At What She Watches Boyfriend D... \n", "7 95 ‘Britain’s Schindler’ Dies at 106 \n", "8 4869 Fact check: Trump and Clinton at the 'commande... \n", "9 2909 Iran reportedly makes new push for uranium con... \n", "10 1357 With all three Clintons in Iowa, a glimpse at ... \n", "11 988 Donald Trump’s Shockingly Weak Delegate Game S... \n", "12 7041 Strong Solar Storm, Tech Risks Today | S0 News... \n", "13 7623 10 Ways America Is Preparing for World War 3 \n", "14 1571 Trump takes on Cruz, but lightly \n", "15 4739 How women lead differently \n", "16 7737 Shocking! Michele Obama & Hillary Caught Glamo... \n", "17 8716 Hillary Clinton in HUGE Trouble After America ... \n", "18 3304 What's in that Iran bill that Obama doesn't like? \n", "19 3078 The 1 chart that explains everything you need ... \n", "\n", " text label \n", "0 Daniel Greenfield, a Shillman Journalism Fello... FAKE \n", "1 Google Pinterest Digg Linkedin Reddit Stumbleu... FAKE \n", "2 U.S. Secretary of State John F. Kerry said Mon... REAL \n", "3 — Kaydee King (@KaydeeKing) November 9, 2016 T... FAKE \n", "4 It's primary day in New York and front-runners... REAL \n", "5 \\nI’m not an immigrant, but my grandparents ... FAKE \n", "6 Share This Baylee Luciani (left), Screenshot o... FAKE \n", "7 A Czech stockbroker who saved more than 650 Je... REAL \n", "8 Hillary Clinton and Donald Trump made some ina... REAL \n", "9 Iranian negotiators reportedly have made a las... REAL \n", "10 CEDAR RAPIDS, Iowa — “I had one of the most wo... REAL \n", "11 Donald Trump’s organizational problems have go... 
REAL \n", "12 Click Here To Learn More About Alexandra's Per... FAKE \n", "13 October 31, 2016 at 4:52 am \\nPretty factual e... FAKE \n", "14 Killing Obama administration rules, dismantlin... REAL \n", "15 As more women move into high offices, they oft... REAL \n", "16 Shocking! Michele Obama & Hillary Caught Glamo... FAKE \n", "17 0 \\nHillary Clinton has barely just lost the p... FAKE \n", "18 Washington (CNN) For months, the White House a... REAL \n", "19 While paging through Pew's best data visualiza... REAL " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.read_csv('news.csv')\n", "df.shape\n", "df.head(20)" ] }, { "cell_type": "code", "execution_count": 57, "id": "major-section", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 FAKE\n", "1 FAKE\n", "2 REAL\n", "3 FAKE\n", "4 REAL\n", "Name: label, dtype: object" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels=df.label\n", "labels.head()" ] }, { "cell_type": "markdown", "id": "surprised-desperate", "metadata": {}, "source": [ "## **4. Visualizing features with histograms**" ] }, { "cell_type": "code", "execution_count": 52, "id": "literary-correlation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[]],\n", " dtype=object)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.hist(figsize=(20,20), xrot=-45)" ] }, { "cell_type": "markdown", "id": "hungry-costa", "metadata": {}, "source": [ "## **5. Podział na zbiór testowy i treningowy**" ] }, { "cell_type": "code", "execution_count": 53, "id": "optical-wales", "metadata": {}, "outputs": [], "source": [ "x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)" ] }, { "cell_type": "markdown", "id": "continued-system", "metadata": {}, "source": [ "## **6. Trenowanie**" ] }, { "cell_type": "markdown", "id": "oriental-trinity", "metadata": {}, "source": [ "### 6.1. Użycie TFIDF\n", "\n", "Inicjalizuje wektor TfidfVectorizer ze słowami stop z języka angielskiego i maksymalną częstotliwością występowania w dokumentach wynoszącą 0,7 (terminy o wyższej częstotliwości występowania w dokumentach zostaną odrzucone). Stop words to najczęściej występujące słowa w danym języku, które należy odfiltrować przed przetworzeniem danych języka naturalnego. Wektoryzator TfidfVectorizer przekształca zbiór nieprzetworzonych dokumentów w macierz cech TF-IDF.\n", "\n" ] }, { "cell_type": "code", "execution_count": 58, "id": "south-liability", "metadata": {}, "outputs": [], "source": [ "tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)\n", "\n", "tfidf_train=tfidf_vectorizer.fit_transform(x_train) \n", "tfidf_test=tfidf_vectorizer.transform(x_test)" ] }, { "cell_type": "markdown", "id": "linear-chest", "metadata": {}, "source": [ "### 6.2. PassiveAggressiveClassifier" ] }, { "cell_type": "code", "execution_count": 55, "id": "flying-gabriel", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dokładność: 92.82%\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py:283: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. 
To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n", "Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n", " indices = (scores > 0).astype(np.int)\n" ] } ], "source": [ "pac=PassiveAggressiveClassifier(max_iter=50)\n", "pac.fit(tfidf_train,y_train)\n", "\n", "# Predict on the test set and compute the accuracy (the printed label 'Dokładność' means 'accuracy')\n", "y_pred=pac.predict(tfidf_test)\n", "score=accuracy_score(y_test,y_pred)\n", "print(f'Dokładność: {round(score*100,2)}%')" ] }, { "cell_type": "markdown", "id": "balanced-security", "metadata": {}, "source": [ "## **7. Summary of results**" ] }, { "cell_type": "markdown", "id": "military-radar", "metadata": {}, "source": [ "This model reached an accuracy of 92.82%. Finally, let us print the confusion matrix to see the number of true and false positives and negatives." ] }, { "cell_type": "code", "execution_count": 60, "id": "fifty-melbourne", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[590, 48],\n", " [ 43, 586]])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])" ] }, { "cell_type": "markdown", "id": "collectible-ireland", "metadata": {}, "source": [ "Treating FAKE as the positive class (rows are true labels, columns are predictions), this model gives 590 true positives, 586 true negatives, 43 false positives and 48 false negatives." 
] }, { "cell_type": "markdown", "id": "natural-premiere", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "crucial-geneva", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 5 }