Sentiment analysis of Twitter opinions
Download dataset and prepare data
Installation of packages
%pip install pandas
%pip install scikit-learn
%pip install emoji
%pip install gensim
Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import emoji
from gensim.utils import simple_preprocess
Download the dataset
!kaggle datasets download -d jp797498e/twitter-entity-sentiment-analysis
Dataset URL: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
License(s): CC0-1.0
twitter-entity-sentiment-analysis.zip: Skipping, found more recently modified local copy (use --force to force download)
Unzip the dataset
!unzip -o twitter-entity-sentiment-analysis.zip
Archive:  twitter-entity-sentiment-analysis.zip
  inflating: twitter_training.csv
  inflating: twitter_validation.csv
Load the dataset
cols = ["tweetid", "entity", "sentiment", "content"]
twitter_training = pd.read_csv("twitter_training.csv", names=cols)
twitter_validation = pd.read_csv("twitter_validation.csv", names=cols)
dataset = pd.concat([twitter_training, twitter_validation])
Info about the dataset
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Index: 75682 entries, 0 to 999
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   tweetid    75682 non-null  int64
 1   entity     75682 non-null  object
 2   sentiment  75682 non-null  object
 3   content    74996 non-null  object
dtypes: int64(1), object(3)
memory usage: 2.9+ MB
dataset.shape
(75682, 4)
dataset["sentiment"].value_counts()
sentiment
Negative      22808
Positive      21109
Neutral       18603
Irrelevant    13162
Name: count, dtype: int64
dataset.isna().sum()
tweetid        0
entity         0
sentiment      0
content      686
dtype: int64
dataset.duplicated().sum()
3217
Prepare the dataset
Drop tweetid and entity columns
dataset = dataset.drop(columns=["tweetid", "entity"])
Drop null values
dataset.dropna(inplace=True)
Remove emojis
dataset["content"] = dataset["content"].apply(
    lambda x: emoji.replace_emoji(x, replace="")
)
Simple Preprocess
dataset["content"] = dataset["content"].apply(lambda x: " ".join(simple_preprocess(x)))
Drop null values
dataset.dropna(inplace=True)
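A caveat on the second dropna above: joining an empty token list yields the empty string "", which pandas does not treat as a missing value, so tweets that lose all their tokens during preprocessing would survive this step. The sketch below uses a small hypothetical frame (toy_df and its rows are illustrative, not taken from the dataset) to show one possible way to filter such rows. Whether the real data contains any is not shown in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: two rows have no usable text left after cleaning.
toy_df = pd.DataFrame(
    {
        "sentiment": ["Positive", "Negative", "Neutral"],
        "content": ["great game", "", "   "],
    }
)

# dropna() keeps all three rows because "" and "   " are not NaN.
after_dropna = toy_df.dropna()

# Keep only rows whose content is non-empty after stripping whitespace.
non_empty = toy_df[toy_df["content"].str.strip().ne("")]

print(len(after_dropna), len(non_empty))
```

Applied to the notebook's dataset, the same boolean mask could be used right after the simple_preprocess step.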
Drop duplicates
dataset.drop_duplicates(inplace=True)
Info about the dataset after cleaning
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Index: 65839 entries, 0 to 991
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   sentiment  65839 non-null  object
 1   content    65839 non-null  object
dtypes: object(2)
memory usage: 1.5+ MB
dataset.shape
(65839, 2)
dataset["sentiment"].value_counts()
sentiment
Negative      20147
Positive      17868
Neutral       16193
Irrelevant    11631
Name: count, dtype: int64
dataset.isna().sum()
sentiment    0
content      0
dtype: int64
dataset.duplicated().sum()
0
Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    dataset["content"], dataset["sentiment"], test_size=0.2, random_state=0
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((52671,), (13168,), (52671,), (13168,))
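The split above is a plain random split. Because the four sentiment classes are not equally frequent, one optional variation is a stratified split, which keeps the class proportions the same in both parts. The sketch below is not what the notebook does; it uses a hypothetical toy frame (toy and its contents are illustrative) and the stratify parameter of train_test_split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with imbalanced classes (10 / 6 / 4).
toy = pd.DataFrame(
    {
        "content": [f"tweet {i}" for i in range(20)],
        "sentiment": ["Negative"] * 10 + ["Positive"] * 6 + ["Neutral"] * 4,
    }
)

# stratify= keeps each class's share identical in the train and test parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy["content"],
    toy["sentiment"],
    test_size=0.5,
    random_state=0,
    stratify=toy["sentiment"],
)

print(toy_y_test.value_counts())
```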
TF-IDF - Logistic Regression
Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
Text Vectorization Using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
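To make the weighting concrete, here is a minimal pure-Python sketch of the formula TfidfVectorizer applies with its default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each document vector. The toy corpus and the helper name tfidf_weights are illustrative assumptions. The real vectorizer also handles tokenization and sparse storage, which this sketch leaves out.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus.
docs = [["good", "game"], ["bad", "game"], ["good", "good", "update"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed IDF, matching scikit-learn's default smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}


def tfidf_weights(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vectors = [tfidf_weights(doc) for doc in docs]
print(vectors[0])
```

Terms that occur in fewer documents ("bad", "update") receive a larger IDF than terms shared across documents ("good", "game").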
Training a Logistic Regression model
model = LogisticRegression(solver="lbfgs", penalty="l2", max_iter=1000)
model.fit(X_train_tfidf, y_train)
LogisticRegression(max_iter=1000)
Predicting
y_pred = model.predict(X_test_tfidf)
Classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

  Irrelevant       0.82      0.70      0.75      2304
    Negative       0.80      0.86      0.83      4024
     Neutral       0.79      0.74      0.77      3169
    Positive       0.78      0.82      0.80      3671

    accuracy                           0.79     13168
   macro avg       0.80      0.78      0.79     13168
weighted avg       0.79      0.79      0.79     13168
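The vectorize, fit and predict steps above could also be bundled into a single scikit-learn pipeline, so the vectorizer is always fitted on training text only and applied consistently at prediction time. This is an optional restructuring, not the notebook's own code. The toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train.
toy_texts = [
    "love this game",
    "great update love it",
    "hate this bug",
    "terrible lag hate it",
]
toy_labels = ["Positive", "Positive", "Negative", "Negative"]

# The pipeline fits TF-IDF and the classifier together on raw strings.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_clf.fit(toy_texts, toy_labels)

# Raw strings go straight into predict(); vectorization happens inside.
toy_pred = text_clf.predict(["love it", "hate lag"])
print(toy_pred)
```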
TF-IDF - Random Forest Classifier
Importing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
Text Vectorization Using TF-IDF
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Training a Random Forest Classifier model
model = RandomForestClassifier(criterion="gini")
model.fit(X_train_tfidf, y_train)
RandomForestClassifier()
Predicting
y_pred = model.predict(X_test_tfidf)
Classification report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

  Irrelevant       0.95      0.87      0.91      2304
    Negative       0.92      0.95      0.93      4024
     Neutral       0.94      0.91      0.93      3169
    Positive       0.90      0.94      0.92      3671

    accuracy                           0.93     13168
   macro avg       0.93      0.92      0.92     13168
weighted avg       0.93      0.93      0.92     13168
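The random forest above is created without a random_state, so repeated runs may give slightly different scores. If reproducibility matters, a fixed seed can be passed. The sketch below illustrates this on hypothetical toy data. The helper fit_and_score and the small n_estimators value are assumptions chosen to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the tweet corpus.
toy_texts = [
    "love this game",
    "great update love it",
    "hate this bug",
    "terrible lag hate it",
]
toy_labels = ["Positive", "Positive", "Negative", "Negative"]
toy_features = TfidfVectorizer().fit_transform(toy_texts)


def fit_and_score(seed):
    """Fit a small forest with a fixed seed and return class probabilities."""
    forest = RandomForestClassifier(
        criterion="gini", n_estimators=25, random_state=seed
    )
    forest.fit(toy_features, toy_labels)
    return forest.predict_proba(toy_features)


# Two fits with the same seed produce identical probabilities.
first_run = fit_and_score(0)
second_run = fit_and_score(0)
print((first_run == second_run).all())
```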
Word2Vec - LSTM
Installation of packages
%pip install tensorflow
%pip install numpy
Importing libraries
from gensim.models import Word2Vec
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
Function to convert text to Word2Vec vectors
def text_to_vector(text, word2vec, vector_size):
    # Average the Word2Vec vectors of all in-vocabulary tokens in the text.
    words = simple_preprocess(text)
    text_vector = np.zeros(vector_size)
    word_count = 0
    for word in words:
        if word in word2vec.wv:
            text_vector += word2vec.wv[word]
            word_count += 1
    # Texts without any known token keep the all-zero vector.
    if word_count > 0:
        text_vector /= word_count
    return text_vector
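The averaging performed by text_to_vector can be illustrated without a trained model. In the sketch below a plain dictionary (toy_vectors, hypothetical) stands in for word2vec.wv, and average_vector is an illustrative helper rather than part of the notebook. Note that averaging collapses a whole tweet into a single vector of length vector_size, so word order is lost.

```python
import numpy as np


def average_vector(tokens, vectors, vector_size):
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)


# Hypothetical 2-dimensional "embeddings" standing in for word2vec.wv.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "game": np.array([0.0, 1.0]),
}

# The out-of-vocabulary token is skipped, so the result is the mean of two vectors.
doc_vec = average_vector(["good", "game", "unknownword"], toy_vectors, 2)
empty_vec = average_vector(["unknownword"], toy_vectors, 2)
print(doc_vec, empty_vec)
```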
Tokenize texts
tokenized_text = dataset["content"].apply(lambda x: x.split())
Vector size parameter
vector_size = 100
Train Word2Vec model
model_word2vec = Word2Vec(
    tokenized_text, window=5, min_count=2, workers=4, vector_size=vector_size, epochs=20
)
Convert texts to Word2Vec vectors
train_vectors = np.array(
    [text_to_vector(text, model_word2vec, vector_size) for text in X_train]
)
test_vectors = np.array(
    [text_to_vector(text, model_word2vec, vector_size) for text in X_test]
)
Find the maximum sequence length in the training set
max_len = max(len(seq) for seq in train_vectors)
Pad sequences to the same length
X_train_emb = tf.keras.preprocessing.sequence.pad_sequences(
    train_vectors, maxlen=max_len, dtype="float32", padding="post"
)
X_test_emb = tf.keras.preprocessing.sequence.pad_sequences(
    test_vectors, maxlen=max_len, dtype="float32", padding="post"
)
Encode labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)
Define LSTM model
model = tf.keras.Sequential(
    [
        tf.keras.layers.Embedding(input_dim=X_train_emb.shape[1], output_dim=100),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ]
)
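A note on the input format: tf.keras.layers.Embedding looks up rows by integer token id, whereas the arrays built above hold one averaged float vector per tweet. That mismatch may be one reason the scores in this section stay well below the TF-IDF models, although the notebook does not test this. One possible alternative, sketched below with NumPy only, is to keep one Word2Vec vector per token and zero-pad each tweet to a fixed length. The resulting array of shape (samples, max_len, vector_size) is the 3-D layout an LSTM layer accepts directly, without an Embedding layer. The names tokens_to_sequence, toy_vectors and toy_docs are hypothetical, and a dictionary stands in for the trained model's wv lookup.

```python
import numpy as np


def tokens_to_sequence(tokens, vectors, vector_size, max_len):
    """Stack per-token vectors into a zero-padded (max_len, vector_size) array."""
    seq = np.zeros((max_len, vector_size), dtype="float32")
    # Skip unknown tokens and truncate to max_len known tokens.
    known = [vectors[t] for t in tokens if t in vectors][:max_len]
    for i, vec in enumerate(known):
        seq[i] = vec
    return seq


# Hypothetical 2-dimensional "embeddings" standing in for word2vec.wv.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "game": np.array([0.0, 1.0]),
}
toy_docs = [["good", "game"], ["bad", "game", "really", "bad", "game"]]

# Shape: (number of tweets, max_len time steps, vector_size features).
batch = np.stack(
    [tokens_to_sequence(doc, toy_vectors, vector_size=2, max_len=4) for doc in toy_docs]
)
print(batch.shape)
```

With such a batch, the first layer of the network would be the LSTM itself, with an input shape of (max_len, vector_size).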
Compile the model
model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
Train the model
model.fit(X_train_emb, y_train_enc, epochs=50, batch_size=64)
Epoch 1/50 823/823 [==============================] - 10s 9ms/step - loss: 1.3439 - accuracy: 0.3438
Epoch 2/50 823/823 [==============================] - 7s 9ms/step - loss: 1.3261 - accuracy: 0.3678
Epoch 3/50 823/823 [==============================] - 6s 8ms/step - loss: 1.3163 - accuracy: 0.3774
Epoch 4/50 823/823 [==============================] - 7s 9ms/step - loss: 1.3020 - accuracy: 0.3975
Epoch 5/50 823/823 [==============================] - 6s 7ms/step - loss: 1.2904 - accuracy: 0.4119
Epoch 6/50 823/823 [==============================] - 8s 9ms/step - loss: 1.2814 - accuracy: 0.4186
Epoch 7/50 823/823 [==============================] - 6s 7ms/step - loss: 1.2741 - accuracy: 0.4262
Epoch 8/50 823/823 [==============================] - 8s 9ms/step - loss: 1.2667 - accuracy: 0.4325
Epoch 9/50 823/823 [==============================] - 6s 7ms/step - loss: 1.2588 - accuracy: 0.4372
Epoch 10/50 823/823 [==============================] - 7s 9ms/step - loss: 1.2513 - accuracy: 0.4407
Epoch 11/50 823/823 [==============================] - 6s 7ms/step - loss: 1.2451 - accuracy: 0.4450
Epoch 12/50 823/823 [==============================] - 7s 8ms/step - loss: 1.2365 - accuracy: 0.4491
Epoch 13/50 823/823 [==============================] - 6s 7ms/step - loss: 1.2291 - accuracy: 0.4560
Epoch 14/50 823/823 [==============================] - 7s 9ms/step - loss: 1.2218 - accuracy: 0.4593
Epoch 15/50 823/823 [==============================] - 6s 7ms/step - loss: 1.2144 - accuracy: 0.4636
Epoch 16/50 823/823 [==============================] - 7s 9ms/step - loss: 1.2066 - accuracy: 0.4669
Epoch 17/50 823/823 [==============================] - 6s 7ms/step - loss: 1.1989 - accuracy: 0.4707
Epoch 18/50 823/823 [==============================] - 7s 9ms/step - loss: 1.1887 - accuracy: 0.4759
Epoch 19/50 823/823 [==============================] - 7s 9ms/step - loss: 1.1810 - accuracy: 0.4803
Epoch 20/50 823/823 [==============================] - 7s 9ms/step - loss: 1.1717 - accuracy: 0.4846
Epoch 21/50 823/823 [==============================] - 6s 7ms/step - loss: 1.1631 - accuracy: 0.4883
Epoch 22/50 823/823 [==============================] - 7s 8ms/step - loss: 1.1533 - accuracy: 0.4948
Epoch 23/50 823/823 [==============================] - 6s 7ms/step - loss: 1.1426 - accuracy: 0.4983
Epoch 24/50 823/823 [==============================] - 7s 9ms/step - loss: 1.1338 - accuracy: 0.5040
Epoch 25/50 823/823 [==============================] - 6s 7ms/step - loss: 1.1229 - accuracy: 0.5075
Epoch 26/50 823/823 [==============================] - 7s 8ms/step - loss: 1.1126 - accuracy: 0.5125
Epoch 27/50 823/823 [==============================] - 6s 7ms/step - loss: 1.1042 - accuracy: 0.5167
Epoch 28/50 823/823 [==============================] - 7s 8ms/step - loss: 1.0920 - accuracy: 0.5237
Epoch 29/50 823/823 [==============================] - 6s 7ms/step - loss: 1.0809 - accuracy: 0.5266
Epoch 30/50 823/823 [==============================] - 7s 8ms/step - loss: 1.0730 - accuracy: 0.5307
Epoch 31/50 823/823 [==============================] - 6s 7ms/step - loss: 1.0628 - accuracy: 0.5357
Epoch 32/50 823/823 [==============================] - 7s 9ms/step - loss: 1.0536 - accuracy: 0.5422
Epoch 33/50 823/823 [==============================] - 6s 7ms/step - loss: 1.0399 - accuracy: 0.5480
Epoch 34/50 823/823 [==============================] - 7s 9ms/step - loss: 1.0350 - accuracy: 0.5503
Epoch 35/50 823/823 [==============================] - 6s 7ms/step - loss: 1.0237 - accuracy: 0.5553
Epoch 36/50 823/823 [==============================] - 7s 9ms/step - loss: 1.0217 - accuracy: 0.5550
Epoch 37/50 823/823 [==============================] - 6s 7ms/step - loss: 1.0073 - accuracy: 0.5633
Epoch 38/50 823/823 [==============================] - 7s 8ms/step - loss: 0.9927 - accuracy: 0.5703
Epoch 39/50 823/823 [==============================] - 6s 8ms/step - loss: 0.9848 - accuracy: 0.5732
Epoch 40/50 823/823 [==============================] - 6s 8ms/step - loss: 0.9786 - accuracy: 0.5748
Epoch 41/50 823/823 [==============================] - 7s 8ms/step - loss: 0.9735 - accuracy: 0.5774
Epoch 42/50 823/823 [==============================] - 6s 7ms/step - loss: 0.9633 - accuracy: 0.5839
Epoch 43/50 823/823 [==============================] - 7s 8ms/step - loss: 0.9530 - accuracy: 0.5873
Epoch 44/50 823/823 [==============================] - 6s 7ms/step - loss: 0.9506 - accuracy: 0.5893
Epoch 45/50 823/823 [==============================] - 7s 9ms/step - loss: 0.9364 - accuracy: 0.5958
Epoch 46/50 823/823 [==============================] - 6s 7ms/step - loss: 0.9260 - accuracy: 0.6006
Epoch 47/50 823/823 [==============================] - 7s 9ms/step - loss: 0.9257 - accuracy: 0.6008
Epoch 48/50 823/823 [==============================] - 6s 7ms/step - loss: 0.9155 - accuracy: 0.6048
Epoch 49/50 823/823 [==============================] - 7s 8ms/step - loss: 0.9103 - accuracy: 0.6066
Epoch 50/50 823/823 [==============================] - 6s 7ms/step - loss: 0.8999 - accuracy: 0.6122
<keras.src.callbacks.History at 0x790a27fa2b60>
Predicting
y_pred = model.predict(X_test_emb)
# Pick the class index with the highest predicted probability for each sample.
y_preds_argmax = np.argmax(y_pred, axis=1)
412/412 [==============================] - 2s 4ms/step
Classification report
print(classification_report(y_test_enc, y_preds_argmax))
              precision    recall  f1-score   support

           0       0.32      0.20      0.25      2304
           1       0.46      0.62      0.53      4024
           2       0.44      0.43      0.44      3169
           3       0.45      0.39      0.42      3671

    accuracy                           0.44     13168
   macro avg       0.42      0.41      0.41     13168
weighted avg       0.43      0.44      0.42     13168
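The report above lists the classes as 0 to 3 because the labels were integer-encoded. LabelEncoder assigns codes in sorted order of the class names, so the codes can be mapped back through classes_ or inverse_transform. The sketch below shows this on a hypothetical label list (toy_sentiments). Passing target_names=label_encoder.classes_ to classification_report would be one way to print names instead of numbers.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels using the dataset's four sentiment classes.
toy_sentiments = ["Positive", "Negative", "Neutral", "Irrelevant", "Negative"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_sentiments)

# classes_[i] is the name that was encoded as integer i (alphabetical order).
code_to_name = dict(enumerate(toy_encoder.classes_))

# inverse_transform maps predicted codes back to class names.
decoded = toy_encoder.inverse_transform([0, 1, 2, 3])
print(code_to_name)
```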