{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiclass text classification\n", "
The goal of the project was to build a model that classifies questions submitted by students in India preparing for the JEE Advanced, JEE Mains, and NEET exams into one of four classes describing the subject the question relates to: Biology, Chemistry, Maths, or Physics.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
| | eng | Subject |
|---|---|---|
| 0 | An anti-forest measure is\\nA. Afforestation\\nB... | Biology |
| 1 | Among the following organic acids, the acid pr... | Chemistry |
| 2 | If the area of two similar triangles are equal... | Maths |
| 3 | In recent year, there has been a growing\\nconc... | Biology |
| 4 | Which of the following statement\\nregarding tr... | Physics |

After one-hot encoding the `Subject` column:

| | eng | Subject_Biology | Subject_Chemistry | Subject_Maths | Subject_Physics |
|---|---|---|---|---|---|
| 0 | An anti-forest measure is\\nA. Afforestation\\nB... | True | False | False | False |
| 1 | Among the following organic acids, the acid pr... | False | True | False | False |
| 2 | If the area of two similar triangles are equal... | False | False | True | False |
| 3 | In recent year, there has been a growing\\nconc... | True | False | False | False |
| 4 | Which of the following statement\\nregarding tr... | False | False | False | True |

After adding the preprocessed `prepared_text` column:

| | eng | Subject_Biology | Subject_Chemistry | Subject_Maths | Subject_Physics | prepared_text |
|---|---|---|---|---|---|---|
| 0 | An anti-forest measure is\\nA. Afforestation\\nB... | True | False | False | False | an anti-forest measur is a. afforest b . selec... |
| 1 | Among the following organic acids, the acid pr... | False | True | False | False | among the follow organ acid , the acid present... |
| 2 | If the area of two similar triangles are equal... | False | False | True | False | if the area of two similar triangl are equal ,... |
| 3 | In recent year, there has been a growing\\nconc... | True | False | False | False | in recent year , there ha been a grow concern ... |
| 4 | Which of the following statement\\nregarding tr... | False | False | False | True | which of the follow statement regard transform... |
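
The preprocessing cells are not shown here, so the following is only a minimal sketch of the two steps the tables suggest: turning `Subject` into one boolean column per class, and lowercasing and tokenising `eng`. The helper names `one_hot` and `prepare_text` are hypothetical. The original notebook may well have used `pandas.get_dummies` and an NLTK tokenizer and stemmer instead; the stemming visible in `prepared_text` (for example `measur`) is deliberately left out to keep the sketch dependency-free, so its output will not match the table exactly.

```python
import re


def one_hot(labels, prefix='Subject'):
    # One boolean entry per distinct label, named like the table columns
    # (Subject_Biology, Subject_Chemistry, ...). Classes are sorted so the
    # column order is deterministic.
    classes = sorted(set(labels))
    return [{f'{prefix}_{c}': (label == c) for c in classes} for label in labels]


def prepare_text(text):
    # Lowercase, split on whitespace, then separate punctuation from words.
    # Hyphenated words such as anti-forest are kept as a single token.
    # No stemming is applied in this sketch.
    pattern = re.compile('[a-z0-9]+(?:-[a-z0-9]+)*|[^a-z0-9]')
    tokens = []
    for chunk in text.lower().split():
        tokens.extend(pattern.findall(chunk))
    return ' '.join(tokens)


subjects = ['Biology', 'Chemistry', 'Maths', 'Biology', 'Physics']
encoded = one_hot(subjects)
print(encoded[0])
print(prepare_text('An anti-forest measure is A. Afforestation'))
```
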
Model consisting of Embedding, LSTM, and GlobalAveragePooling1D layers followed by several Dense layers
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_1\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_1 (Embedding) (None, None, 16) 160000 \n", " \n", " lstm_1 (LSTM) (None, None, 64) 20736 \n", " \n", " global_average_pooling1d_1 (None, 64) 0 \n", " (GlobalAveragePooling1D) \n", " \n", " dense (Dense) (None, 254) 16510 \n", " \n", " dense_1 (Dense) (None, 128) 32640 \n", " \n", " dense_2 (Dense) (None, 4) 516 \n", " \n", "=================================================================\n", "Total params: 230,402\n", "Trainable params: 230,402\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "model_1 = Sequential()\n", "\n", "model_1.add(Embedding(input_dim=10000, output_dim=16))\n", "model_1.add(LSTM(units=64, return_sequences=True))\n", "model_1.add(GlobalAveragePooling1D())\n", "model_1.add(Dense(254, activation='relu'))\n", "model_1.add(Dense(128, activation='relu'))\n", "model_1.add(Dense(4, activation='softmax'))\n", "\n", "model_1.compile(optimizer='adam', loss='categorical_crossentropy', \n", " metrics=['accuracy', \n", " tf.keras.metrics.Precision(name='precision'),\n", " tf.keras.metrics.Recall(name='recall'),\n", " tf.keras.metrics.AUC(name='auc'),\n", " tf.keras.metrics.TruePositives(name='tp'),\n", " tf.keras.metrics.FalsePositives(name='fp'),\n", " tf.keras.metrics.TrueNegatives(name='tn'),\n", " tf.keras.metrics.FalseNegatives(name='fn')])\n", "\n", "model_1.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convolutional model with an Embedding layer and a single 1D convolutional layer
" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_2\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_2 (Embedding) (None, None, 16) 160000 \n", " \n", " conv1d (Conv1D) (None, None, 128) 10368 \n", " \n", " global_max_pooling1d (Globa (None, 128) 0 \n", " lMaxPooling1D) \n", " \n", " dense_3 (Dense) (None, 4) 516 \n", " \n", "=================================================================\n", "Total params: 170,884\n", "Trainable params: 170,884\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "model_2 = Sequential()\n", "\n", "model_2.add(Embedding(input_dim=10000, output_dim=16))\n", "model_2.add(Conv1D(filters=128, kernel_size=5, activation='relu'))\n", "model_2.add(GlobalMaxPooling1D())\n", "model_2.add(Dense(4, activation='softmax'))\n", "model_2.compile(optimizer='adam', loss='categorical_crossentropy', \n", " metrics=['accuracy', \n", " tf.keras.metrics.Precision(name='precision'),\n", " tf.keras.metrics.Recall(name='recall'),\n", " tf.keras.metrics.AUC(name='auc'),\n", " tf.keras.metrics.TruePositives(name='tp'),\n", " tf.keras.metrics.FalsePositives(name='fp'),\n", " tf.keras.metrics.TrueNegatives(name='tn'),\n", " tf.keras.metrics.FalseNegatives(name='fn')])\n", "\n", "model_2.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recurrent model with an Embedding layer and a single GRU layer
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_4\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " embedding_4 (Embedding) (None, None, 16) 160000 \n", " \n", " gru_1 (GRU) (None, 64) 15744 \n", " \n", " dense_5 (Dense) (None, 4) 260 \n", " \n", "=================================================================\n", "Total params: 176,004\n", "Trainable params: 176,004\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "model_3 = Sequential()\n", "\n", "model_3.add(Embedding(input_dim=10000, output_dim=16))\n", "model_3.add(GRU(units=64))\n", "model_3.add(Dense(units=4, activation='softmax'))\n", "model_3.compile(optimizer='adam', loss='categorical_crossentropy', \n", " metrics=['accuracy', \n", " tf.keras.metrics.Precision(name='precision'),\n", " tf.keras.metrics.Recall(name='recall'),\n", " tf.keras.metrics.AUC(name='auc'),\n", " tf.keras.metrics.TruePositives(name='tp'),\n", " tf.keras.metrics.FalsePositives(name='fp'),\n", " tf.keras.metrics.TrueNegatives(name='tn'),\n", " tf.keras.metrics.FalseNegatives(name='fn')])\n", "\n", "model_3.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
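The parameter counts printed by the three `summary()` calls above can be checked by hand. The sketch below uses the standard per-layer formulas in plain Python; the helper names are hypothetical and nothing here calls Keras. The GRU formula assumes two bias vectors per gate, which is what the Keras default `reset_after=True` produces and what the reported 15,744 parameters are consistent with.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out


def lstm_params(n_in, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (n_in * units + units * units + units)


def gru_params(n_in, units):
    # 3 gates, each with input weights, recurrent weights and two bias vectors
    # (assumption: reset_after=True, the Keras default in TensorFlow 2).
    return 3 * (n_in * units + units * units + 2 * units)


def conv1d_params(n_in, filters, kernel_size):
    # One kernel of shape (kernel_size, n_in) per filter, plus one bias each.
    return kernel_size * n_in * filters + filters


model_1_total = (embedding_params(10000, 16) + lstm_params(16, 64)
                 + dense_params(64, 254) + dense_params(254, 128)
                 + dense_params(128, 4))
model_2_total = (embedding_params(10000, 16) + conv1d_params(16, 128, 5)
                 + dense_params(128, 4))
model_3_total = (embedding_params(10000, 16) + gru_params(16, 64)
                 + dense_params(64, 4))

print(model_1_total, model_2_total, model_3_total)  # 230402 170884 176004
```

The pooling layers add no parameters, which is why they do not appear in the sums. The totals match the `Total params` lines of the three summaries.
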