- added preprocess_data.py
- added movies_data_short.csv for testing
parent 7d27ba0fe1
commit 937415496e
20
data/movies_data_short.csv
Normal file
@@ -0,0 +1,20 @@
title,release_date,genres,directors,stars,rating,duration,description,storyline,keywords
Remo: Unarmed and Dangerous,1985,"['Action', 'Adventure', 'B-Action', 'Comedy', 'Crime', 'Romance', 'Thriller']",['Guy Hamilton'],"['Fred Ward', 'Joel Grey', 'Wilford Brimley']",6.4,2h 1m,"An officially ""dead"" cop is trained to become an extraordinary unique assassin in service of the US President.","A cop who answers a call is ambushed. The next day he is buried. But in reality he is in a hospital and his appearance has been altered. He is then told by a man named McCleary that he now belongs to ""them"". ""Them"" being CURE an organization whose job is to battle corruption. They give him the new name of Remo Williams. He then meets the head of CURE Harold Smith, who spends most of his time sitting in front of a copmuter and perusing over reports of individuals that have to be dealt with. They then give him to Chiun, a Shinanju master, which is the art of killing someone and making it seem like an accident or natural causes. Chiun's regimen is hard on him. Smith then discovers a man named Grove, who is a defense contractor. It seems that whenever there's a case against him, the key witnesses and investigators disappear. Currently a military investigator is pursuing him about his new project which for some reason, he is tight lipped about. Smith sends McCleary and Remo to help her but Grove discovers them and wanting to know about them decides to stir things up. He sends some people to take Remo out but he outwits them. And when he tells Smith about it, Smith doesn't care who then tells him that unless they have more evidence against Grove they can't do anything and if they are about to be exposed, they have to disappear. And while Smith and McCleary have made arrangements for their demise, Remo is told that Chiun will take him out. So he and McCleary have to get the evidence they need.","['race against time', 'assassin', 'korean war veteran', 'soldier', 'action hero']"
Godzilla,1998,"['Action', 'Dinosaur Adventure', 'Disaster', 'Kaiju', 'Sci-Fi', 'Thriller']",['Roland Emmerich'],"['Matthew Broderick', 'Jean Reno', 'Maria Pitillo']",5.5,2h 19m,French nuclear tests irradiate an iguana into a giant monster that heads off to New York City. The American military must chase the monster across the city to stop it before it reproduces.,"In the wake of extensive nuclear testing in the South Pacific Ocean, the low-profile scientist, Niko Tatopoulos, is summoned by the U.S. Army to shed light on the mysterious attack on a fishing ship, and the ominous sightings of a gargantuan sea-dragon. Before long, a mutated scaly nightmare in the shape of Godzilla--a massive and all-powerful radioactive sauroid--threatens to level the rain-soaked New York City, against the backdrop of a crippling bureaucracy and the military's futile attempts to stop the invincible beast from the ocean. Now, it's up to Niko; the cryptic insurance agent, Philippe; the determined reporter, Audrey, and her brave cameraman, Victor, to put an end to Godzilla's reign of terror before it's too late. Is there a reason why Godzilla has chosen Manhattan for its den?","['giant monster', 'iguana', 'military', 'giant footprint', 'monster']"
Die Hard: With a Vengeance,1995,"['Action', 'Adventure', 'Dark Comedy', 'One-Person Army Action', 'Thriller', 'Urban Adventure']",['John McTiernan'],"['Bruce Willis', 'Jeremy Irons', 'Samuel L. Jackson']",7.6,2h 8m,"John McClane and a Harlem store owner are targeted by German terrorist Simon in New York City, where he plans to rob the Federal Reserve Building.","In New York City, there are some explosions. Someone then calls the police claiming to be the man responsible and says his name is Simon. He wants Detective John McClane to do certain things, thing is McClane is not exactly in good spirits, he's been suspended from the force and is drinking. But Simon insists that's what he wants, McClane is sent to Harlem where he attracts the ire of some of the residents and a repairman named Zeus, helps him. Later Simon calls again and gives McClane his next instructions and wants Zeus to accompany him. Eventually they end up on Wall Street where there's another explosion. That's when the Feds show up and tells McClane who Simon is, a man who has a grudge against McClane. Simon calls and sends McClane on his next task but along the way McClane realizes that Simon is playing them.","['diversionary tactic', 'grudge', 'telephone booth', 'tough cop', 'bomb']"
Fast & Furious: Hobbs & Shaw,2019,"['Action', 'Adventure', 'Car Action', 'Dark Comedy', 'Thriller']",['David Leitch'],"['Dwayne Johnson', 'Jason Statham', 'Idris Elba']",6.5,2h 17m,Lawman Luke Hobbs and outcast Deckard Shaw form an unlikely alliance when a cyber-genetically enhanced villain threatens the future of humanity.,Lawman Luke Hobbs and outcast Deckard Shaw form an unlikely alliance when a cyber-genetically enhanced villain threatens the future of humanity.,"['samoa', 'flamethrower', 'fast and furious franchise', 'spin off', 'shared universe']"
Spider-Man: No Way Home,2021,"['Action', 'Adventure', 'Fantasy', 'Sci-Fi', 'Superhero', 'Supernatural Fantasy', 'Urban Adventure']",['Jon Watts'],"['Tom Holland', 'Zendaya', 'Benedict Cumberbatch']",8.2,2h 28m,"With Spider-Man's identity now revealed, Peter asks Doctor Strange for help. When a spell goes wrong, dangerous foes from other worlds start to appear.","With his identity compromised, right after the spectacular confrontation with super-hero charlatan Mysterio in Spider-Man: Far from Home (2019), Peter Parker is now with his back to the wall. On the run and having no one to turn to for advice, desperate Peter seeks a radical and equally dangerous solution to right a wrong, utterly unaware of the grave consequences of his ill-advised decision. And, as the unfathomable Multiverse expands with a vengeance, formidable adversaries from a not-so-distant past, too, seek closure, demanding the Spider's head on a platter. But when there's no way home and nowhere to hide, who can Parker trust?","['spider man character', 'superhero', 'marvel cinematic universe', 'multiverse', 'green goblin character']"
Red Cliff,2008,"['Action', 'Adventure', 'Drama', 'Historical Epic', 'History', 'War', 'War Epic', 'Wuxia']",['John Woo'],"['Tony Leung Chiu-wai', 'Takeshi Kaneshiro', 'Fengyi Zhang']",7.3,2h 28m,The first chapter of a two-part story centered on a battle fought in China's Three Kingdoms period (220-280 A.D.).,"In 208 A.D., in the Han Dinasty of China, the tyrannic and greedy Prime Minster Cao Cao forces the reluctant Emperor Han to declare war against the kingdoms of Liu Bei and Sun Quan in the South of China. Cao Cao heads with a mighty army of one million soldiers and attacks Liu Bei. His advisor and war strategist Zhuge Liang heads to South in a diplomatic mission trying to convince Sun Quan to join force with Liu Bei against the powerful warlord. When Zhuge Liang meets the viceroy Zhou Yu, he succeeds in his assignment with the alliance of the two kingdoms against Cao Cao. The armies fight against each other in many battles until the final one in Red Cliff where guile, knowledge and strategy prevail.","['han dynasty china', 'female spy', '3rd century', 'historical event', 'epic war']"
Full Metal Jacket,1987,"['Dark Comedy', 'Drama', 'Period Drama', 'War']",['Stanley Kubrick'],"['Matthew Modine', 'R. Lee Ermey', ""Vincent D'Onofrio""]",8.2,1h 56m,A pragmatic U.S. Marine observes the dehumanizing effects the Vietnam War has on his fellow recruits from their brutal boot camp training to the bloody street fighting in Hue.,"The exploits of J.T. Davis in two distinct phases of his time associated to the Vietnam War are presented. He generally acts as an active observer of the proceedings around him, either by his own choice or by the design of others often in authority. The first phase is as a recruit in basic training for the US Marine Corps at Parris Island, where he receives the nickname Joker from his platoon drill sergeant Gunnery Sergeant Hartman for his sarcastic quips, generally muttered under his breath, usually mimicking John Wayne. He learns that Hartman's foul-mouthed and uncompromisingly harsh ways, probably a reflection of the Marines as an organization, is not to produce robotic troops as some may believe, but rather produce killing machines. Joker's time in basic training is largely affected by Hartman using overweight and slightly slow Private Leonard Lawrence as the platoon's whipping boy, Lawrence's nickname provided by Hartman being Gomer Pyle for his ineptness as a service member, much like his television character namesake. The second phase is in active duty in Vietnam - Da Nang - he assigned to write for Stars and Stripes. He is not to write exactly what he sees but to skew stories in a way to boost serviceman morale and to convince non-military people the reason for American political and thus military involvement in this region of the world, especially in many Americans seeing the war as futile. In covering what becomes the Tet offensive, Joker may come face to face with the kill or be killed mentality without it jibing either with his own outlook or his assigned job.","['vietnam war', 'u.s. marine', 'military', 'drill instructor', 'boot camp']"
Star Wars: Episode VIII - The Last Jedi,2017,"['Action', 'Action Epic', 'Adventure', 'Adventure Epic', 'Fantasy', 'Fantasy Epic', 'Sci-Fi', 'Sci-Fi Epic', 'Space Sci-Fi']",['Rian Johnson'],"['Daisy Ridley', 'John Boyega', 'Mark Hamill']",6.9,2h 32m,Rey develops her abilities with the help of Luke Skywalker as the Resistance prepares for battle against the First Order.,"Following the battle of Starkiller Base, General Leia Organa leads Resistance forces to flee D'Qar when a First Order fleet arrives. Poe Dameron leads a costly counterattack that destroys a First Order dreadnought, but after the Resistance escapes to hyperspace, the First Order tracks them and attacks the Resistance convoy. Kylo Ren, Leia's son, hesitates to fire on the lead Resistance ship after sensing his mother's presence, but his wing-men destroy the bridge, killing most of the Resistance leadership and incapacitating Leia, who survives by using the Force. Disapproving of new leader Vice Admiral Holdo's passive strategy, Poe helps Finn, BB-8, and mechanic Rose Tico embark on a secret mission to disable the First Order's tracking device..","['wisecrack humor', 'deception', 'betrayal', 'mother son relationship', 'sabotage']"
Man of Tai Chi,2013,"['Action', 'Drama', 'Kung Fu', 'Martial Arts']",['Keanu Reeves'],"['Hu Chen', 'Keanu Reeves', 'Karen Mok']",6.0,1h 45m,A young martial artist's unparalleled Tai Chi skills land him in a highly lucrative underworld fight club.,"Hong Kong tycoon Donaka Mark runs an underground network of martial arts applied in bloody duels for paying audiences, live on site or pay per view. The police hasn't got anywhere investigating him after the suspicious death of an informer who was put down after a fight. Donaka also recruits in mainland China, looking for novelties at mainstream martial arts events. In Beijing he spots and invites Tiger' Chen Lin Hu, a bike courier and the last student of tai chi master Yang, who almost despairs if his gifted disciple will finally embrace the meditation-based lifestyle or be destroyed by the destructive logic of violent ambition. To make tiger accept fighting for cash, Donaka arranges for the temple to need a small fortune within a month to avoid demolition. Tiger wins fast, but descends in an amoral spiral and the police is on their trail.","['martial arts', 'tai chi', 'temple', 'chinese', 'martial art']"
Transporter 2,2005,"['Action', 'Car Action', 'Thriller']",['Louis Leterrier'],"['Jason Statham', 'Amber Valletta', 'Kate Nauta']",6.3,1h 27m,"Transporter Frank Martin, surfaces in Miami, Florida and is implicated in the kidnapping of the young son of a powerful USA official.","Former soldier turned hired criminal Frank Martin, now living in Miami, has been hired for his latest assignment. Frank has been hired as a bodyguard to Jack Billings, son of Jefferson Billings, a wealthy US official for the US government drug control organization who is attending a conference with the DEA. When Jack is kidnapped by a international crime boss known as Gianni and his associates including his murderous lover Lola and gets implicated in the kidnapping, Frank with help of trusted friend, French police detective Tarconi, sets out to rescue Jack and takes on Gianni and his henchmen, as Gianni infects Jack with a engineered virus which will infect those who come into contact with Jack, as Gianni plans to infect Jefferson and sabotage the conference.","['sequel', 'ex soldier', 'biological weapon', 'viral infection', 'deadly virus']"
Mad Max,1979,"['Action', 'Adventure', 'B-Action', 'Car Action', 'Desert Adventure', 'Dystopian Sci-Fi', 'Sci-Fi', 'Thriller']",['George Miller'],"['Mel Gibson', 'Joanne Samuel', 'Hugh Keays-Byrne']",6.8,1h 28m,"In a self-destructing world, a vengeful Australian policeman sets out to stop a violent motorcycle gang.","In a dystopic future Australia, a vicious violent biker gang murder nicknamed the Nightrider, a cop's family and make his fight with them personal. He escapes from police custody by killing an officer and stealing his vehicle. Max pursues the Nightrider in a high-speed chase, which results in the Nightrider's death by fiery explosion. Following the dangerous chase, which resulted in injuries for a number of officers, the police chief warns Max who thinks nothing of it at the time that now the bandits are out for him because of the death of the Nightrider. The biker gang, which is led by the Toecutter plans to avenge Nightrider's death by killing MFP officers. Toecutter's young protegé, the biker Johnny the Boy, sets a trap for Max's close friend and fellow officer, Jim Goose. When Goose's vehicle is flipped over, the bikers burn him alive in retaliation for the Nightrider's death.","['post apocalypse', 'dystopia', 'motorcycle gang', 'biker', 'revenge']"
The Hunger Games,2012,"['Action', 'Adventure', 'Dystopian Sci-Fi', 'Sci-Fi', 'Teen Adventure', 'Thriller']",['Gary Ross'],"['Jennifer Lawrence', 'Josh Hutcherson', 'Liam Hemsworth']",7.2,2h 22m,Katniss Everdeen voluntarily takes her younger sister's place in the Hunger Games: a televised competition in which two teenagers from each of the twelve Districts of Panem are chosen at ran... Read all,"In order to control future rebellions by remembering the past rebellion, the Powers That Be of the dystopian society of Panem force two youngsters from each of the twelve districts to participate in The Hunger Games. The rules are very simple: the twenty-four players must kill each other and survive in the wilderness until only one remains. The games are broadcast through the Capital and the twelve districts to entertain and intimidate the population. In District 12, teenager Katniss Everdeen is a great hunter and archer. When her younger sister, Primrose Everdeen, is selected as one of the ""tributes"" of their district, Katniss volunteers to take her place in the games. Together with Peeta Mellark, they head by train to the Capital to be prepared for the brutal game.","['female protagonist', 'self survival', 'tough girl', 'teenage killer', 'child murderer']"
The Matrix Revolutions,2003,"['Action', 'Cyberpunk', 'Gun Fu', 'Martial Arts', 'Sci-Fi', 'Sci-Fi Epic', 'Superhero']","['Lana Wachowski', 'Lilly Wachowski']","['Keanu Reeves', 'Laurence Fishburne', 'Carrie-Anne Moss']",6.7,2h 9m,The human city of Zion defends itself against the massive invasion of the machines as Neo fights to end the war at another front while also opposing the rogue Agent Smith.,"After single-handedly defeating the unstoppable Sentinels in The Matrix Reloaded (2003), Neo finds himself trapped between the Matrix and the machine world. And as Trinity and Morpheus cut a deal with the hateful Merovingian, indestructible Agent Smith grows stronger by the minute, bent on destroying Neo once and for all. With Zion under attack, the remaining humans brace themselves up to make their tragic and heroic last stand against the enemy, and Neo makes a pivotal, game-changing decision. Now, the fate of humankind is hanging by a thread. Can Neo's prophetic visions ensure victory?","['good versus evil', '2200s', 'human versus machine', 'one against many', 'virtual reality simulation']"
Bloodsport,1988,"['Action', 'B-Action', 'Biography', 'Drama', 'Sport']",['Newt Arnold'],"['Jean-Claude Van Damme', 'Donald Gibb', 'Leah Ayres']",6.8,1h 32m,"""Bloodsport"" follows Frank Dux, an American martial artist serving in the military, who decides to leave the army to compete in a martial arts tournament in Hong Kong where fights to the dea... Read all","Frank Dux is an American martial artist. His former teacher in the martial arts, gives him an invitation to ""The Kumite"", the secret martial arts tournament where only the world's best fighters are invited. Frank shows up in Hong Kong for the tournament, but his CO's in the US Army are right on his tail. Frank wins match after match, and shows promise that he may be the first person from the Western Hemisphere to win the tournament, until the defending champion gets his hands on Frank's friend, Jackson, and injures him in the Quarter Finals. Now Frank faces an uphill climb. His friend is hurt, the US Army is on his tail, and he is on the verge of making martial arts history. The question is, will he?","['tournament', 'martial arts', 'kumite', 'hand to hand combat', 'martial arts tournament']"
Jack Reacher,2012,"['Action', 'Conspiracy Thriller', 'One-Person Army Action', 'Thriller']",['Christopher McQuarrie'],"['Tom Cruise', 'Rosamund Pike', 'Richard Jenkins']",7.0,2h 10m,"Jack Reacher, a homicide investigator, digs deeper into a case involving a trained military sniper responsible for a mass shooting.","When a crazed sniper guns down five seemingly random people on a crowded Pittsburgh riverfront, Det. Emerson (David Oyelowo) quickly amasses enough evidence at the scene to implicate an unstable ex-military sniper named James Barr (Joseph Sikora). Upon being questioned by Emerson and DA Alex Rodin (Richard Jenkins), however, Barr demands to speak with Jack Reacher (Tom Cruise). A former military investigator who fell off the grid following his service, Reacher soon shows up on the scene and begins gathering clues with the aid of talented defense attorney Helen Rodin (Rosamund Pike), the daughter of the DA. Meanwhile, when Reacher is assaulted in a local bar, he correctly surmises that someone is determined to impede his investigation. His theory plays out when he becomes the prime suspect in the murder of a young woman shortly thereafter. Now, with the police closing in from one side and a gang of ruthless killers gaining ground on the other, Reacher must use his formidable detective skills in order to catch the gunman and uncover his true motives.","['sniper', 'coma', 'action hero', 'murder mystery', 'brawl']"
Dragon,2011,"['Action', 'Crime', 'Drama', 'Thriller', 'Wuxia']",['Peter Ho-Sun Chan'],"['Donnie Yen', 'Takeshi Kaneshiro', 'Tang Wei']",7.0,1h 55m,A papermaker gets involved with a murder case concerning two criminals leading to a determined detective suspecting him and the former's vicious father searching for him.,A papermaker gets involved with a murder case concerning two criminals leading to a determined detective suspecting him and the former's vicious father searching for him.,"['martial arts', 'wuxia', 'acupuncture', 'master', 'chinese']"
Face/Off,1997,"['Action', 'Crime', 'Gun Fu', 'Sci-Fi', 'Thriller']",['John Woo'],"['John Travolta', 'Nicolas Cage', 'Joan Allen']",7.3,2h 18m,"To foil a terrorist plot, FBI agent Sean Archer assumes the identity of the criminal Castor Troy who murdered his son through facial transplant surgery, but the crook wakes up prematurely an... Read all","While trying to kill the FBI Agent Sean Archer with a sniper rifle, the terrorist Castor Troy hits him on the chest but accidentally kills his little son Mike. The relentless Archer hunts down Troy and his brother Pollux Troy and he learns that Troy has planted a bomb in Los Angeles that will provoke destruction and several killings. Archer and his team chase Castor and Pollux in the airport, and he kills Castor, but Pollux survives. He is the only person who knows where the bomb is, but he refuses to tell. FBI Agents Hollis Miller and Tito Biondi propose an alternative secret solution to find the location of the bomb. The surgeon Dr. Malcolm Walsh would remove the face of Castor Troy, who is alive in coma in a hospital, and replace Sean Archer's face by his. Then he would go to the prison where Pollux is to learn the location of the bomb. Archer accepts the arrangements and finds where the bomb is. But the problem is that Castor has woken up from the coma, forced Dr. Walsh to put Sean Archer's face on him and killed everybody that knows the operation, burning the hospital do the ashes. Now, how can the real Archer prove his identity?","['face transplant', 'death of child', 'gore', 'face ripped off', 'severed face']"
Rumble in the Bronx,1995,"['Action', 'Comedy', 'Crime', 'Dark Comedy', 'Kung Fu', 'Martial Arts', 'Thriller']",['Stanley Tong'],"['Jackie Chan', 'Anita Mui', 'Françoise Yip']",6.8,1h 44m,A young man visiting and helping his uncle in New York City finds himself forced to fight a street gang and the mob with his martial art skills.,"Keong comes from Hong Kong to visit New York for his uncle's wedding. His uncle runs a market in the Bronx and Keong offers to help out while Uncle is on his honeymoon. During his stay in the Bronx, Keong befriends a neighbor kid and beats up some neighborhood thugs who cause problems at the market. Meanwhile, one of those petty thugs in the local gang stumbles into a criminal situation way over his head. Blinded by greed, his involvement draws his gang, the kid, Keong, and the whole neighborhood into a deadly crossfire. When the lazy cops fail to successfully resolve matters, Keong takes things into his own hands. Needless to say, much spectacular kung-fu and outrageous action sequences follow....","['chinese american', 'body lands on a car', 'child in jeopardy', 'martial arts action', 'dark comedy']"
Bullet Train,2022,"['Action', 'Comedy', 'One-Person Army Action', 'Thriller']",['David Leitch'],"['Brad Pitt', 'Joey King', 'Aaron Taylor-Johnson']",7.3,2h 7m,Five assassins aboard a swiftly-moving bullet train find out that their missions have something in common.,"Code name, Ladybug, is a hitman who's had difficult times and is working with a therapist to reduce stress in his life and to be more positive. He's actively trying to be less violent. He's given an assignment to steal a briefcase from a moving bullet train. Unfortunately for Ladybug, the case is in the possession of other hitmen. Oh, and there are also several other dangerous people and 1 snake aboard the train. As the journey progresses, the briefcase changes hands several times and the reason so many violent people are on the train becomes apparent. The plot is carried on by the device of sudden cuts to earlier events to explain how the current situation develops.","['bullet train', 'train', 'assassin', 'fight on a train', 'japan']"
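Note on the CSV layout: the genres, directors, stars and keywords columns hold Python list literals as text, quoted whenever they contain commas. The sketch below is one possible way to read such a row with only the standard library. It uses an abridged copy of the Transporter 2 row (description and storyline omitted for brevity), and it swaps in `ast.literal_eval` where the scripts in this commit call `eval`. The names `SAMPLE`, `LIST_COLUMNS` and `read_rows` are illustrative assumptions, not part of the commit.

```python
import ast
import csv
import io

# Abridged excerpt of the file above: header and one row, with the long
# description/storyline columns left out (assumption made for brevity).
SAMPLE = '''title,release_date,genres,directors,stars,rating,duration,keywords
Transporter 2,2005,"['Action', 'Car Action', 'Thriller']",['Louis Leterrier'],"['Jason Statham', 'Amber Valletta', 'Kate Nauta']",6.3,1h 27m,"['sequel', 'ex soldier', 'biological weapon', 'viral infection', 'deadly virus']"
'''

# Columns that store a Python list literal as text.
LIST_COLUMNS = ("genres", "directors", "stars", "keywords")


def read_rows(text):
    """Parse CSV text and turn the list-like columns into real lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in LIST_COLUMNS:
            # literal_eval accepts only Python literals, unlike eval,
            # which would execute arbitrary expressions found in the data.
            row[column] = ast.literal_eval(row[column])
        rows.append(row)
    return rows


rows = read_rows(SAMPLE)
print(rows[0]["title"], rows[0]["genres"])
```

Scalar columns such as rating stay strings with this approach; pandas would infer their types when loading the same file.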
58
src/model/preprocess_data.py
Normal file
@@ -0,0 +1,58 @@
import pandas as pd

def convert_duration_to_minutes(duration):

    if not isinstance(duration, str):
        return None
    try:
        parts = duration.split()
        hours = int(parts[0][:-1]) if 'h' in parts[0] else 0
        minutes = int(parts[1][:-1]) if len(parts) > 1 and 'm' in parts[1] else 0
        return hours * 60 + minutes
    except (ValueError, IndexError):
        return None


def load_and_preprocess(data_path, save_path=None):
    print("Loading data...")

    df = pd.read_csv(data_path)

    print("Preprocessing data...")

    df['duration_minutes'] = df['duration'].apply(convert_duration_to_minutes)

    columns_to_parse = ['genres', 'directors', 'stars', 'keywords']
    for column in columns_to_parse:
        df[column] = df[column].apply(eval)


    df['description'] = df['description'].fillna('').astype(str)
    df['storyline'] = df['storyline'].fillna('').astype(str)


    df['combined_text'] = (
        df['description'] + " " +
        df['storyline'] + " " +
        df['keywords'].apply(lambda x: " ".join(x))
    )


    df = df.drop(columns=['description', 'storyline', 'keywords', 'duration'])


    print("Preprocessing complete.")

    if save_path:
        df.to_csv(save_path, index=False)
        print(f"Processed data saved to {save_path}")

    return df

data_path = "../../data/movies_data.csv"
save_path = "../../data/preprocessed_data.csv"

df = load_and_preprocess(data_path, save_path)

# print("\nPreprocessed Data Preview:")
# print(df.head())
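Note on duration parsing: `convert_duration_to_minutes` reads minutes only from a second whitespace-separated token, so a minutes-only string such as "45m" yields 0 rather than 45. Every row in movies_data_short.csv has the "Xh Ym" form, so the sample file does not exercise this case. The following is a minimal regex-based sketch that accepts "Xh Ym", "Xh" and "Ym"; `parse_duration` and `_DURATION_RE` are hypothetical names, not part of the commit.

```python
import re

# Hypothetical alternative to convert_duration_to_minutes; not part of the commit.
# Both the hours part and the minutes part are optional.
_DURATION_RE = re.compile(r"^\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?\s*$")


def parse_duration(duration):
    """Return total minutes for 'Xh Ym', 'Xh' or 'Ym'; None for anything else."""
    if not isinstance(duration, str):
        # Covers missing values, which pandas represents as float NaN.
        return None
    match = _DURATION_RE.match(duration)
    if match is None or not any(match.groups()):
        # No match at all, or an empty string with neither part present.
        return None
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


for text in ("2h 1m", "2h", "45m", ""):
    print(repr(text), parse_duration(text))
```

Anchoring the pattern at both ends means trailing junk is rejected instead of being silently ignored, which is a design choice of this sketch rather than behaviour of the committed code.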
80
src/model/test.py
Normal file
@@ -0,0 +1,80 @@
import pandas as pd


def convert_duration_to_minutes(duration):
    """
    Convert duration from 'h m' format to total minutes.

    Args:
        duration (str): Duration in the format 'Xh Ym'.

    Returns:
        int: Total duration in minutes.
    """
    try:
        # Split the duration into hours and minutes
        parts = duration.split()
        hours = int(parts[0][:-1]) if 'h' in parts[0] else 0
        minutes = int(parts[1][:-1]) if len(parts) > 1 and 'm' in parts[1] else 0
        return hours * 60 + minutes
    except (ValueError, IndexError):
        return None  # Return None if the format is invalid


def load_and_preprocess(data_path, save_path=None):
    """
    Load and preprocess movie data from a CSV file.

    Args:
        data_path (str): Path to the movies_data.csv file.
        save_path (str, optional): Path to save the processed data. Defaults to None.

    Returns:
        pd.DataFrame: Preprocessed DataFrame.
    """
    print("Loading data...")
    # Load dataset
    df = pd.read_csv(data_path)

    print("Preprocessing data...")
    # Parse columns with list-like strings
    columns_to_parse = ['genres', 'directors', 'stars', 'keywords']
    for column in columns_to_parse:
        df[column] = df[column].apply(eval)

    # Handle missing values in text fields
    df['description'] = df['description'].fillna('').astype(str)
    df['storyline'] = df['storyline'].fillna('').astype(str)

    # Combine relevant fields for embedding generation
    df['combined_text'] = (
        df['description'] + " " +
        df['storyline'] + " " +
        df['keywords'].apply(lambda x: " ".join(x))
    )

    # Convert duration to minutes
    print("Converting duration to minutes...")
    df['duration_minutes'] = df['duration'].apply(convert_duration_to_minutes)

    print("Preprocessing complete.")

    # Save the processed data if save_path is provided
    if save_path:
        df.to_csv(save_path, index=False)
        print(f"Processed data saved to {save_path}")

    return df



# Specify input and output file paths
input_path = "../../data/movies_data.csv"  # Replace with your dataset path
output_path = "../../data/processed_data.csv"

# Preprocess data and save
df = load_and_preprocess(input_path, save_path=output_path)

# Print a preview of the processed data
print("\nFirst few rows of the preprocessed data:")
print(df[['title', 'duration', 'duration_minutes']].head())
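Note on the data paths: both scripts hard-code paths such as "../../data/movies_data.csv", which resolve against the current working directory and so only work when the script is launched from src/model. One possible alternative is to derive the path from the script's own location. This sketch assumes that data/ sits at the repository root, two directories above src/model, as the file paths in this commit suggest; `repo_data_file` and its `levels_up` parameter are hypothetical and not part of the commit.

```python
from pathlib import Path


def repo_data_file(script_file, filename, levels_up=2):
    """Build <repo root>/data/<filename> from a script's location.

    Hypothetical helper: levels_up=2 assumes the script lives in src/model,
    so the repository root is two directories above the script's folder.
    """
    script = Path(script_file).resolve()
    return script.parents[levels_up] / "data" / filename


# Inside preprocess_data.py one would pass __file__; a literal path is used
# here so the example runs on its own.
example = repo_data_file("/repo/src/model/preprocess_data.py", "movies_data.csv")
print(example)
```

With a helper like this, the script behaves the same regardless of the directory it is started from.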