Movie-sDataAnalyticsUsingPig-

Movies have been a major source of entertainment since their inception in the late 19th century. The term "movie" is broad: it spans many languages and genres such as drama, comedy, science fiction, and action. The volume of movie data accumulated over the years is vast, and analyzing it calls for big data analytics rather than traditional analytics techniques. In this paper I take a movie dataset and run a set of queries against it using Pig, to surface insights that could support an effective recommendation system and ratings for upcoming movies.

Problem statement:

  1. Find the number of movies released between 1950 and 1960.
  2. Find the number of movies with a rating above 4.
  3. Find the movies whose ratings are between 3 and 4.
  4. Find the number of movies with a duration of more than 2 hours (7200 seconds).
  5. Find the list of years and number of movies released each year.
  6. Find the total number of movies in the dataset.
  7. Find a sample percentage of the movies in the dataset.
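
To make the intent of queries 1 to 6 concrete, here is a minimal plain-Python sketch. It is not the Pig script used in this project. The rows are made up for illustration, the tuple layout follows the dataset description, and the choice of inclusive or exclusive bounds is an assumption, since the problem statement does not specify it. Query 7 is left out because the sampling fraction is not stated.

```python
from collections import Counter

# Hypothetical rows: (movie_id, name, year, rating, duration_seconds).
# These values are invented for illustration and are not from the dataset.
movies = [
    (1, "Movie A", 1955, 4.2, 7500),
    (2, "Movie B", 1958, 3.5, 6800),
    (3, "Movie C", 1972, 3.9, 9000),
    (4, "Movie D", 1972, 2.1, 5400),
    (5, "Movie E", 1960, 4.6, 7200),
]

# 1. Movies released between 1950 and 1960 (assumed inclusive on both ends).
released_1950_1960 = sum(1 for m in movies if 1950 <= m[2] <= 1960)

# 2. Movies with a rating above 4.
rated_above_4 = sum(1 for m in movies if m[3] > 4)

# 3. Names of movies rated between 3 and 4 (assumed exclusive on both ends).
rated_3_to_4 = [m[1] for m in movies if 3 < m[3] < 4]

# 4. Movies strictly longer than 2 hours (7200 seconds).
longer_than_2h = sum(1 for m in movies if m[4] > 7200)

# 5. Number of movies released in each year.
per_year = Counter(m[2] for m in movies)

# 6. Total number of movies.
total = len(movies)

print(released_1950_1960, rated_above_4, rated_3_to_4,
      longer_than_2h, dict(per_year), total)
```

Each query is a filter followed by a count or a grouping, which is the same shape the analysis takes regardless of the tool used to run it.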

Dataset description:

Column1: Movie ID

Column2: Movie name

Column3: Year of release

Column4: Rating of the movie

Column5: Movie duration in seconds
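
As an illustration of this five-column layout, the sketch below parses records into typed fields in plain Python. The comma delimiter, the `Movie` type, the `parse_movies` helper, and the sample line are all assumptions made for this example; the actual file format is not stated here.

```python
import csv
from typing import Iterable, List, NamedTuple


class Movie(NamedTuple):
    """One record, with fields in the column order listed above."""
    movie_id: int
    name: str
    year: int
    rating: float
    duration_seconds: int


def parse_movies(lines: Iterable[str]) -> List[Movie]:
    # Assumes comma-separated values; csv.reader also handles quoted names
    # that contain commas.
    return [
        Movie(int(row[0]), row[1], int(row[2]), float(row[3]), int(row[4]))
        for row in csv.reader(lines)
    ]


# A made-up record, not taken from the dataset.
sample = parse_movies(["1,Example Movie,1955,3.8,7320"])[0]
print(sample.name, sample.year, sample.duration_seconds > 7200)
```

Converting the year, rating, and duration to numeric types up front is what allows range and threshold comparisons such as "rating above 4" or "duration over 7200 seconds" to be expressed directly.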