I am a Data Scientist. I enjoy analysing large datasets and deriving meaningful information from them. A technology lover with a hunger for learning automation tools and Big Data/Hadoop technologies, I am eager to apply my knowledge to solving business problems.
I write code to analyse data and derive meaningful information, manage that code in a version control system and apply changes according to business needs, and write scripts to automate processes and reduce processing time.
I have worked at TCS for the past 2 years, where my distributed team and I help people transform their code and manage change, and analyse large sets of data to provide meaningful information to end users.
In my free time I read novels, learn automation technologies, travel to new places and enjoy the taste of different foods.
This article covers the sentiment analysis of any topic by parsing tweets fetched from Twitter using Apache Flume. Sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. In this project, the main fields of interest are Retweet_Count and Screen_Name. For more info, kindly click on SentimentalAnalysisOfTwitter.
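The "positive, negative or neutral" decision can be sketched with a tiny lexicon-based classifier. This is only an illustration of the idea, not the project's actual pipeline (which ingests tweets through Apache Flume); the word lists and the `classify` function are assumptions for the example.

```python
# Minimal lexicon-based sentiment sketch. The word lists below are
# hypothetical stand-ins for a real sentiment lexicon.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def classify(text):
    """Return 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great product"))  # positive
```

Real systems weight words and handle negation, but the core idea, scoring text against opinion lexicons, is the same.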
The proposed method considers the following scenario: an airport has a huge amount of data related to the number of flights, the dates and times of arrival and departure, flight routes, the number of airports operating in each country, and the list of active airlines in each country. The problem faced until now is that only a limited amount of data could be analysed from the databases. The intention of the proposed model is to provide a platform for new analytics on the airline data, based on a set of queries. Kindly click on AirlinesAnalysis.
Analysis of structured data has seen tremendous success in the past. However, analysis of large-scale unstructured data, such as video, remains a challenging area. YouTube, a Google company, has over a billion users and generates billions of views. Since YouTube data is being created in huge volumes and at great speed, there is a huge demand to store, process and carefully study this data to make it usable. The main objective of this project is to demonstrate, using Hadoop concepts, how data generated from YouTube can be mined and utilized to make targeted, real-time and informed decisions. For more info, click on Youtube-sDataAnalytics.
Movies have been a great source of entertainment ever since their inception in the late 19th century. The term "movie" is very broad, spanning languages and genres such as drama, comedy, science fiction and action. The data about movies over the years is vast, and to analyse it there is a need to break away from traditional analytics techniques and adopt big data analytics. In this project I have taken a dataset on movies and analysed it against various queries to uncover real nuggets from the data for an effective recommendation system and for rating upcoming movies. For more info, click on Movie-sDataAnalyticsUsingPig.
Here, we have chosen a loan dataset on which we have performed MapReduce operations. We have framed several problems to fetch data from it, of which only two are explained here. You can try the rest at your end; the solutions are given in the next section.
For complete info, click on LoanProjectusingMapReduce
We need to use HiveQL commands to analyse stock data from a 'New York Stock Exchange' dataset and calculate the covariance for a pair of stocks. Dataset: a comma-separated file (CSV) named 'NYSE_daily_prices_Q.csv' that contains stock information, such as daily quotes, from the New York Stock Exchange. Covariance: this finance term represents the degree to which two stocks or financial instruments move together or apart from each other. With covariance, investors can seek out investment options that match their respective risk profiles. It is a statistical measure of how one investment moves in relation to another.
For complete code, kindly view on Stock-sCovarianceCalculationUsingHive
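The quantity the Hive query computes can be checked by hand. Here is a plain-Python sketch of the population covariance, the mean of the products of each series' deviations from its own mean; the price series below are made-up sample closes, not values from NYSE_daily_prices_Q.csv.

```python
# Population covariance: mean of products of per-series deviations.
def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Hypothetical closing prices for two stocks over four days.
stock_a = [10.0, 11.0, 12.0, 13.0]
stock_b = [20.0, 21.0, 22.0, 23.0]
print(covariance(stock_a, stock_b))  # 1.25 -> positive: they move together
```

A positive result means the stocks tend to move in the same direction; a negative result means they move apart.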
Here we use four files to execute the Health Care use case: myudf.jar – the UDF used to de-identify the healthcare dataset; Healthcare_Sample_dataset1 – the input dataset containing patient information; DeIdentify.java – the Java source code of the UDF; myqueries.q – the Hive queries to be executed.
For complete code, kindly view on HiveHealthCare
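To give a feel for what a de-identification UDF does, here is a hedged Python sketch: it replaces direct identifiers with a one-way hash while leaving clinical fields intact. The field names and hashing choice are assumptions for illustration; the project's real UDF is DeIdentify.java, compiled into myudf.jar and invoked from Hive.

```python
import hashlib

# Assumed identifier fields; the real dataset's schema may differ.
SENSITIVE = ("name", "address", "dob")

def deidentify(record):
    """Replace direct identifiers with a short one-way hash;
    keep clinical fields (e.g. disease) for analysis."""
    masked = dict(record)
    for field in SENSITIVE:
        if field in masked:
            masked[field] = hashlib.sha256(masked[field].encode()).hexdigest()[:8]
    return masked

patient = {"name": "John Doe", "dob": "1970-01-01", "disease": "flu"}
print(deidentify(patient)["disease"])  # clinical field preserved: flu
```

Hashing rather than deleting lets analysts still join records belonging to the same (anonymised) patient.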
Pig code and a dependent UDF for finding all information regarding a patient, such as address, DOB, disease, etc.
source: HealthCareUseCaseUsingPig
Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world. – Atul Butte, Stanford
Big Data is a buzzword making the rounds in almost all industries. The industry we will specifically speak about today is 'Healthcare'.
So, how is Big Data helping the healthcare sector?
Like any other sector, the healthcare sector generates vast amounts of data. This data comes from various sources such as Electronic Health Records (EHRs), labs, imaging systems, medical correspondence, claims, database systems and finance.
For code, click on Bigdata_Example_in_Healthcare
A MapReduce program to process a dataset with temperature records. You need to find the hot and cold days in a year based on the maximum and minimum temperatures on those days. The dataset for this problem is the 'WeatherData' records file available in your LMS; it has been taken from the National Climatic Data Center (NCDC) public datasets.
Fork on Find_Hot-_Cold_Days_mapreduce
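The mapper's job here is simply to label each day by comparing its temperatures against thresholds. The sketch below simulates that logic in Python; the thresholds (hot above 40, cold below 10) and the pre-parsed record format are assumptions for illustration, since the real job parses raw NCDC-formatted lines.

```python
# Python simulation of the Hot/Cold-days mapper. Thresholds are assumed.
HOT_ABOVE = 40   # a day is "hot" if its max temperature exceeds this
COLD_BELOW = 10  # a day is "cold" if its min temperature falls below this

def mapper(record):
    """record: (date, max_temp, min_temp) -> yields (date, label) pairs."""
    date, tmax, tmin = record
    if tmax > HOT_ABOVE:
        yield (date, "hot")
    if tmin < COLD_BELOW:
        yield (date, "cold")

# Made-up sample records standing in for the WeatherData file.
records = [("2014-01-01", 45, 25), ("2014-01-15", 20, 5), ("2014-02-01", 30, 15)]
for rec in records:
    for date, label in mapper(rec):
        print(date, label)
```

Days that are neither hot nor cold emit nothing, so only the interesting days reach the output, which is exactly why a filter like this fits the map phase.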
Maximum Temperature Calculation using MapReduce: a MapReduce program to process a dataset with multiple temperature readings per year. You need to process the dataset to find the maximum temperature for each year.
Problem Explanation: In this dataset, the first field represents the year and the second field represents a temperature recorded in that year. As the temperature is not constant throughout the year, each year has multiple temperatures listed. You need to process the dataset and find the maximum temperature for each year. Here is the sample output:
Sample Output:
Year	Maximum Temperature
1900	36
1901	48
1902	49
Our task is to process the ‘temperature’ records using MapReduce program and find out the maximum temperature during each year in this dataset.
Fork on Maximum_Temp_Calculation_mapreduce
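The logic above maps cleanly onto the two MapReduce phases: the mapper emits a (year, temperature) pair per line, and the reducer keeps the maximum per year. Here is a plain-Python simulation of both phases; the input lines are made-up samples in the "year temperature" format described above, not the LMS dataset itself.

```python
from collections import defaultdict

def mapper(line):
    """'1900 36' -> ('1900', 36): emit a (year, temp) pair per record."""
    year, temp = line.split()
    return year, int(temp)

def reducer(pairs):
    """Keep the maximum temperature seen for each year."""
    maxima = defaultdict(lambda: float("-inf"))
    for year, temp in pairs:
        maxima[year] = max(maxima[year], temp)
    return dict(maxima)

# Made-up sample records in the same "year temperature" shape.
lines = ["1900 12", "1900 36", "1901 48", "1901 22", "1902 49"]
print(reducer(map(mapper, lines)))  # {'1900': 36, '1901': 48, '1902': 49}
```

In a real Hadoop job the framework shuffles the mapper's pairs so each reducer receives all temperatures for one year; the dictionary here plays that grouping role.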
Fork on Pig_Concepts
Fork on AdvanceMapreduce
Edureka Certified Big Data and Hadoop Developer Certification Path
Edureka Certified Data Warehousing Professional