I am a Data Scientist. I enjoy analysing large datasets and deriving meaningful information from them. A technology lover with a hunger for learning automation tools and Big Data/Hadoop technologies, I am eager to apply my knowledge to solving business problems.
I write code to analyse data and derive meaningful information, manage that code in a version control system and apply changes according to business needs, and write scripts to automate processes and reduce processing time.
I have worked at TCS for the past 2 years, where my distributed team and I help people transform their code and manage change, and analyse large sets of data to provide meaningful information to end users.
In my free time I read novels, learn automation technologies, travel to new places and enjoy the taste of different foods.
This article covers the sentiment analysis of any topic by parsing tweets fetched from Twitter using Apache Flume. Sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. In this project, the main fields of interest are Retweet_Count and Screen_Name. For more info, kindly click on SentimentalAnalysisOfTwitter.
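The "positive, negative or neutral" decision can be sketched with a tiny lexicon-based classifier. This is only an illustration of the idea, not the project's actual pipeline (which ingests tweets through Apache Flume); the word lists and the `classify` function are assumptions for the example.

```python
# Minimal lexicon-based sentiment sketch. The word lists below are
# hypothetical stand-ins for a real sentiment lexicon.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "sad", "awful"}

def classify(text):
    """Return 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great product"))  # positive
```

Real systems weight words and handle negation, but the core idea, scoring text against opinion lexicons, is the same.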
The proposed method considers the following scenario: an airport has a huge amount of data related to the number of flights, the dates and times of arrival and departure, flight routes, the number of airports operating in each country, and the list of active airlines in each country. The problem faced until now is that only a limited amount of data could be analysed from the databases. The intention of the proposed model is to provide a platform for new analytics on the airline data, based on a set of queries. Kindly click on AirlinesAnalysis.
Analysis of structured data has seen tremendous success in the past. However, analysis of large-scale unstructured data, such as video, remains a challenging area. YouTube, a Google company, has over a billion users and generates billions of views. Since YouTube data is being created in huge volumes and at great speed, there is a huge demand to store, process and carefully study this data to make it usable. The main objective of this project is to demonstrate, using Hadoop concepts, how data generated from YouTube can be mined and utilized to make targeted, real-time and informed decisions. For more info, click on Youtube-sDataAnalytics.
Movies have been a great source of entertainment ever since their inception in the late 19th century. The term "movie" is very broad, spanning languages and genres such as drama, comedy, science fiction and action. The data about movies over the years is vast, and to analyse it there is a need to break away from traditional analytics techniques and adopt big data analytics. In this project I have taken a dataset on movies and analysed it against various queries to uncover real nuggets from the data for an effective recommendation system and for rating upcoming movies. For more info, click on Movie-sDataAnalyticsUsingPig.
Here, we have chosen a loan dataset on which we have performed MapReduce operations. We have framed several problems to fetch data from it, of which only two are explained here. You can try the rest at your end; the solutions are given in the next section.
For complete info, click on LoanProjectusingMapReduce
We need to use HiveQL commands to analyse stock data from a 'New York Stock Exchange' dataset and calculate the covariance for a pair of stocks. Dataset: a comma-separated file (CSV) named 'NYSE_daily_prices_Q.csv' that contains stock information, such as daily quotes, from the New York Stock Exchange. Covariance: this finance term represents the degree to which two stocks or financial instruments move together or apart from each other. With covariance, investors can seek out investment options that match their respective risk profiles. It is a statistical measure of how one investment moves in relation to another.
For complete code, kindly view on Stock-sCovarianceCalculationUsingHive
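The quantity the Hive query computes can be checked by hand. Here is a plain-Python sketch of the population covariance, the mean of the products of each series' deviations from its own mean; the price series below are made-up sample closes, not values from NYSE_daily_prices_Q.csv.

```python
# Population covariance: mean of products of per-series deviations.
def covariance(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Hypothetical closing prices for two stocks over four days.
stock_a = [10.0, 11.0, 12.0, 13.0]
stock_b = [20.0, 21.0, 22.0, 23.0]
print(covariance(stock_a, stock_b))  # 1.25 -> positive: they move together
```

A positive result means the stocks tend to move in the same direction; a negative result means they move apart.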
Here we use four files to execute the Health Care use case: myudf.jar – the UDF used to de-identify the healthcare dataset; Healthcare_Sample_dataset1 – the input dataset containing patient information; DeIdentify.java – the Java source code of the UDF; myqueries.q – the Hive queries to be executed.
For complete code, kindly view on HiveHealthCare
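To give a feel for what a de-identification UDF does, here is a hedged Python sketch: it replaces direct identifiers with a one-way hash while leaving clinical fields intact. The field names and hashing choice are assumptions for illustration; the project's real UDF is DeIdentify.java, compiled into myudf.jar and invoked from Hive.

```python
import hashlib

# Assumed identifier fields; the real dataset's schema may differ.
SENSITIVE = ("name", "address", "dob")

def deidentify(record):
    """Replace direct identifiers with a short one-way hash;
    keep clinical fields (e.g. disease) for analysis."""
    masked = dict(record)
    for field in SENSITIVE:
        if field in masked:
            masked[field] = hashlib.sha256(masked[field].encode()).hexdigest()[:8]
    return masked

patient = {"name": "John Doe", "dob": "1970-01-01", "disease": "flu"}
print(deidentify(patient)["disease"])  # clinical field preserved: flu
```

Hashing rather than deleting lets analysts still join records belonging to the same (anonymised) patient.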
Pig code and a dependent UDF for finding all information regarding a patient, such as address, DOB, disease, etc.
source: HealthCareUseCaseUsingPig
Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world. – Atul Butte, Stanford
Big Data is a buzzword making the rounds in almost all industries. The industry we will specifically speak about today is 'Healthcare'.
So, how is Big Data helping the healthcare sector?
Like any other sector, the healthcare sector generates vast amounts of data. This data comes from various sources such as Electronic Health Records (EHRs), labs, imaging systems, medical correspondence, claims, database systems and finance.
For code, click on Bigdata_Example_in_Healthcare
A MapReduce program to process a dataset with temperature records. You need to find the hot and cold days in a year based on the maximum and minimum temperatures on those days. The dataset for this problem is the 'WeatherData' records file available in your LMS; it has been taken from the National Climatic Data Center (NCDC) public datasets.
Fork on Find_Hot-_Cold_Days_mapreduce
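The mapper's job here is simply to label each day by comparing its temperatures against thresholds. The sketch below simulates that logic in Python; the thresholds (hot above 40, cold below 10) and the pre-parsed record format are assumptions for illustration, since the real job parses raw NCDC-formatted lines.

```python
# Python simulation of the Hot/Cold-days mapper. Thresholds are assumed.
HOT_ABOVE = 40   # a day is "hot" if its max temperature exceeds this
COLD_BELOW = 10  # a day is "cold" if its min temperature falls below this

def mapper(record):
    """record: (date, max_temp, min_temp) -> yields (date, label) pairs."""
    date, tmax, tmin = record
    if tmax > HOT_ABOVE:
        yield (date, "hot")
    if tmin < COLD_BELOW:
        yield (date, "cold")

# Made-up sample records standing in for the WeatherData file.
records = [("2014-01-01", 45, 25), ("2014-01-15", 20, 5), ("2014-02-01", 30, 15)]
for rec in records:
    for date, label in mapper(rec):
        print(date, label)
```

Days that are neither hot nor cold emit nothing, so only the interesting days reach the output, which is exactly why a filter like this fits the map phase.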
Maximum Temperature Calculation using MapReduce: a MapReduce program to process a dataset with multiple temperature readings per year. You need to process the dataset to find the maximum temperature for each year.
Problem Explanation: In this dataset, the first field represents the year and the second field represents a temperature recorded in that year. As the temperature is not constant throughout the year, each year has multiple temperatures listed. You need to process the dataset and find the maximum temperature for each year. Here is the sample output:
Sample Output:
Year	Maximum Temperature
1900	36
1901	48
1902	49
Our task is to process the ‘temperature’ records using MapReduce program and find out the maximum temperature during each year in this dataset.
Fork on Maximum_Temp_Calculation_mapreduce
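The logic above maps cleanly onto the two MapReduce phases: the mapper emits a (year, temperature) pair per line, and the reducer keeps the maximum per year. Here is a plain-Python simulation of both phases; the input lines are made-up samples in the "year temperature" format described above, not the LMS dataset itself.

```python
from collections import defaultdict

def mapper(line):
    """'1900 36' -> ('1900', 36): emit a (year, temp) pair per record."""
    year, temp = line.split()
    return year, int(temp)

def reducer(pairs):
    """Keep the maximum temperature seen for each year."""
    maxima = defaultdict(lambda: float("-inf"))
    for year, temp in pairs:
        maxima[year] = max(maxima[year], temp)
    return dict(maxima)

# Made-up sample records in the same "year temperature" shape.
lines = ["1900 12", "1900 36", "1901 48", "1901 22", "1902 49"]
print(reducer(map(mapper, lines)))  # {'1900': 36, '1901': 48, '1902': 49}
```

In a real Hadoop job the framework shuffles the mapper's pairs so each reducer receives all temperatures for one year; the dictionary here plays that grouping role.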
Fork on Pig_Concepts
Fork on AdvanceMapreduce
Edureka Certified Big Data and Hadoop Developer Certification Path
Edureka Certified Data Warehousing Professional