Tuesday 11 April 2017

Big Data Introduction

Below are the notes I took while taking the course "Big Data Foundations - Level 1" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference, and they may be helpful to others.


Big Data Adoption

State:
Data coming from:
transactions or log data
Clinical, Mobile cam - 70% unstructured
Educate>Explore>Engage>Execute

Source of Big Data:
Good Data scientist:
picks the right problem for the organization rather than just solving a problem


Data is the new oil: a natural resource that keeps growing.

Sensors are an important contributor to data.
4 jet engines: 1 PB of data on a flight from LON to SIN
CERN: 1 GB per sec
Radio telescope: 20,000 PB/day
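
To put the rates above on a common scale, here is a small Python sketch that converts between per-second and per-day figures. It assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and takes the numbers from the notes as given; the helper names are my own.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Decimal (SI) byte units; this is an assumption, the course did not specify.
GB = 10**9
TB = 10**12
PB = 10**15


def per_day(bytes_per_second):
    """Scale a per-second rate up to a per-day volume."""
    return bytes_per_second * SECONDS_PER_DAY


def per_second(bytes_per_day):
    """Scale a per-day volume down to a per-second rate."""
    return bytes_per_day / SECONDS_PER_DAY


cern_tb_per_day = per_day(1 * GB) / TB
telescope_tb_per_sec = per_second(20_000 * PB) / TB

print(f"CERN: {cern_tb_per_day:.1f} TB/day")
print(f"Radio telescope: {telescope_tb_per_sec:.1f} TB/sec")
```

So 1 GB per sec works out to roughly 86 TB per day, and 20,000 PB per day to roughly 231 TB per sec.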


Big data with examples:
-----------------------------
Holistic approach:
What to achieve?
Collect traffic data to avoid congestion. To find best time and method to travel.

Intelligent: handles data streams
Instrumented: gathers data from different sources
Interconnected: can handle both structured and unstructured data

Case1: 
Political debate:
Predict public sentiment.
Process>filter>analysis
Connected to the Twitter firehose (100% of tweets) vs the public API (1%)
Analyze each tweet and aggregate
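
The filter, per-tweet analysis and aggregation steps above can be sketched in a few lines of Python. This is only an illustration, not what the course project used: the word lists, the `#debate` keyword and the sample tweets are all made up, and a real system would score tweets with a trained NLP model instead of counting words.

```python
from collections import Counter

# Hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "strong", "agree"}
NEGATIVE = {"bad", "weak", "wrong", "disagree"}


def is_relevant(tweet, keywords):
    """Filter step: keep only tweets that mention one of the keywords."""
    text = tweet.lower()
    return any(k in text for k in keywords)


def score(tweet):
    """Analysis step: label one tweet by counting positive vs negative words."""
    words = [w.strip(".,!?#@") for w in tweet.lower().split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def aggregate(tweets, keywords):
    """Aggregation step: count the labels across all relevant tweets."""
    counts = Counter()
    for tweet in tweets:
        if is_relevant(tweet, keywords):
            counts[score(tweet)] += 1
    return counts


sample = [
    "Great answer from candidate A in the #debate!",
    "Candidate A was wrong and weak on taxes #debate",
    "Lunch was good today",
    "Watching the #debate now",
]
result = aggregate(sample, keywords={"#debate"})
print(dict(result))
```

The third sample tweet is dropped by the filter, so only three tweets reach the scoring step.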

NLP:
resources from the Penn Treebank project, Stanford,
Carnegie Mellon Univ

public opinion vs political analysts



Characteristics of BigData:


What and How?
new insight from untouched data
New tools to do more analysis

Four Vs
Volume, Velocity, Variety, Veracity

BigData is not only Hadoop.

Governance:
Can we trust the source?

NoSQL - Not Only SQL
NoHadoop - Not Only Hadoop

Big Data is a platform, not a single piece of software.

Data Warehouse > provides OLAP (Online Analytical Processing)
Stream Computing > Real-time analytic processing (RTAP)
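
A rough way to picture the difference between the two, as a Python sketch. It is my own illustration with made-up sensor readings, not how any warehouse or stream product works internally: the warehouse style answers a question over data that is already stored, while the stream style updates the answer as each value arrives.

```python
from statistics import mean

readings = [12.0, 15.0, 11.0, 18.0]  # made-up sensor values

# Warehouse (OLAP) style: all the data is stored first, then queried in one go.
batch_avg = mean(readings)


class RunningAverage:
    """Stream (RTAP) style: keep a small state and update it per event."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


stream = RunningAverage()
# Each incoming value immediately produces a fresh average.
live = [stream.update(v) for v in readings]

print(batch_avg, live)
```

Both approaches end at the same average once all four values are in, but the streaming version had a usable answer after every single reading.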

Hadoop - Software for structured and unstructured data

Accelerators are software libraries that sit on top of the data warehouse, Hadoop and stream engines


The BigData platform


Key aspect of BigData platform:
----------------------------------------
1 Integration : It is more than just one technology
2 Analytics
3 Visualization : Tools
4 Development tools for the engines
5 Optimization
6 Security and Governance

RTAP (real-time analytic processing) & HDFS (long-term storage) store the results in the warehouse zone

Governance for BigData
-------------------------------
Growing variety and volume make the data difficult to manage

InfoSphere: IBM product

Analytics must be driven by trusted data

Different data requires different types of governance

True insight requires confidence in data.

Fourth V - Veracity means truthfulness and confidence


High-value Big Data Use Cases


Sweet spots
1. Big data Exploration
2. Enhance CRM - for cross sell and up sell
3. Security/Intelligence extension
   * improve intelligence & law enforcement
   * Find pattern
4. Operations Analysis
   * analyse large volumes of multi-structured data in motion; integrating it with existing enterprise data needs a large amount of analysis
5. Data warehouse augmentation
   * Builds on top of the existing data warehouse to leverage big data
   * Hadoop as a source for data warehouse

Examples:
1. Airbus:
Data Explorer is the starting point

Multiple Hypothesis tracking (MHT)

2. TerraEchos
Streams and Hadoop are the starting point

3. Cisco
Intelligent infrastructure monitoring
Log analytics
Energy bill forecasting


Technical Details of Big Data Component


Sentiment from social media, machine logs, call center logs, email, financial services, etc.

AQL - Annotation Query Language (an AQL program tells what needs to be done rather than how it needs to be done)
-->
Text Analytics Optimizer (compiles the AQL, optimizes it and generates an execution plan)
-->
Text Analytics Runtime (analyses the stream or document)


Tuples go to operators for execution
Filter->Transform->Annotate
Correlate (Join from multiple source) ->Classify (training data)
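
The operator chain above can be sketched with Python generators, where each operator takes a stream of tuples and yields a new one. This is a toy illustration under my own assumptions, not the InfoSphere Streams API: the operator names, the log events and the `owners` lookup are all hypothetical, tuples are plain dicts, and the Classify step is left out because it needs training data.

```python
def filter_op(tuples, predicate):
    """Filter: pass along only the tuples that satisfy the predicate."""
    for t in tuples:
        if predicate(t):
            yield t


def transform_op(tuples, fn):
    """Transform: replace each tuple with a modified copy."""
    for t in tuples:
        yield fn(t)


def annotate_op(tuples, key, fn):
    """Annotate: add one derived field to each tuple."""
    for t in tuples:
        yield {**t, key: fn(t)}


def correlate_op(tuples, lookup, on):
    """Correlate: join each tuple with a second source on a shared key."""
    for t in tuples:
        match = lookup.get(t[on])
        if match is not None:
            yield {**t, **match}


# Made-up input stream and a second source to join against.
events = [
    {"host": "web1", "msg": " ERROR disk full "},
    {"host": "web2", "msg": "INFO ok"},
    {"host": "db1", "msg": "ERROR timeout"},
]
owners = {"web1": {"team": "frontend"}, "db1": {"team": "data"}}

pipeline = filter_op(events, lambda t: "ERROR" in t["msg"])
pipeline = transform_op(pipeline, lambda t: {**t, "msg": t["msg"].strip()})
pipeline = annotate_op(
    pipeline, "severity", lambda t: "high" if "disk" in t["msg"] else "medium"
)
pipeline = correlate_op(pipeline, owners, on="host")

results = list(pipeline)
for row in results:
    print(row)
```

Because the operators are generators, nothing is processed until the final `list(...)` pulls tuples through the chain one at a time, which is close in spirit to how tuples flow between stream operators.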

Stream software:
----------------------
IBM InfoSphere Streams
Storm - Twitter
S4 - Yahoo
Apache Spark
Samza - LinkedIn
Kinesis - Amazon

IBM Hadoop - BigInsights

SQL for Hadoop (ANSI 92 support)
* Hive
* Impala (Cloudera)
* Big SQL (IBM)
* Stinger (Hortonworks)
* Drill (MapR)
* HAWQ (Pivotal)
* SQL-H (Teradata)

Improvements:
In Multimedia Analysis
