Below are the notes I took during the course "Big Data Foundations - Level 1" at https://bigdatauniversity.com. I am keeping them on the blog for my own reference, and they may be helpful to others.
Big Data Adoption State:
Data coming from:
transactions or log data
Clinical, Mobile cam - 70% unstructured
Educate>Explore>Engage>Execute
Good Data scientist:
picks the right problem for the org rather than just solving a problem
Source of Big Data:
Data is the new oil; it is a natural resource that keeps growing.
Sensors are an important contributor to data.
4 engines: 1 PB of data on a flight from LON to SIN
CERN: 1 GB per sec
Radio telescope: 20,000 PB/day
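To compare the CERN and radio telescope figures, it helps to put them in the same unit. A rough sketch of the arithmetic, assuming decimal units (1 PB = 1,000,000 GB); the function names are my own and not from the course:

```python
# Back-of-the-envelope conversion between GB/sec and PB/day.
# Assumption: decimal units, i.e. 1 PB = 1,000,000 GB.
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_PB = 1_000_000


def gb_per_sec_to_pb_per_day(gb_per_sec):
    """Convert a rate in GB/sec to PB/day."""
    return gb_per_sec * SECONDS_PER_DAY / GB_PER_PB


def pb_per_day_to_gb_per_sec(pb_per_day):
    """Convert a rate in PB/day to GB/sec."""
    return pb_per_day * GB_PER_PB / SECONDS_PER_DAY


# CERN figure from the notes: 1 GB per sec
cern_pb_per_day = gb_per_sec_to_pb_per_day(1)

# Radio telescope figure from the notes: 20,000 PB/day
telescope_gb_per_sec = pb_per_day_to_gb_per_sec(20_000)

print(f"CERN: {cern_pb_per_day:.4f} PB/day")
print(f"Radio telescope: {telescope_gb_per_sec:,.0f} GB/sec")
```

So, under this assumption, 1 GB per sec is a bit under a tenth of a PB per day, and the telescope figure is several orders of magnitude above the CERN one.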
Big data with examples:
-----------------------------
Holistic approach:
What to achieve?
Collect traffic data to avoid congestion and to find the best time and method to travel.
Intelligent: handles data streams
Instrumented: gathers data from different sources
Interconnected: can handle both structured and unstructured data
Case1:
Political debate:
Predict public sentiment.
Process>filter>analysis
Connected to the Twitter firehose (100% of tweets) vs the public API (1%)
Analyze each tweet and aggregate
NLP:
projects: Penn Treebank, Stanford,
Carnegie Mellon Univ
public opinion vs political analysts
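The process > filter > analysis flow for the debate case could be sketched roughly as below. This is only a toy illustration: the word lists, the keyword filter and the sample tweets are made up by me, and a real system would use a trained NLP model instead of counting words.

```python
from collections import Counter

# Hypothetical toy lexicon, standing in for a real NLP sentiment model.
POSITIVE = {"good", "great", "strong", "win"}
NEGATIVE = {"bad", "weak", "wrong", "lose"}


def is_relevant(tweet, keywords):
    """Filter step: keep only tweets that mention one of the keywords."""
    words = set(tweet.lower().split())
    return bool(words & keywords)


def score(tweet):
    """Analysis step: label one tweet by counting lexicon words."""
    words = tweet.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def aggregate(tweets, keywords):
    """Analyze each relevant tweet and aggregate the labels."""
    counts = Counter()
    for tweet in tweets:
        if is_relevant(tweet, keywords):
            counts[score(tweet)] += 1
    return counts


# Made-up sample standing in for the tweet stream.
sample = [
    "Candidate A gave a great answer in the debate",
    "That debate answer was weak and wrong",
    "Nice weather today",
    "The debate starts at nine",
]
result = aggregate(sample, {"debate"})
print(dict(result))
```

The aggregated counts are what would be compared against the political analysts' view.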
Characteristics of BigData:
What and How?
new insight from untouched data
New tools to do more analysis
Four Vs
Volume, Velocity, Variety, Veracity
BigData is not only Hadoop.
Governance:
Can we trust the source?
NoSQL - Not Only SQL
NoHadoop - Not Only Hadoop
Big Data is a platform, not a single piece of software.
Data Warehouse > provides OLAP (Online Analytical Processing)
Stream Computing > Real-time analytic processing (RTAP)
Hadoop - Software for structured and unstructured data
Accelerators are software libraries that sit above the Data Warehouse, Hadoop and Stream engines
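One way to picture the difference between warehouse-style analysis (data at rest) and stream computing (data in motion) is a plain average versus a moving average. A minimal sketch, assuming a made-up list of readings and a window of 3; neither comes from the course:

```python
from collections import deque


def batch_average(readings):
    """Warehouse style: all the data is already stored, aggregate once."""
    return sum(readings) / len(readings)


def streaming_average(readings, window=3):
    """Stream style: emit an updated average as each reading arrives,
    looking only at the most recent `window` values."""
    recent = deque(maxlen=window)
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)


readings = [10, 20, 30, 40]  # hypothetical sensor readings

at_rest = batch_average(readings)
in_motion = list(streaming_average(readings))

print(at_rest)    # 25.0
print(in_motion)  # [10.0, 15.0, 20.0, 30.0]
```

The batch version needs the full data set before it can answer, while the streaming version gives an answer after every reading and never keeps more than the window in memory.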
The BigData platform
Key aspects of the BigData platform:
----------------------------------------
1 Integration : it is more than just one technology
2 Analytics
3 Visualization : Tools
4 Development tools for engine
5 Optimization
6 Security and Governance
RTAP (Realtime analytic processing) & HDFS (long time storage) stores the result in warehouse zone
Governance for BigData
-------------------------------
Growing variety and volume makes it difficult to manage
Infosphere: IBM product
Analytics must be driven by trusted data
Different data requires different types of governance
True insight requires confidence in data.
Fourth V - Veracity means truthfulness and confidence
High-value Big Data Use Cases
Sweet spots
1. Big data Exploration
2. Enhance CRM - for cross sell and up sell
3. Security/Intelligence extension
* improve intelligence & law enforcement
* Find pattern
4. Operations Analysis
* analyse large volumes of multi-structured data in motion; integrating it with existing enterprise data requires a large amount of analysis
5. Data warehouse augmentation
* Builds on top of existing datawarehouse to leverage big data
* Hadoop as a source for data warehouse
Examples:
1. Airbus:
Data Explorer is the starting point
Multiple Hypothesis tracking (MHT)
2. TerraEchos
Streams and Hadoop are the starting point
3. Cisco
Intelligent infrastructure monitoring
log analytics
Energy bill forecasting
Technical Details of Big Data Components
Sentiment from social media, machine logs, call center logs, email, financial services etc
AQL - Annotation Query Language (an AQL program tells what needs to be done rather than how it needs to be done)
-->
Text Analytics Optimizer (compiles the AQL, optimizes it and generates an execution plan)
-->
Text Analytics Runtime (analyses the stream or document)
Tuples go to operators for execution
Filter->Transform->Annotate
Correlate (join from multiple sources) ->Classify (using training data)
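The operator chain above can be pictured as tuples flowing through a series of small functions. The sketch below is plain Python generators, not IBM InfoSphere Streams code; the tuple layout, the sample events and the customer lookup are all hypothetical, and the classify step uses a fixed rule where a real one would be trained on data.

```python
# Each tuple is (source, customer_id, text). All sample data is made up.

def filter_op(tuples):
    """Filter: drop tuples with empty text."""
    for t in tuples:
        if t[2].strip():
            yield t


def transform_op(tuples):
    """Transform: normalise the text to lower case."""
    for source, cid, text in tuples:
        yield (source, cid, text.lower())


def annotate_op(tuples):
    """Annotate: attach a tag extracted from the text."""
    for source, cid, text in tuples:
        tag = "complaint" if "refund" in text else "other"
        yield (source, cid, text, tag)


def correlate_op(tuples, customers):
    """Correlate: join with a second source (a customer lookup)."""
    for source, cid, text, tag in tuples:
        if cid in customers:
            yield (customers[cid], source, tag)


def classify_op(tuples):
    """Classify: fixed rule standing in for a trained model."""
    for name, source, tag in tuples:
        label = "at risk" if tag == "complaint" else "ok"
        yield (name, label)


events = [
    ("email", 1, "Please REFUND my order"),
    ("call", 2, "   "),
    ("chat", 3, "Thanks for the help"),
]
customers = {1: "Asha", 3: "Ben"}

pipeline = classify_op(
    correlate_op(annotate_op(transform_op(filter_op(events))), customers)
)
results = list(pipeline)
print(results)  # [('Asha', 'at risk'), ('Ben', 'ok')]
```

Because each operator is a generator, a tuple moves through the whole chain as soon as it arrives, which is the same idea as tuples going to operators for execution.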
Stream software:
----------------------
IBM InfoSphere Streams
Storm - Twitter
S4 - Yahoo
Apache Spark
Samza - LinkedIn
Kinesis - Amazon
IBM Hadoop - BigInsights
SQL for Hadoop (ANSI 92 support)
* Hive
* Impala (Cloudera)
* Big SQL (IBM)
* Stinger (Hortonworks)
* Drill (MapR)
* HAWQ (Pivotal)
* SQL-H (Teradata)
Improvements:
In Multimedia Analysis