Posts

HDFS

What is HDFS? The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN.

Applications that run on HDFS need streaming access to their data sets. HDFS is designed for batch processing more than for interactive use. The emphasis is on high throughput of data access rather than low latency.

HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it need not be changed except by appends and truncates. Content can be appended to the end of a file, but a file cannot be updated at an arbitrary offset. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler fits this model well.
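To make the access model above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the HDFS client API: the class name `WriteOnceFile` and all of its methods are hypothetical and exist only to illustrate which operations the model allows (sequential write, append, truncate) and which it rejects (updates at an arbitrary offset).

```python
class WriteOnceFile:
    """Toy in-memory model of a write-once-read-many file (NOT the HDFS API)."""

    def __init__(self):
        self._data = bytearray()
        self._closed = False

    def write(self, chunk: bytes) -> None:
        # Sequential write, only allowed while the file is first being created.
        if self._closed:
            raise PermissionError("file is closed; only append/truncate allowed")
        self._data.extend(chunk)

    def close(self) -> None:
        self._closed = True

    def append(self, chunk: bytes) -> None:
        # Appending to the end of a closed file is allowed.
        if not self._closed:
            raise PermissionError("close the file before appending")
        self._data.extend(chunk)

    def truncate(self, new_length: int) -> None:
        # Truncation may only shorten the file.
        if not 0 <= new_length <= len(self._data):
            raise ValueError("new_length must be between 0 and the current size")
        del self._data[new_length:]

    def write_at(self, offset: int, chunk: bytes) -> None:
        # Updates at an arbitrary offset are never allowed in this model.
        raise PermissionError("in-place updates are not supported")

    def read(self) -> bytes:
        return bytes(self._data)


f = WriteOnceFile()
f.write(b"hello")
f.close()
f.append(b" world")      # allowed: append to the end
f.truncate(5)            # allowed: shrink the file
try:
    f.write_at(0, b"J")  # rejected: arbitrary-offset update
except PermissionError as err:
    print("rejected:", err)
```

Because no reader can ever observe a byte changing underneath it, a model like this needs no coordination for in-place updates, which is the simplification the paragraph above refers to.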

Moving Computation is Cheape…

Hive

What is Hive? The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Hive is built on top of Apache Hadoop and provides:
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
- Query execution via Apache Tez™, Apache Spark™, or MapReduce.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. 
There is not a single "Hive format" in which data must be stored. Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats. Users can extend Hive with connectors for other formats.

Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.

-- https://cwiki.apache.org/confluence/display/Hive/Home
Architecture


As shown in that figure, the main compo…

Agile and Scrum

Lately I have been taking Agile Software Development on Coursera and applying the framework at work. Below are my learning notes from the course, along with my thoughts based on personal experience.


LEARNING NOTES 
Agile Manifesto
- 4 values
  * Individuals and interactions over processes and tools
  * Working software over comprehensive documentation
  * Customer collaboration over contract negotiation
  * Responding to change over following a plan
- 12 principles
User Story
- Characteristics of good user stories:
  Independent, Negotiable, Valuable, Estimable, Small, Testable
- Generate user stories
  Story map
Scrum
- Sprint planning
- Sprint execution and daily standup
  Answer 3 questions during stand-up:
  What did I do yesterday? What am I going to do today? Any blockers?
- Sprint review
  * Review work done
  * Get feedback
  * Celebrate
- Sprint retrospective
  * What’s working and not working?
  * Action items?


THOUGHTS
In my opinion, the Scrum framework is a bit micro…

Presto

What is Presto?
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes...
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
-- https://prestodb.io/
Architecture
Presto system components:
Coordinator: responsible for parsing statements, planning queries, and managing Presto worker nodes.
Worker: responsible for executing tasks and processing data.
Connectors: Storage plugins are called connectors. Hive, HBase, MySQL, Cassandra, and many other systems have connectors, and you can also implement a custom one.
SQL Statements Cheat Sheet
1. Join: SELECT * FROM table1 JOIN table2 ON table1.a = table2.b
2. Inner Join vs. Outer Join: https://www.diffen.com/difference/Inner_Join_vs_Outer_Join
3. If-else case S…
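The join in item 1 and the inner/outer distinction in item 2 can be tried out locally. The sketch below uses Python's built-in `sqlite3` module rather than Presto, so it only illustrates standard SQL join semantics. The `table1`, `table2`, `a`, and `b` names come from the cheat sheet; the `name` and `city` columns and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a real data source (assumption:
# this is NOT Presto; only the SQL join semantics are being illustrated).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (a INTEGER, name TEXT);
CREATE TABLE table2 (b INTEGER, city TEXT);
INSERT INTO table1 VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
INSERT INTO table2 VALUES (1, 'Oslo'), (2, 'Rome');
""")

# Inner join: only rows with a match in both tables are returned.
inner_rows = conn.execute(
    "SELECT table1.a, name, city FROM table1 "
    "JOIN table2 ON table1.a = table2.b ORDER BY table1.a"
).fetchall()

# Left outer join: every row of table1 is kept; missing matches become NULL.
outer_rows = conn.execute(
    "SELECT table1.a, name, city FROM table1 "
    "LEFT OUTER JOIN table2 ON table1.a = table2.b ORDER BY table1.a"
).fetchall()

print(inner_rows)  # [(1, 'ann', 'Oslo'), (2, 'bob', 'Rome')]
print(outer_rows)  # [(1, 'ann', 'Oslo'), (2, 'bob', 'Rome'), (3, 'cat', None)]
```

The row with `a = 3` has no partner in `table2`, so it disappears from the inner join and shows up with `None` (SQL `NULL`) in the outer join.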

Blockchain - Part 3 - Cryptocurrency

Blockchain - Part 2 - Ethereum

Blockchain - Part 1 - Bitcoin