

Showing posts from August, 2018


What is Hive?
The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Hive is built on top of Apache Hadoop and provides:
- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
- Query execution via Apache Tez™, Apache Spark™, or MapReduce.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. 
There is no single "Hive format" in which data must be stored. Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats. Users can extend Hive with connectors for other formats.

Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.


As shown in that figure, the main compo…

Agile and Scrum

Lately I have been taking the Agile Software Development course on Coursera and applying the framework at work. Below are my notes from the course, along with my thoughts based on personal experience.

Agile Manifesto
- 4 values
  * Individuals and interactions over processes and tools
  * Working software over comprehensive documentation
  * Customer collaboration over contract negotiation
  * Responding to change over following a plan
- 12 principles

User Story
- Characteristics of good user stories: Independent, Negotiable, Valuable, Estimable, Small, Testable
- Generating user stories: story map

Scrum
- Sprint planning
- Sprint execution and daily stand-up
  Answer 3 questions during stand-up:
  * What did I do yesterday?
  * What am I going to do today?
  * Any blockers?
- Sprint review
  * Review work done
  * Get feedback
  * Celebrate
- Sprint retrospective
  * What's working and not working?
  * Action items?

THOUGHTS In my opinion, the Scrum framework is a bit micro…


What is Presto?
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes...
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Presto system components:
Coordinator: responsible for parsing statements, planning queries, and managing Presto worker nodes.
Worker: responsible for executing tasks and processing data.
Connectors: storage plugins are called connectors. Hive, HBase, MySQL, Cassandra, and many more are available as connectors, and you can also implement a custom one.
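To make the coordinator/worker split concrete, here is a toy model in Python. It is only an illustration of the division of roles described above, not how Presto is implemented: the `coordinator` and `worker` functions, the thread pool, and the sum-of-rows "query" are all assumptions made for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(split):
    """Worker role: execute one task over its share of the data."""
    return sum(split)


def coordinator(rows, n_workers=3):
    """Coordinator role: plan the work, hand tasks to workers, merge results."""
    # "Planning": cut the input into one split per worker (hypothetical scheme).
    splits = [rows[i::n_workers] for i in range(n_workers)]
    # "Managing workers": run each split as a separate task.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, splits))
    # Merge the partial results into the final answer.
    return sum(partials)


# A stand-in for "SELECT sum(x) FROM t" over the values 1..100.
total = coordinator(list(range(1, 101)))
print(total)
```

The point of the sketch is only that planning and merging live in one place while the data processing is spread across workers.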
SQL Statements Cheat Sheet
1. Join: SELECT * FROM table1 JOIN table2 ON table1.a = table2.b
2. Inner Join vs. Outer Join
3. If-else case S…
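Items 1 and 2 of the cheat sheet can be tried out locally. The sketch below uses SQLite through Python's built-in `sqlite3` module as a stand-in for Presto, so it shows standard SQL join behavior rather than anything Presto-specific. The table contents and the extra `name` and `city` columns are made up for illustration.

```python
import sqlite3

# Hypothetical sample data; table and key names follow the cheat-sheet example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, name TEXT);
    CREATE TABLE table2 (b INTEGER, city TEXT);
    INSERT INTO table1 VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO table2 VALUES (1, 'Oslo'), (2, 'Rome');
""")

# Inner join (plain JOIN): only rows with a match on both sides.
inner_rows = conn.execute(
    "SELECT table1.a, name, city FROM table1 "
    "JOIN table2 ON table1.a = table2.b ORDER BY table1.a"
).fetchall()

# Left outer join: every row of table1; unmatched columns come back as NULL.
left_rows = conn.execute(
    "SELECT table1.a, name, city FROM table1 "
    "LEFT OUTER JOIN table2 ON table1.a = table2.b ORDER BY table1.a"
).fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The row with `a = 3` has no partner in `table2`, so it is absent from the inner join and appears in the left outer join with `None` in the `city` column.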