
Kafka Topic Storage

In Kafka, a topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

The only metadata retained on a per-consumer basis is the offset or position of that consumer in the log.
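The partition-and-offset model described above can be sketched as a toy in-memory log in Python (this is an illustration of the concept, not the Kafka client API; the `Partition` class and names are made up):

```python
class Partition:
    """Toy model of a Kafka partition: an ordered, immutable, append-only log."""

    def __init__(self):
        self._log = []

    def append(self, record):
        """Append a record; its offset is simply its position in the log."""
        self._log.append(record)
        return len(self._log) - 1  # the offset assigned to this record

    def read(self, offset):
        """Any consumer may read from any offset; existing records never change."""
        return self._log[offset:]


p = Partition()
for r in ["a", "b", "c"]:
    p.append(r)

# The only per-consumer state is the consumer's own position in the log.
consumer_offset = 1                 # this consumer already processed offset 0
records = p.read(consumer_offset)   # -> records at offsets 1 and 2
consumer_offset += len(records)     # advance the position after processing
```

Because the log is immutable and readers track their own offsets, many consumers can subscribe to the same partition independently without coordinating with each other.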


What is HDFS? The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN.

Applications that run on HDFS need streaming access to their data sets. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends and truncates. Appending content to the end of a file is supported, but files cannot be updated at an arbitrary point. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits perfectly with this model.
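The write-once-read-many model can be illustrated with ordinary local file operations (a local-filesystem analogy, not the HDFS API; the file name is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.log")

# Write once: create the file, write it, close it.
with open(path, "w") as f:
    f.write("event-1\n")

# Appending to the end of the file is allowed...
with open(path, "a") as f:
    f.write("event-2\n")

# ...but in the HDFS model there is no seek-and-overwrite of existing
# content, so every reader sees a consistent, append-only history.
with open(path) as f:
    lines = f.read().splitlines()
```

Restricting mutation to appends is exactly what lets HDFS serve many concurrent readers at high throughput without complex coherency protocols.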

Moving Computation is Cheape…


What is Hive? The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Hive is built on top of Apache Hadoop and provides:
  * Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
  * Query execution via Apache Tez™, Apache Spark™, or MapReduce.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. 
There is not a single "Hive format" in which data must be stored. Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats. Users can extend Hive with connectors for other formats.

Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.
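As a rough illustration of the standard-SQL analytics Hive is built for, here is a typical warehousing query run against SQLite (the `sales` table and its data are made up; in Hive the same statement would execute over files in distributed storage via Tez, Spark, or MapReduce):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5)],
)

# A typical analytics query: aggregate, filter the groups, order the results.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 6
    ORDER BY total DESC
""").fetchall()
```

This batch-style aggregation over large scans is the sweet spot for Hive; the row-at-a-time updates of an OLTP system are not.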


As shown in that figure, the main compo…

Agile and Scrum

Recently I took Agile Software Development on Coursera and have been applying the framework at work. Below are my learning notes from the course, along with my own thoughts based on personal experience.

Agile Manifesto
- 4 values
  * Individuals and interactions over processes and tools
  * Working software over comprehensive documentation
  * Customer collaboration over contract negotiation
  * Responding to change over following a plan
- 12 Principles

User Story
- Characteristics of good user stories (INVEST): Independent, Negotiable, Valuable, Estimable, Small, Testable
- Generating user stories: story map

Scrum
- Sprint planning
- Sprint execution and daily standup
  Answer 3 questions during stand-up: What did I do yesterday? What am I going to do today? Any blockers?
- Sprint review
  * Review work done
  * Get feedback
  * Celebrate
- Sprint retrospective
  * What's working and not working?
  * Action items?

THOUGHTS In my opinion, the Scrum framework is a bit micro…


What is Presto?
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes...
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Presto system components:
Coordinator: responsible for parsing statements, planning queries, and managing Presto worker nodes.
Worker: responsible for executing tasks and processing data.
Connectors: Storage plugins are called connectors. Presto ships with connectors for Hive, HBase, MySQL, Cassandra, and many more; you can also implement a custom one.
SQL Statements Cheat Sheet
1. Join: SELECT * FROM table1 JOIN table2 ON table1.a = table2.b
2. Inner Join vs. Outer Join
3. If-else case S…
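The inner-vs-outer join distinction in the cheat sheet above can be demonstrated with SQLite (the toy tables `t1`/`t2` and their values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (a INTEGER);
    CREATE TABLE t2 (b INTEGER);
    INSERT INTO t1 VALUES (1), (2);
    INSERT INTO t2 VALUES (2), (3);
""")

# Inner join: only rows that have a match on BOTH sides survive.
inner = conn.execute(
    "SELECT a, b FROM t1 JOIN t2 ON t1.a = t2.b ORDER BY a"
).fetchall()

# Left outer join: every row of t1 is kept, with NULL where t2 has no match.
outer = conn.execute(
    "SELECT a, b FROM t1 LEFT JOIN t2 ON t1.a = t2.b ORDER BY a"
).fetchall()
```

Here the inner join drops the unmatched row `a = 1`, while the left outer join keeps it and pads the missing side with NULL.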

Blockchain - Part 3 - Cryptocurrency

Blockchain - Part 2 - Ethereum