<h1>Stream Data Processing Approaches</h1>
<p>There are different approaches which stream processing applications take to handle the processing and reprocessing of messages. Depending on the requirements of the solution, an architect or developer can choose one of the approaches below.</p>
<h3 id="at-least-once">At least once:</h3>
<ul>
<li>Each message is guaranteed to be processed</li>
<li>Message may get processed more than once</li>
<li>This guarantees no data loss, but can result in duplicate records passing through the system.</li>
</ul>
<h3 id="at-most-once">At most once:</h3>
<ul>
<li>Each message may or may not be processed</li>
<li>If a message is processed, it’s only processed once.</li>
<li>This can lead to missing data.</li>
</ul>
<h3 id="exactly-once">Exactly once:</h3>
<ul>
<li>Each message is guaranteed to be processed once and only once</li>
<li>An example is credit card transaction processing: if we process a message multiple times, we charge the card multiple times, and if we drop a message, a payment is never processed.</li>
</ul>
<h1>DAMA Framework</h1>
<p>In this post I’ll cover what the DAMA framework is, what its different pillars are, and how it can be used to implement a data strategy.</p>
<p>The term Data Management refers to the development, implementation, and supervision of policies, programs, and practices that deliver, control, protect, and improve the value of data and information assets.</p>
<p>According to the DAMA framework, there are 11 knowledge areas, or pillars, of data management. We’ll look at each of them below.</p>
<h2 id="1-data-governance">1. Data Governance</h2>
<ul>
<li>This pillar provides direction and oversight for data management by establishing a system of decision rights over data that accounts for the needs of the enterprise.</li>
<li>This pillar focuses on the vision, strategy and target operating model which enable the other 10 areas.</li>
<li>Think of it like the foundation of a building: poor data governance leads to failed or weak data management projects.</li>
</ul>
<h2 id="2-data-architecture">2. Data Architecture</h2>
<ul>
<li>This pillar defines the blueprint for managing data assets by aligning with organizational strategy to establish strategic data requirements and designs to meet these requirements.</li>
<li>This pillar focuses on Enterprise data models, tool standards, and system naming conventions</li>
</ul>
<h2 id="3-data-modeling-and-design">3. Data Modeling and Design</h2>
<ul>
<li>This is the process of discovering, analyzing, representing, and communicating data requirements in a precise form called the data model</li>
<li>This pillar focuses on data model management procedures, data modeling naming conventions, definition standards, standard domains, and standard abbreviations</li>
</ul>
<h2 id="4-data-storage-and-operations">4. Data Storage and Operations</h2>
<ul>
<li>This pillar includes the design, implementation, and support of stored data to maximize its value. Operations provide support throughout the data lifecycle, from planning for data through its disposal.</li>
<li>This pillar focuses on tool standards, standards for database recovery and business continuity, database performance, data retention, and external data acquisition.</li>
</ul>
<h2 id="5-data-security">5. Data Security</h2>
<ul>
<li>This pillar ensures that data privacy and confidentiality are maintained, that data is not breached, and that data is accessed appropriately.</li>
<li>This pillar focuses on data access security standards, monitoring and audit procedures, storage security standards, and training requirements.</li>
</ul>
<h2 id="6-data-integration-and-interoperability">6. Data Integration and Interoperability</h2>
<ul>
<li>This pillar includes processes related to the movement and consolidation of data within and between data stores, applications, and organizations</li>
<li>This pillar focuses on the standard methods and tools used for data integration and interoperability.</li>
</ul>
<h2 id="7-document-and-content-management">7. Document and Content Management</h2>
<ul>
<li>This pillar covers planning, implementation, and control activities used to manage the lifecycle of data and information found in a range of unstructured media, especially documents needed to support legal and regulatory compliance requirements</li>
<li>This pillar focuses on content management standards and procedures, including use of enterprise taxonomies, support for legal discovery, document and email retention periods, electronic signatures, and report distribution approaches.</li>
</ul>
<h2 id="8-reference-and-master-data">8. Reference and Master Data</h2>
<ul>
<li>This knowledge area covers ongoing reconciliation and maintenance of core critical shared data to enable consistent use across systems of the most accurate, timely, and relevant version of truth about essential business entities.</li>
<li>This pillar focuses on Reference Data Management control procedures, systems of data record, assertions establishing and mandating use, and standards for entity resolution.</li>
</ul>
<h2 id="9-data-warehousing-and-business-intelligence">9. Data Warehousing and Business Intelligence</h2>
<ul>
<li>This includes the planning, implementation, and control processes to manage decision support data and to enable knowledge workers to get value from data via analysis and reporting.</li>
<li>This pillar focuses on tool standards, processing standards and procedures, report and visualization formatting standards, and standards for Big Data handling.</li>
</ul>
<h2 id="10-metadata">10. Metadata</h2>
<ul>
<li>This pillar includes planning, implementation, and control activities to enable access to high-quality, integrated Metadata, including definitions, models, data flows, and other information critical to understanding data and the systems through which it is created, maintained, and accessed.</li>
<li>This pillar focuses on the standard business and technical Metadata to be captured, and on Metadata integration procedures and usage.</li>
</ul>
<h2 id="11-data-quality">11. Data Quality</h2>
<ul>
<li>This pillar covers the planning and implementation of quality management techniques to measure, assess, and improve the fitness of data for use within an organization.</li>
<li>This pillar focuses on data quality rules, standard measurement methodologies, and data remediation standards and procedures.</li>
</ul>
<p>I’ll try to cover each of these pillars in more detail in my coming posts.</p>
<h1>CAP Theorem</h1>
<p>CAP is an acronym that stands for Consistency, Availability and Partition Tolerance. According to the CAP theorem, any distributed system can only guarantee two of the three properties at any point in time. You can’t guarantee all three properties at once.</p>
<h3 id="consistency">Consistency</h3>
<ul>
<li>Consistency is where all nodes in our distributed system see the same data at the same time.</li>
<li>A read is guaranteed to return the most recent write for a given client.</li>
<li>This is achieved by updating multiple nodes before any reads are allowed.</li>
<li>When data is written to a single node, it is then replicated across the other nodes in the system.</li>
</ul>
<h3 id="availability">Availability</h3>
<ul>
<li>Availability means that every request gets a proper response, even if nodes have failed.</li>
<li>A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).</li>
<li>Every request will get a response regardless of the individual state of the nodes.</li>
<li>This is accomplished by replicating data across servers.</li>
</ul>
<h3 id="partition-tolerance">Partition Tolerance</h3>
<ul>
<li>The system will continue to function when network partitions occur.</li>
<li>In CAP theorem, a “partition” is a break in communication between two nodes.</li>
<li>
<p>If a partition occurs between a pair of nodes, say, in master-master replication, then there are two options:</p>
<ul>
<li>Mark these nodes as being down, meaning that they are no longer available</li>
<li>Allow the nodes to become out of sync, which means that we have given up consistency</li>
</ul>
</li>
</ul>
<h2 id="solutions">Solutions</h2>
<h3 id="consistency-and-availability-ca">Consistency and Availability (CA)</h3>
<ul>
<li>This one is problematic.</li>
<li>Many claim that systems that are both consistent and available are not possible.
Their reasoning lies in the idea that you do not choose to have partition tolerance; it is something that arises naturally.</li>
<li>For example, you could have a database that is not sharded but keeps an entire replica of the data to retain availability.</li>
<li>When a write comes in, you either choose to accept the write, knowing that the master and the replica will be out of sync, or you choose to refuse the write.
In the former case, you’ve chosen availability, and in the latter, you’ve chosen consistency.</li>
<li>Relational databases such as PostgreSQL use this principle.</li>
</ul>
<h3 id="consistencypartition-tolerance-cp">Consistency/Partition Tolerance (CP)</h3>
<ul>
<li>This method ensures that the data is consistent between all nodes and becomes unavailable in the case of a partition.</li>
<li>HBase, MongoDB and BigTable use this principle.</li>
</ul>
<h3 id="availabilitypartition-tolerance-ap">Availability/Partition Tolerance (AP)</h3>
<ul>
<li>This method ensures that all of the nodes remain available (through replication), and, in the case of a partition, will resync data between the partitioned nodes once the partition has been resolved. However, this means that the data between nodes might not be consistent.</li>
<li>Cassandra and CouchDB use this principle.</li>
</ul>
<h1>How to install protoc 2.5.0 on MacOS</h1>
<p>Recently I faced this issue while building Hadoop on my MacOS machine.
Hadoop trunk 3.0 Snapshot build fails if compiled with a protoc newer than 2.5. While building I got the following error:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.1.0-SNAPSHOT:protoc <span class="o">(</span>compile-protoc<span class="o">)</span> on project hadoop-common: org.apache.maven.plugin.MojoExecutionException: protoc version is <span class="s1">'libprotoc 3.4.0'</span>, expected version is <span class="s1">'2.5.0'</span> -> <span class="o">[</span>Help 1]
<span class="o">[</span>ERROR]
<span class="o">[</span>ERROR] To see the full stack trace of the errors, re-run Maven with the <span class="nt">-e</span> switch.
<span class="o">[</span>ERROR] Re-run Maven using the <span class="nt">-X</span> switch to <span class="nb">enable </span>full debug logging.
</code></pre></div></div>
<p>To fix this, install protoc 2.5.0 on your mac.</p>
<h3 id="steps">Steps:</h3>
<ol>
<li>Build from source.
Download protocol buffers 2.5.0 from <a href="https://github.com/google/protobuf/releases/download">https://github.com/google/protobuf/releases/download</a>.
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.bz2
</code></pre></div> </div>
</li>
<li>Untar the tar.bz2 file
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">tar </span>xfvj protobuf-2.5.0.tar.bz2
</code></pre></div> </div>
</li>
<li>Configure the protobuf build.
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="nb">cd </span>protobuf-2.5.0
./configure <span class="nv">CC</span><span class="o">=</span>clang <span class="nv">CXX</span><span class="o">=</span>clang++ <span class="nv">CXXFLAGS</span><span class="o">=</span><span class="s1">'-std=c++11 -stdlib=libc++ -O3 -g'</span> <span class="nv">LDFLAGS</span><span class="o">=</span><span class="s1">'-stdlib=libc++'</span> <span class="nv">LIBS</span><span class="o">=</span><span class="s2">"-lc++ -lc++abi"</span>
</code></pre></div> </div>
</li>
<li>Build and install the sources. You can use the <code class="language-plaintext highlighter-rouge">--prefix</code> parameter on the configure step to install to a location other than the default <code class="language-plaintext highlighter-rouge">/usr/local/bin</code>:
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make <span class="nt">-j</span> 4
<span class="nb">sudo </span>make <span class="nb">install</span>
</code></pre></div> </div>
<p>You’ll need to unlink the previously installed (newer) version first; a quick check of the active protoc version is shown after these steps.</p>
</li>
</ol>
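<p>Once the build finishes and any newer protoc has been unlinked, you can verify that the expected version is picked up. This is a minimal check; the installed path may differ if you used <code class="language-plaintext highlighter-rouge">--prefix</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ which protoc
/usr/local/bin/protoc
$ protoc --version
libprotoc 2.5.0
</code></pre></div></div>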
<h1>How to install redis on MacOS using Homebrew</h1>
<p>Using Homebrew you can install redis on MacOS. This article will cover how to install and start redis.
Run the following command to install redis:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>brew <span class="nb">install </span>redis
</code></pre></div></div>
<h3 id="get-redis-package-information">Get redis package information:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>brew info redis
</code></pre></div></div>
<h3 id="launch-redis-on-computer-startup">Launch Redis on computer startup:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">ln</span> <span class="nt">-sfv</span> /usr/local/opt/redis/<span class="k">*</span>.plist ~/Library/LaunchAgents
</code></pre></div></div>
<h3 id="start-redis-server-using-launchctl">Start redis server using <code class="language-plaintext highlighter-rouge">launchctl</code>:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>launchctl load ~/Library/LaunchAgents/homebrew.mxcl.redis.plist
</code></pre></div></div>
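<p>Alternatively, newer versions of Homebrew provide a <code class="language-plaintext highlighter-rouge">brew services</code> wrapper that manages the same launchd plist for you. A small sketch (behaviour may vary with your Homebrew version):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ brew services start redis
$ brew services list
$ brew services stop redis
</code></pre></div></div>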
<h3 id="start-redis-server-using-configuration-file">Start redis server using configuration file:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>redis-server /usr/local/etc/redis.conf
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">/usr/local/etc/redis.conf</code> is the location of the redis configuration file. You can pass a different path.</p>
<h3 id="stop-redis-on-auto-startup">Stop redis on auto startup:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.redis.plist
</code></pre></div></div>
<h3 id="to-uninstall-redis">To uninstall redis:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>brew uninstall redis
<span class="nv">$ </span><span class="nb">rm</span> ~/Library/LaunchAgents/homebrew.mxcl.redis.plist
</code></pre></div></div>
<h3 id="test-if-redis-server-is-up-or-not">Test if redis server is up or not:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>redis-cli ping
</code></pre></div></div>
<p>This command should return a <code class="language-plaintext highlighter-rouge">PONG</code> response.</p>
<h1>How to edit hosts file on MacOS</h1>
<p>On MacOS, the hosts file is present at two places, i.e. <code class="language-plaintext highlighter-rouge">/etc/hosts</code> and <code class="language-plaintext highlighter-rouge">/private/etc/hosts</code>. But if you do a detailed listing on the <code class="language-plaintext highlighter-rouge">/etc</code> path, you will notice that it is pointing to the <code class="language-plaintext highlighter-rouge">/private/etc/hosts</code> file.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chetnachaudhari@chetnas-MacBook-Pro:~<span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-lsa</span> /etc
8 lrwxr-xr-x@ 1 root wheel 11 Jan 12 2017 /etc -> private/etc
</code></pre></div></div>
<p>To add a new hosts entry on your machine, edit the <code class="language-plaintext highlighter-rouge">/private/etc/hosts</code> file. The following is a sample of how this file looks:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chetnachaudhari@chetnas-MacBook-Pro:~<span class="nv">$ </span><span class="nb">cat</span> /private/etc/hosts
<span class="c">##</span>
<span class="c"># Host Database</span>
<span class="c">#</span>
<span class="c"># localhost is used to configure the loopback interface</span>
<span class="c"># when the system is booting. Do not change this entry.</span>
<span class="c">##</span>
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
</code></pre></div></div>
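<p>As an example, you can append a custom mapping and then flush the DNS cache so the change takes effect immediately. The hostname and IP below are made up for illustration:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo sh -c 'echo "192.168.1.50   myapp.local" >> /private/etc/hosts'
$ sudo dscacheutil -flushcache
$ sudo killall -HUP mDNSResponder
</code></pre></div></div>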
<h1>How to enable debugfs on a Linux system</h1>
<p>Debugfs is the Debug Filesystem, a RAM-based filesystem which can be used to expose kernel debugging information. This makes kernel space information available in user space.</p>
<h3 id="how-to-enable-debugfs-">How to enable debugfs:</h3>
<p>To enable it for one time only, i.e. the mount will only be available until the next boot of the system:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mount <span class="nt">-t</span> debugfs none /sys/kernel/debug
</code></pre></div></div>
<p>To make the change permanent, add following line to <code class="language-plaintext highlighter-rouge">/etc/fstab</code> file.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>debugfs /sys/kernel/debug debugfs defaults 0 0
</code></pre></div></div>
<p>Once you enable debugfs, you can see multiple directories inside <code class="language-plaintext highlighter-rouge">/sys/kernel/debug</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# <span class="nb">ls</span> /sys/kernel/debug
bdi boot_params dynamic_debug gpio kprobes sched_features usb xen
block dma_buf extfrag hid mce tracing x86
</code></pre></div></div>
<p>These files hold information about kernel subsystems, which helps in debugging. A small example of reading one of these entries is shown below.</p>
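<p>For instance, once debugfs is mounted you can read its entries like regular files. Which entries exist depends on your kernel configuration, so the two below (taken from the directory listing above) are only illustrative:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@sandbox ~]# cat /sys/kernel/debug/sched_features
[root@sandbox ~]# ls /sys/kernel/debug/tracing
</code></pre></div></div>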
<h1>lsblk - List block device information</h1>
<p>Lsblk is a linux utility to list block device information. In this blog post, I’ll cover some useful <code class="language-plaintext highlighter-rouge">lsblk</code> commands.</p>
<h3 id="to-see-list-of-devices-">To see a list of devices:</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 48.8G 0 disk
|-sda1 8:1 0 500M 0 part /boot
<span class="sb">`</span><span class="nt">-sda2</span> 8:2 0 48.3G 0 part
|-vg_sandbox-lv_root <span class="o">(</span>dm-0<span class="o">)</span> 253:0 0 43.5G 0 lvm /
<span class="sb">`</span><span class="nt">-vg_sandbox-lv_swap</span> <span class="o">(</span>dm-1<span class="o">)</span> 253:1 0 4.9G 0 lvm <span class="o">[</span>SWAP]
</code></pre></div></div>
<p>By default <code class="language-plaintext highlighter-rouge">lsblk</code> prints information in a tree view; if you want to see the information in a list view, you can use the <code class="language-plaintext highlighter-rouge">-l</code> option.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# lsblk <span class="nt">-l</span>
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 48.8G 0 disk
sda1 8:1 0 500M 0 part /boot
sda2 8:2 0 48.3G 0 part
vg_sandbox-lv_root <span class="o">(</span>dm-0<span class="o">)</span> 253:0 0 43.5G 0 lvm /
vg_sandbox-lv_swap <span class="o">(</span>dm-1<span class="o">)</span> 253:1 0 4.9G 0 lvm <span class="o">[</span>SWAP]
</code></pre></div></div>
<p>Here,</p>
<blockquote>
<ul>
<li><strong>NAME</strong> is the name of the device,</li>
<li><strong>MAJ:MIN</strong> is the major:minor number of the device</li>
<li><strong>RM</strong> tells whether it is a removable device</li>
<li><strong>SIZE</strong> is the size of the device in human-readable format</li>
<li><strong>RO</strong> tells whether it is a read-only device</li>
<li><strong>TYPE</strong> is the device type</li>
<li><strong>MOUNTPOINT</strong> is the location where the device is mounted.</li>
</ul>
</blockquote>
<h3 id="to-see-device-size-in-bytes">To see device size in bytes</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# lsblk <span class="nt">-b</span>
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 52428800000 0 disk
|-sda1 8:1 0 524288000 0 part /boot
<span class="sb">`</span><span class="nt">-sda2</span> 8:2 0 51903463424 0 part
|-vg_sandbox-lv_root <span class="o">(</span>dm-0<span class="o">)</span> 253:0 0 46657437696 0 lvm /
<span class="sb">`</span><span class="nt">-vg_sandbox-lv_swap</span> <span class="o">(</span>dm-1<span class="o">)</span> 253:1 0 5242880000 0 lvm <span class="o">[</span>SWAP]
</code></pre></div></div>
<h3 id="to-see-filesystem-information">To see filesystem information</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# lsblk <span class="nt">-fl</span>
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
sda1 ext4 8ed32b8c-b23a-423b-b96f-29eaa1303ae1 /boot
sda2 LVM2_member 6CXjrD-6st6-olYP-BQAK-psA0-dS3T-8KeIRU
vg_sandbox-lv_root <span class="o">(</span>dm-0<span class="o">)</span> ext4 d6e7730a-608a-4e67-8814-131e23411619 /
vg_sandbox-lv_swap <span class="o">(</span>dm-1<span class="o">)</span> swap dc07cc2c-1b35-4b06-a52b-c0d162669afe <span class="o">[</span>SWAP]
</code></pre></div></div>
<p>Here</p>
<blockquote>
<ul>
<li><strong>FSTYPE</strong> is filesystem type</li>
<li><strong>LABEL</strong> is filesystem label</li>
<li><strong>UUID</strong> is filesystem UUID</li>
</ul>
</blockquote>
<h3 id="to-see-device-permissions">To see device permissions</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# lsblk <span class="nt">-m</span>
NAME SIZE OWNER GROUP MODE
sda 48.8G root disk brw-rw----
|-sda1 500M root disk brw-rw----
<span class="sb">`</span><span class="nt">-sda2</span> 48.3G root disk brw-rw----
|-vg_sandbox-lv_root <span class="o">(</span>dm-0<span class="o">)</span> 43.5G root disk brw-rw----
<span class="sb">`</span><span class="nt">-vg_sandbox-lv_swap</span> <span class="o">(</span>dm-1<span class="o">)</span> 4.9G root disk brw-rw----
</code></pre></div></div>
<p>Here,</p>
<blockquote>
<ul>
<li><strong>OWNER</strong> is the user that owns the device node</li>
<li><strong>GROUP</strong> is the group that owns the device node</li>
<li><strong>MODE</strong> is device permissions</li>
</ul>
</blockquote>
<h3 id="to-see-device-topology-information">To see device topology information</h3>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">[</span>root@sandbox ~]# lsblk <span class="nt">-tl</span>
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA
sda 0 512 0 512 512 1 cfq 128 128
sda1 0 512 0 512 512 1 cfq 128 128
sda2 0 512 0 512 512 1 cfq 128 128
vg_sandbox-lv_root <span class="o">(</span>dm-0<span class="o">)</span> 0 512 0 512 512 1 128 128
vg_sandbox-lv_swap <span class="o">(</span>dm-1<span class="o">)</span> 0 512 0 512 512 1 128 128
</code></pre></div></div>
<p>Here,</p>
<blockquote>
<ul>
<li><strong>ALIGNMENT</strong> is alignment offset of device</li>
<li><strong>MIN-IO</strong> is minimum I/O size</li>
<li><strong>OPT-IO</strong> is optimal I/O size</li>
<li><strong>PHY-SEC</strong> is physical sector size</li>
<li><strong>LOG-SEC</strong> is logical sector size</li>
<li><strong>ROTA</strong> tells whether it is a rotational device</li>
<li><strong>SCHED</strong> is name of I/O scheduler</li>
<li><strong>RQ-SIZE</strong> is size of request queue</li>
<li><strong>RA</strong> is read ahead of device.</li>
</ul>
</blockquote>
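<p>One more flag worth knowing: you can select exactly which columns <code class="language-plaintext highlighter-rouge">lsblk</code> prints using the <code class="language-plaintext highlighter-rouge">-o</code> option followed by a comma-separated list of column names, for example:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[root@sandbox ~]# lsblk -o NAME,SIZE,FSTYPE,TYPE,MOUNTPOINT
</code></pre></div></div>

<p>This prints only the requested columns, which is handy when scripting against the output.</p>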
<h1>How to split a string on first occurrence of character in Hive</h1>
<p>In this article we will see how to split a string in Hive on the first occurrence of a character. Let’s say you have strings like apl_finance_reporting or org_namespace, where you want to split out the org (i.e. the string before the first occurrence of ‘_’) or the namespace (the string after the first ‘_’).</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive> create table testSplit<span class="o">(</span>namespace string<span class="o">)</span><span class="p">;</span>
hive> insert into table testSplit values <span class="o">(</span><span class="s2">"scp_apl_finance"</span><span class="o">)</span><span class="p">;</span>
hive> insert into table testSplit values <span class="o">(</span><span class="s2">"apl_finance_reporting"</span><span class="o">)</span><span class="p">;</span>
hive> <span class="k">select </span>namespace from testSplit<span class="p">;</span>
OK
scp_apl_finance
apl_finance_reporting
Time taken: 0.118 seconds, Fetched: 2 row<span class="o">(</span>s<span class="o">)</span>
hive> <span class="k">select </span>regexp_extract<span class="o">(</span>namespace, <span class="s1">'^(.*?)(?:_)(.*)$'</span>, 0<span class="o">)</span> from testSplit<span class="p">;</span>
OK
scp_apl_finance
apl_finance_reporting
Time taken: 0.064 seconds, Fetched: 2 row<span class="o">(</span>s<span class="o">)</span>
</code></pre></div></div>
<p>To get the list of all orgs, we can execute the following query:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive> <span class="k">select </span>regexp_extract<span class="o">(</span>namespace, <span class="s1">'^(.*?)(?:_)(.*)$'</span>, 1<span class="o">)</span> from testSplit<span class="p">;</span>
OK
scp
apl
Time taken: 0.056 seconds, Fetched: 2 row<span class="o">(</span>s<span class="o">)</span>
</code></pre></div></div>
<p>And to get the list of all namespaces, use the following one:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive> <span class="k">select </span>regexp_extract<span class="o">(</span>namespace, <span class="s1">'^(.*?)(?:_)(.*)$'</span>, 2<span class="o">)</span> from testSplit<span class="p">;</span>
OK
apl_finance
finance_reporting
Time taken: 0.066 seconds, Fetched: 2 row<span class="o">(</span>s<span class="o">)</span>
</code></pre></div></div>
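<p>If you prefer to avoid regular expressions, the same split can be done with Hive’s built-in string functions. Here is a sketch against the same testSplit table, using the standard <code class="language-plaintext highlighter-rouge">split</code>, <code class="language-plaintext highlighter-rouge">instr</code>, and <code class="language-plaintext highlighter-rouge">substr</code> UDFs; it should return the same org and namespace values as the regex approach above, but verify the behaviour on your Hive version:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hive> select split(namespace, '_')[0] from testSplit;
OK
scp
apl
hive> select substr(namespace, instr(namespace, '_') + 1) from testSplit;
OK
apl_finance
finance_reporting
</code></pre></div></div>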
<h1>Linux command for Base64 encode and decode</h1>
<p>Linux has a base64 command to encode and decode using the Base64 representation. Here is an example.
To encode the string <code class="language-plaintext highlighter-rouge">Chetna Chaudhari</code> you can use the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Chetna Chaudhari"</span> | <span class="nb">base64
</span><span class="nv">Q2hldG5hIENoYXVkaGFyaQo</span><span class="o">=</span>
</code></pre></div></div>
<p>You can enable debug mode using the <code class="language-plaintext highlighter-rouge">-d</code> flag to see more details (note that the debug output shown here comes from the macOS base64 utility; on GNU/Linux, <code class="language-plaintext highlighter-rouge">-d</code> is short for <code class="language-plaintext highlighter-rouge">--decode</code>):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Chetna Chaudhari"</span> | <span class="nb">base64</span> <span class="nt">-d</span>
May 16 10:56:35 Chetna.local <span class="nb">base64</span><span class="o">[</span>26454] <Info>: Read 17 bytes.
May 16 10:56:35 Chetna.local <span class="nb">base64</span><span class="o">[</span>26454] <Info>: Wrote 24 bytes.
<span class="nv">Q2hldG5hIENoYXVkaGFyaQo</span><span class="o">=</span>
</code></pre></div></div>
<p>To decode the encoded text,</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo </span><span class="nv">Q2hldG5hIENoYXVkaGFyaQo</span><span class="o">=</span> | <span class="nb">base64</span> <span class="nt">--decode</span>
Chetna Chaudhari
</code></pre></div></div>
<p>You can check more details using the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo </span><span class="nv">Q2hldG5hIENoYXVkaGFyaQo</span><span class="o">=</span> | <span class="nb">base64</span> <span class="nt">-d</span> <span class="nt">--decode</span>
May 16 10:56:37 Chetna.local <span class="nb">base64</span><span class="o">[</span>26431] <Info>: Read 25 bytes.
May 16 10:56:37 Chetna.local <span class="nb">base64</span><span class="o">[</span>26431] <Info>: Decoded to 17 bytes.
Chetna Chaudhari
May 16 10:56:37 Chetna.local <span class="nb">base64</span><span class="o">[</span>26431] <Info>: Wrote 17 bytes.
</code></pre></div></div>
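<p>On a GNU/Linux system, base64 can also work directly on files. A small sketch (the file paths are made up for illustration):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo "Chetna Chaudhari" > /tmp/name.txt
$ base64 /tmp/name.txt > /tmp/name.b64
$ cat /tmp/name.b64
Q2hldG5hIENoYXVkaGFyaQo=
$ base64 --decode /tmp/name.b64
Chetna Chaudhari
</code></pre></div></div>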