Creating a new file system in Linux

Here how you would create a new file system :

  1. First create a new partition using fdisk. eg:
    fdisk /dev/sdb     –> Options (m , n,p,1,  ,t,8e,w)
  2. Create volume group
    vgcreate myvg /dev/sdb1
  3. Create a logical volume
    lvcreate -L 512G -n my_lv myvg
  4. mkfs.ext4 /dev/myvg/my_lv
  5. You can optionally add the file system in /etc/fstab to automatically mount
  6. Mount /myfilesystem
Posted in how to, linux | Tagged , , , , | Leave a comment

Drill your data with Apache Drill !

Feeling the Drill

I have been using Apache Drill to explore data for a while now. Apache Drill is a low latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data. Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.  To be clear Drill is not limited to Hadoop, you can query NoSQL databases like MongoDB, Hbase or cloud storage like Amazon S3, Azure Blob Storage  or even local files on your computer.  I have it installed on my laptop and use it as embedded mode to query my txt and cvs files. Apache Drill can be installed on Windows, Linux and MacOS with JDK.

Drill data like a table even when its not – schema on read

Drill is based on schema on read, meaning unlike traditional query engines that requires to have a predefined schema and structure, drill lets you define schema as you query the data. Cool uh ? Wait there is more  with Drill there’s no need to load the data or transform the data before it can be processed. Simply, point the query to the file or database you want to query and start querying the data.
For instance lets say you have a file  customers.csv  on a directory  /data/customer/. Once you have Drill installed (which takes about 3 mins) all you have to from a Drill prompt is :
select * from dfs./data/customer/customers.csv`;  and drill get you the data. You can even bring specific columns :
select column[0],column[1],column[6] from dfs./data/customer/customers.csv`

Drill also allows you to query against wild card files :
select * from dfs./data/orders/orders-08-*-2016.csv`
Drill lets you create views and static tables to even increase ease of use and improve performance.  You can check out the documentation for more options.

In love with your query or BI tool ? No problemo

Apache Drill supports standard SQL. So you can continue to use your favorite query tools and SQL that you have been using. Drill supports ODBC and JDBC drivers, so you it will let you access Drill using tool of your choice.  Data users can use standard BI/analytics tools such as Tableau, Qlik, MicroStrategy and so on to interact with non-relational datastores by leveraging Drill’s JDBC and ODBC drivers. Developers can leverage Drill’s simple REST API in their custom applications to create beautiful visualizations.  Drill comes with a web interface when you install in distributed mode. Drill also provides a native tool called Drill Explorer which I find really useful. You can find all the details on how to configure your tool to access Drill in the documentation.

Lets get it going …

Apache Drill is easy to download and run Drill on your computer . It runs on all standard OS and takes few minutes to install. Drill can also be installed on a cluster of servers to serve a scalable and high performance execution engine.  Drill has two install options:
1.  Installing in Embedded mode
2. Installing in Distributed mode.

Installing in your computer that has JDK installed involves:
1. Downloading the tar file
2. Untar the file
3. cd to the apache-drill<version>
4. run  bin/drill-embedded (Mac and Linux) . On windows : C:\bin\sqlline sqlline.bat –u “jdbc:drill:zk=local;schema=dfs”


Drill in to your data with Apache Drill and hopefully you will enjoy drilling as much as I do.


Posted in BigData, database, hadoop | Tagged , , | Leave a comment

Script check the index fragmentation in SQL Server

Today I had a need to find the index fragmentation on one of my sql server database and I found this script buried on my computer, thought I will share it here:

DECLARE @DatabaseID int

SET @DatabaseID = DB_ID()

SELECT DB_NAME(@DatabaseID) AS DatabaseName,
schemas.[name] AS SchemaName,
objects.[name] AS ObjectName,
indexes.[name] AS IndexName,
objects.type_desc AS ObjectType,
indexes.type_desc AS IndexType,
dm_db_index_physical_stats.partition_number AS PartitionNumber,
dm_db_index_physical_stats.page_count AS [PageCount],
dm_db_index_physical_stats.avg_fragmentation_in_percent AS AvgFragmentationInPercent
FROM sys.dm_db_index_physical_stats (@DatabaseID, NULL, NULL, NULL, ‘LIMITED’) dm_db_index_physical_stats
INNER JOIN sys.indexes indexes ON dm_db_index_physical_stats.[object_id] = indexes.[object_id] AND dm_db_index_physical_stats.index_id = indexes.index_id
INNER JOIN sys.objects objects ON indexes.[object_id] = objects.[object_id]
INNER JOIN sys.schemas schemas ON objects.[schema_id] = schemas.[schema_id]
WHERE objects.[type] IN(‘U’,’V’)
AND objects.is_ms_shipped = 0
AND indexes.[type] IN(1,2,3,4)
AND indexes.is_disabled = 0
AND indexes.is_hypothetical = 0
AND dm_db_index_physical_stats.alloc_unit_type_desc = ‘IN_ROW_DATA’
AND dm_db_index_physical_stats.index_level = 0
AND dm_db_index_physical_stats.page_count >= 1000


Best practice, you rebuild the index if the fragmentation is more than 40 % , if not a reorganization will suffice.

Posted in database, how to, sql server | Tagged , , , | Leave a comment

SQL Server Replication status

Few days back I had to query to my production sql replication server to find what all publications are running and what is the status for each of those. A search on the web found me this (unfortunately I don’t know the author of this to give credit ) :
WHEN mdh.runstatus = ‘1’ THEN ‘Start – ‘+cast(mdh.runstatus as varchar)
WHEN mdh.runstatus = ‘2’ THEN ‘Succeed – ‘+cast(mdh.runstatus as varchar)
WHEN mdh.runstatus = ‘3’ THEN ‘InProgress – ‘+cast(mdh.runstatus as varchar)
WHEN mdh.runstatus = ‘4’ THEN ‘Idle – ‘+cast(mdh.runstatus as varchar)
WHEN mdh.runstatus = ‘5’ THEN ‘Retry – ‘+cast(mdh.runstatus as varchar)
WHEN mdh.runstatus = ‘6’ THEN ‘Fail – ‘+cast(mdh.runstatus as varchar)
ELSE CAST(mdh.runstatus AS VARCHAR)
END) [Run Status],
mda.subscriber_db [Subscriber DB],
mda.publication [PUB Name],
right(left(,LEN(, LEN(left(,LEN( when mda.publisher_db=’ALL’ then 1 else LEN(mda.publication)+2 end))) [SUBSCRIBER],
CONVERT(VARCHAR(25),mdh.[time]) [LastSynchronized],
und.UndelivCmdsInDistDB [UndistCom],
mdh.comments [Comments],
‘select * from distribution.dbo.msrepl_errors (nolock) where id = ‘ + CAST(mdh.error_id AS VARCHAR(8)) [Query More Info],
mdh.xact_seqno [SEQ_NO],
WHEN mda.subscription_type = ‘0’ THEN ‘Push’
WHEN mda.subscription_type = ‘1’ THEN ‘Pull’
WHEN mda.subscription_type = ‘2’ THEN ‘Anonymous’
ELSE CAST(mda.subscription_type AS VARCHAR)
END) [SUB Type],

mda.publisher_db+’ – ‘+CAST(mda.publisher_database_id as varchar) [Publisher DB], [Pub – DB – Publication – SUB – AgentID]
FROM distribution.dbo.MSdistribution_agents mda
LEFT JOIN distribution.dbo.MSdistribution_history mdh ON mdh.agent_id =
(SELECT s.agent_id, MaxAgentValue.[time], SUM(CASE WHEN xact_seqno > MaxAgentValue.maxseq THEN 1 ELSE 0 END) AS UndelivCmdsInDistDB
FROM distribution.dbo.MSrepl_commands t (NOLOCK)
JOIN distribution.dbo.MSsubscriptions AS s (NOLOCK) ON (t.article_id = s.article_id AND t.publisher_database_id=s.publisher_database_id )
(SELECT hist.agent_id, MAX(hist.[time]) AS [time], h.maxseq
FROM distribution.dbo.MSdistribution_history hist (NOLOCK)
JOIN (SELECT agent_id,ISNULL(MAX(xact_seqno),0x0) AS maxseq
FROM distribution.dbo.MSdistribution_history (NOLOCK)
GROUP BY agent_id) AS h
ON (hist.agent_id=h.agent_id AND h.maxseq=hist.xact_seqno)
GROUP BY hist.agent_id, h.maxseq
) AS MaxAgentValue
ON MaxAgentValue.agent_id = s.agent_id
GROUP BY s.agent_id, MaxAgentValue.[time]
) und
ON = und.agent_id AND und.[time] = mdh.[time]
where mda.subscriber_db<>’virtual’ — created when your publication has the immediate_sync property set to true. This property dictates whether snapshot is available all the time for new subscriptions to be initialized. This affects the cleanup behavior of transactional replication. If this property is set to true, the transactions will be retained for max retention period instead of it getting cleaned up as soon as all the subscriptions got the change.
–and mdh.runstatus=’6′ –Fail
–and mdh.runstatus<>’2′ –Succeed
order by mdh.[time]

The script can be downloaded  here

Posted in database, how to, sql server | Tagged , , , , | Leave a comment