Big Data: A Wing of Modern Data-Associated Problems

Aditisinha
7 min readSep 16, 2020

I was yesterday years old when I got to know that Big Data is not a technology name at all; on the contrary, it is the name given to a problem faced by almost every field, be it social media, journalism, film making, or IoT (the Internet of Things), and that problem is data-related.

Big Data is not a single problem at all; it is the name given to an umbrella of problems we face when storing, processing, retrieving, uploading, and analyzing data, and what not.

Let's try to visualize the amount of data to grasp the actual gist of this problem:

Single binary digit (1 or 0) → Byte (8 bits) → Kilobyte (1,024 bytes) → Megabyte (1,024 kilobytes) → Gigabyte (1,024 megabytes) → Terabyte (1,024 gigabytes) → Petabyte (1,024 terabytes) → Exabyte (1,024 petabytes) → Zettabyte (1,024 exabytes) → Yottabyte (1,024 zettabytes) → Brontobyte (1,024 yottabytes)… and still this ladder is incomplete. Nowadays, the world is generating data on the scale of zettabytes; just imagine the storage requirement!
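The 1,024× ladder above can be sketched in a few lines of code; this is a minimal illustration that converts a raw byte count into the largest unit that fits (the unit names are the standard abbreviations, stopping at yottabytes for brevity):

```python
# Climb the 1,024x ladder: divide by 1024 until the value fits in a unit.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

print(human_readable(1024))           # 1.0 KB
print(human_readable(500 * 1024**4))  # 500.0 TB -- Facebook-scale daily intake
```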

There are three big collectors of data: companies, governments, and the police and security services. Consumers may have grown accustomed to this data collection for various purposes, like analysis, revenue, etc.

Let's look at some companies in various sectors and their actual stats about the amount of data they receive:

In Social Media:

Facebook's big stats on big data: its systems process 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.

Similarly, every social media platform, be it Instagram, Twitter, or YouTube, generates lots of data per day. Snapchat is another social media application, but it saves some of its storage requirement because of its policies:

Snapchat doesn't deliberately store Snaps for longer than it needs to run the service, but that does mean a Snap could sit on their servers for up to 30 days. From the Snapchat privacy policy: Snapchat lets you capture what it's like to live in the moment. On their end, that means they automatically delete the content of your Snaps (the photo and video messages you send your friends) from their servers after they detect that a Snap has been opened by all recipients or has expired.

In the field of mass communication/journalism: big news events generate a lot more data, and you don't know when they are going to happen, so more terabytes everywhere is always better. That said, all these operations run with their storage at 75–90% full, because nobody likes to delete news that could pop up again; and while the archives grow constantly, even restoring from an archive is often seen as painfully slow (since it's not instant) when content is required for a breaking news story.

In IoT: the digitized world of today presents us with an issue we've never faced before. Each and every little device in our homes is either already connected, or soon will be connected, to the Internet of Things (IoT), and that means it is able to collect data. The amount of data generated through these devices is humongous, and it requires constant I/O operations, all within a very short span of time.

The above example focuses on I/O operations; let's discuss them a bit more:

Sending data to a hard disk is known as the output process. Retrieving data from a hard disk is known as the input process. So anything you do involving the hard disk, which is the primary resource for storing data permanently, is known as an I/O operation, or I/O processing. And this is a time-consuming process.
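The two directions can be seen in a minimal sketch (using a throwaway temp file, so the path here is just an illustration): writing the file is the output process, reading it back is the input process.

```python
import os
import tempfile

# Output process: send data to disk.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("hello, big data")

# Input process: retrieve data from disk.
with open(path) as f:
    data = f.read()

print(data)  # hello, big data
```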

Let's take a deep dive into the series of problems we face in the Big Data world:

As is clear from the examples above, a humongous amount of data is generated each day, and to store it we need resources that keep data permanently. Suppose a company receives 1 petabyte of data, but it only has terabyte-sized hard disks available. No problem! The company can buy enough hard disks to hold 1 petabyte. This is the Volume (storage) problem, and it is easily solved by buying as much disk capacity as we require.

Now the data is stored permanently, and a client comes along with a requirement to retrieve certain data from that 1 petabyte of storage. Just imagine the time it would take a hard disk to perform this I/O operation. Imagine telling your client to wait 3–4 days to fulfill the requirement! The client will no longer be your client.
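A back-of-envelope calculation shows why the wait is so painful. Assuming a sustained sequential read speed of 100 MB/s (a rough figure for a single spinning disk; the real number varies by drive), one pass over a petabyte takes months, not hours:

```python
# Rough estimate: time for one sequential pass over 1 PB at an
# assumed sustained read speed of 100 MB/s on a single disk.
PETABYTE = 1024 ** 5       # bytes
SPEED = 100 * 10 ** 6      # bytes per second (assumption)

seconds = PETABYTE / SPEED
print(f"{seconds / 86400:.0f} days")  # roughly 130 days
```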

So, to improve the processing time, suppose we buy hard disks with faster I/O speeds; say we replace SATA drives with SSDs, and the speed improves considerably. This is the Velocity problem (the speed of I/O operations), and to some extent we have solved it too.

But now a new problem arises: cost. A company can't spend a huge amount just on hard-disk requirements, as it has to spend more on its business logic.

And hence, this is a series of problems, one after another. Big Data is an umbrella of problems: Volume (storage related), Velocity (speed related), and then Cost (resources).

Objectives:

  • Store all the data permanently.
  • Keep the velocity of I/O operations high.
  • Reduce cost.

And all the above requirements can be fulfilled through one single solution: Distributed Storage. Let's try to visualize a scenario:

Suppose we have a single file of 30 GB, but we can store only 10 GB of data on our laptop. Bringing in 2 more laptops fulfills our Volume objective, as altogether we now have 30 GB of storage. While storing the file, we can split it into 3 parts and store each 10 GB part on a separate laptop. Each part is stored at the same time, that is, in parallel, so storing the data takes roughly a third of the time it would take on a single laptop. This improves the velocity of I/O operations. Want more speed? Increase the number of splits, increase the number of laptops, and the speed scales with the number of machines. And since no hard-disk upgrade is needed, we save on cost too.
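The split-and-store-in-parallel idea above can be sketched at toy scale. Here three temporary directories stand in for the three laptops, and threads stand in for the parallel writes (in a real cluster each write would go over the network to a different machine):

```python
import os
import tempfile
import threading

# Pretend this byte string is our 30 GB file.
data = b"x" * 30
targets = [tempfile.mkdtemp() for _ in range(3)]  # stand-ins for 3 laptops

def store(chunk: bytes, directory: str, index: int) -> None:
    # Each "laptop" receives and writes its own chunk.
    with open(os.path.join(directory, f"part{index}"), "wb") as f:
        f.write(chunk)

# Split into equal chunks and write them all at the same time.
size = len(data) // len(targets)
threads = [
    threading.Thread(target=store, args=(data[i * size:(i + 1) * size], d, i))
    for i, d in enumerate(targets)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Reassembling the parts gives the original data back.
parts = [
    open(os.path.join(d, f"part{i}"), "rb").read()
    for i, d in enumerate(targets)
]
print(b"".join(parts) == data)  # True
```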

Hence, rather than storing the entire data on a single laptop, we used multiple laptops to store the data after splitting it. Distributing your data across different storage devices is known as distributed storage. Here we face no hard limit on volume: as we need to store more and more data, we can buy more laptops, and as the number of laptops increases, the velocity increases too (they are directly proportional).

Distributed storage requires a network to connect the laptops. These storage contributors are known as slaves, and the system/laptop to which they contribute is known as the master. Together this topology forms a master-slave model and provides storage using the distributed-storage technique. The machines look like a single hard disk and work like a team; hence it is also known as a distributed storage cluster.
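A toy picture of the master's role (the chunk and laptop names here are made up for illustration): the master only keeps the bookkeeping of *where* each chunk lives, while the slaves hold the actual bytes.

```python
# The master's metadata: which slave holds which part of the file.
chunk_locations = {
    "bigfile.part1": "laptop-1",
    "bigfile.part2": "laptop-2",
    "bigfile.part3": "laptop-3",
}

def locate(chunk: str) -> str:
    # A client asks the master where a chunk is, then fetches it
    # directly from that slave.
    return chunk_locations[chunk]

print(locate("bigfile.part2"))  # laptop-2
```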

Many tech giants use this concept, e.g. Facebook, Yahoo, etc.

Distributed storage is a concept through which we solve multiple data-associated issues. To implement this concept we need a software product that can create such a cluster, and the Hadoop Distributed File System is one such popular product.

The distributed file system that serves Facebook is mainly the Hadoop Distributed File System (HDFS), which is designed to run on low-cost hardware while being highly fault-tolerant. HDFS is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the system can scale dynamically: resources grow on demand while remaining economical at every size, and the system stays available and reliable. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data; a typical file in HDFS is gigabytes to terabytes in size.
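HDFS implements the splitting idea from our laptop scenario by dividing every file into fixed-size blocks (128 MB is the default block size in recent Hadoop versions) and spreading those blocks across the cluster's slave nodes. A quick sketch of the block arithmetic:

```python
import math

# Default HDFS block size in recent Hadoop versions: 128 MB.
BLOCK_SIZE = 128 * 1024 ** 2

def num_blocks(file_size: int) -> int:
    # A file is split into ceil(size / block_size) blocks; the last
    # block may be smaller than the rest.
    return math.ceil(file_size / BLOCK_SIZE)

print(num_blocks(1 * 1024 ** 3))  # a 1 GB file -> 8 blocks
print(num_blocks(1 * 1024 ** 4))  # a 1 TB file -> 8192 blocks
```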

All these concepts I got to know under the guidance of Mr. Vimal Daga. Thank you so much!
