Introduction

Big Data was a buzzword a few decades ago, but today we are all reaping the benefits of Big Data processing systems. Here are some examples:

Face Detection

Early face detection first appeared on Facebook in 2011, and in the same year the Google+ social media platform added a similar feature: when you uploaded a group picture, the platform would search with each face in the picture and pull up the matching person's social media page, so you could send them a friend request.

Now, of course, facial recognition is an integral part of many systems!

So the questions that come to mind are:

  • How many pictures will it search against?
  • How can it return the results so fast?

Here are some facts to ponder:

  • Facebook had about 2.8 billion users as of 2020
  • About 250 billion photos had been uploaded to Facebook as of 2020

And facial recognition seems to return results within seconds at most! How is this possible?

Well, traditional RDBMS systems were unable to handle data of this size or meet these processing-time requirements.

They had two main drawbacks:

  • They could not store this kind of data efficiently.
  • They could not process and return results efficiently (quickly enough).

Note the word efficiently here.

A traditional enterprise system will have between 5,000 and 50,000 servers and can handle a few terabytes of data and millions of transactions per day. But on the current social media platforms we talk about millions of servers, petabytes of data, and billions of transactions per day!

Perspective on the scale of social media platforms

  • The Web holds about a trillion pages.
  • Hundreds of millions of Twitter accounts
  • Hundreds of millions of Tweets per day
  • Billions of Google queries each day
  • Millions of servers, petabytes of data!
  • Zettabytes (10^21 bytes) and yottabytes (10^24 bytes) are already here!

Different storage and processing systems were developed by these social media and search giants to handle their business. These came to be known as Big Data processing systems.

Big Data - data sets that are too large for traditional data processing systems (they usually do not fit into RAM) and therefore require new processing technologies.

IBM Watson

While we are on the history of Big Data, one system that deserves recognition is IBM Watson - a question-answering computing system. In 2011, IBM Watson competed on the quiz show Jeopardy! against the then champions Brad Rutter and Ken Jennings and won the first prize of $1 million by understanding questions posed in natural language by a human!

  • It took 20 researchers three years to build this system
  • Scientists fed it 200 million pages of structured and unstructured content, consuming four terabytes of disk storage, including the full text of Wikipedia, encyclopedias, dictionaries, thesauri, newswire articles, and literary works
  • Watson also used databases, taxonomies, and ontologies

Watson was not connected to the Internet during the game, and it had to produce an answer in a matter of seconds to make sure it was first to the buzzer!

How was it built?

Watson's early software was written in various languages, including Java, C++, and Prolog, and used the Apache Hadoop framework for distributed computing, the Apache UIMA (Unstructured Information Management Architecture) framework, IBM's DeepQA software, and the SUSE Linux Enterprise Server 11 operating system. According to IBM, 'more than 100 different techniques are used to analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses'.

Watson could process 500 gigabytes of data, the equivalent of a million books, per second.
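
Hadoop is mentioned above as Watson's distributed-computing layer. To make that idea concrete, here is a minimal MapReduce-style word count sketch in Python. This is not Watson's actual code; the toy documents and function names are invented purely to illustrate the map/shuffle/reduce pattern that frameworks like Hadoop run in parallel across many machines.

```python
from collections import defaultdict

# Toy corpus standing in for one shard of documents (illustrative only).
documents = [
    "watson answers questions in natural language",
    "hadoop distributes the work across many machines",
    "many machines process many documents in parallel",
]

def map_phase(doc):
    """Map: emit (word, 1) pairs for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted counts by key (the word)."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the counts for one word."""
    return word, sum(counts)

# In Hadoop the map and reduce calls below would run in parallel on a cluster;
# here we run them sequentially just to show the data flow.
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_counts["many"])   # -> 3
```

The same three-step pattern scales because each machine can map its own documents and reduce its own share of the words without seeing the rest of the data.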

Case for Commercial Solutions for Big Data Processing

Two trends have made commercial Big Data solutions both necessary and feasible: the explosive growth in digital computing and communications worldwide over the last two decades, and the steady reduction in data storage prices.

Every day, we create 2.5 quintillion bytes of data, and more than 90% of the data in the world today has been created since 2011 (Zikopoulos, 2013). This data comes from many sources:
  • Posts to social media sites
  • Digital pictures and videos
  • Purchase transaction records
  • Cell phone GPS signals etc.

Challenges of Big Data

From a more technological point of view, Big Data refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities.

Volume

Refers to the amount of data being generated. Think in terms of gigabytes, terabytes, and petabytes. Many systems and applications are simply not able to store, let alone ingest or process, that much data. Many factors contribute to the increase in data volume, including transaction-based data stored for years, unstructured data streaming in from social media, and the ever-increasing amounts of sensor and machine data being produced and collected. Volume issues include:

  • Storage cost
  • Filtering and finding relevant, valuable information in large quantities of data that often contain much that is not valuable
  • The ability to analyze data quickly enough to maximize business value today, and not just next quarter or next year
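
As a back-of-the-envelope illustration of the "analyze quickly enough" problem, the sketch below uses assumed round numbers (1 petabyte of data, roughly 200 MB/s of sequential read per disk, 10,000 servers) to show why scanning data at this volume on one machine is hopeless and why the work gets spread across a cluster.

```python
# Rough, assumed figures for illustration only.
data_size_bytes = 1_000_000_000_000_000   # 1 petabyte of data to scan
disk_throughput = 200 * 1024**2           # ~200 MB/s sequential read from one disk
num_servers = 10_000                      # servers each scanning their own slice

single_machine_hours = data_size_bytes / disk_throughput / 3600
cluster_seconds = data_size_bytes / (disk_throughput * num_servers)

print(f"One machine: ~{single_machine_hours:.0f} hours just to read the data")
print(f"{num_servers} machines in parallel: ~{cluster_seconds:.0f} seconds")
```

With these assumptions a single disk needs well over a thousand hours just to read the data once, while ten thousand machines reading in parallel finish the same scan in minutes.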

Velocity

Refers to the rate at which new data is created. Think in terms of megabytes or gigabytes per second. Data is streaming in at unprecedented speed and must be dealt with in a timely manner in order to extract maximum value from it. Sources of this data include logs, social media, RFID tags, sensors, and smart metering. Velocity issues include:

  • Not reacting quickly enough to benefit from the data. For example, the data could drive a dashboard that warns of imminent failure or a security breach; failing to react in time could lead to service outages.
  • Highly inconsistent data flows, with daily, seasonal, or event-triggered peaks in load. For example, a change in political leadership could cause a spike in social media activity.
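
As a toy illustration of velocity, the sketch below (the event format, window size, and alert threshold are all assumed) processes a stream of events one at a time and raises an alert as soon as the rate inside a rolling one-minute window exceeds a limit, rather than waiting for a nightly batch job.

```python
from collections import deque
import time

WINDOW_SECONDS = 60          # look at the last minute of traffic (assumed)
ALERT_THRESHOLD = 1000       # events per window that indicate a spike (assumed)

recent_events = deque()      # timestamps of events inside the window

def handle_event(timestamp):
    """Process one incoming event and check the rolling one-minute rate."""
    recent_events.append(timestamp)
    # Drop events that have fallen out of the window.
    while recent_events and recent_events[0] < timestamp - WINDOW_SECONDS:
        recent_events.popleft()
    if len(recent_events) > ALERT_THRESHOLD:
        print(f"ALERT: {len(recent_events)} events in the last minute")

# In a real system the events would arrive from a log stream or message queue;
# here we feed in synthetic timestamps at 100 events per (simulated) second.
now = time.time()
for i in range(1200):
    handle_event(now + i * 0.01)   # alert fires after ~10 seconds of traffic
```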

Variety

Refers to the number of types of data being generated. Data can be gathered from databases, XML or JSON files, text documents, email, video, audio, stock ticker data, and financial transactions. Varieties of data include structured, semi-structured, and unstructured (schema on read) data. Variety issues include:

  • How to gather, link, match, cleanse, and transform data across systems?
  • How to connect and correlate data relationships and hierarchies in order to extract business value from the data?

Note: Instead of "unstructured", we currently tend to talk in terms of "schema on read": the data is applied to a plan (schema) as it is pulled out of the stored location, rather than as it goes in.
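
Here is a minimal sketch of "schema on read" in Python (the records and field names are invented): raw, semi-structured records are stored exactly as they arrive, and a schema is only imposed when the data is read back for a particular analysis.

```python
import json

# Raw, semi-structured records stored as they arrived -- no schema enforced on write.
raw_lines = [
    '{"user": "alice", "likes": 12, "photo": "a.jpg"}',
    '{"user": "bob", "comment": "nice!", "lang": "en"}',
    '{"user": "carol", "likes": "7"}',          # note: likes stored as a string
]

def read_with_schema(lines):
    """Apply a schema on read: keep only the fields this analysis cares about,
    coerce types, and fill in defaults for anything missing."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": str(record.get("user", "unknown")),
            "likes": int(record.get("likes", 0)),   # coerce "7" -> 7, missing -> 0
        }

for row in read_with_schema(raw_lines):
    print(row)
# {'user': 'alice', 'likes': 12}
# {'user': 'bob', 'likes': 0}
# {'user': 'carol', 'likes': 7}
```

A different analysis could read the very same raw lines with a different schema, which is exactly what makes this approach flexible for highly varied data.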

Veracity

Refers to the quality or trustworthiness of the data to be analyzed. The quality of data depends on factors such as where the data was collected from, how it was collected, and how it will be analyzed. The veracity of a user's data dictates how reliable and significant the data really is.

Web Intelligence Use Cases using Big Data for Business

  • Online Advertising - predicting intent and interest
  • Gauging consumer sentiment
  • Intelligent question answering
  • Image recognition, facial recognition, etc.
  • Personalized genomic medicine

Tangible Benefits

  • Improved customer experience
  • Better fact-based decision making
  • Increased sales
  • Reduced risk
  • More efficient operations
