My learnings about Big Data

19 October 2015 · education

Hi folks! Yesterday I joined a course on Coursera about Big Data. It is taught by the University of California, San Diego. I thought it would be a good idea to write a blog post about what I learn, so you can get an idea of the kind of things you will learn if you decide to enrol as well.

I will keep posting notes here while doing the course, so don’t expect a very organised article until the end!

Week 1: Welcome to Big Data

  • There’s a huge need for data scientists
  • 90% of the world’s data has been created in the last two years!
  • In 2009 we had 0.8 ZB of data and in 2020 we will have 20 ZB
  • It’s growing very rapidly
  • The majority of Big Data out there is unstructured
  • It is generated from everywhere around us (mobiles, GPS, etc)
  • Companies need to capture information about their products, services, customers, pricing, segmentation, social networks, etc, in order to gain insight from the data
  • Gather, store and manipulate large amounts of data at the right speed and the right time
  • There’s a lot of untapped value in Big Data
  • Predictive and deep analytics
  • Come up with answers and improve ROI or try to understand our customers and learn their habits and predict their future behaviours
  • Functional requirements: collect, integrate, organise, analyse (statistical, summary, predictive, ...), manage, take action, make decisions
  • Big Data stack: analyse then offer some focussed services on top of those analytics
  • Tools that provide fast, scalable access to the data then push that to the analytics stack
  • Many different areas of booming new technologies. Crowded and diverse space
  • Marketing companies are at the forefront
  • Needs: real-time, scalable, high performance analytics on large datasets
  • Bring storage capacity and computational capacity together: Hadoop
  • Apache Hadoop: open source, low cost, reliable, scalable, distributed computing. From a single server to thousands of machines
  • Fault tolerant, flexible environment (structured or unstructured data)
  • Lower layer: Hadoop Distributed File System (HDFS)
  • Middle layer: Hadoop MapReduce, a model for large scale data processing
  • Top layer: we can have software like Pig, Hive, Mahout, etc to manipulate the data through the MapReduce processes
  • Minimize data movement
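  • This is how MapReduce works: the input is split into chunks, a map step turns each chunk into key/value pairs, the pairs are grouped by key, and a reduce step combines each group into a result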
  • We will learn how to submit MapReduce jobs
  • In Hadoop 2.0 we have YARN, which separates resource management from MapReduce, so other kinds of processing can run on the same cluster
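
To make the MapReduce idea from the notes above a bit more concrete, here is a minimal sketch of the classic word count in plain Python. It is not Hadoop code and not taken from the course: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are my own, and everything runs in a single process. It only imitates the map, group-by-key and reduce steps that a real cluster would distribute across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, imitating what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["big data is big", "data is everywhere"]
    print(word_count(lines))
    # prints {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The map and reduce functions never share state, which is what lets a framework like Hadoop run many copies of them in parallel, close to where the data is stored.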

Week 2: Why Big Data?

  • Computers are no longer deterministic machines. Not physically available
  • Bring technologies together to find meaning in large, fast-moving, uncertain data
  • Before: relational databases. Now: clickstream
  • Machine data is very fast
  • Streaming data. IoT. Very fast too
  • Before: structured datasets. Now: raw, complex, unstructured
  • Going beyond the data warehouse. SQL? HBase, Hive,…
  • Expanded ‘views’ of data: behavioural, social media. Challenge: integration
  • Find meaning in the chaos: integration, transformation, load
  • Analytics: simple, advanced, statistical
  • Predictive dashboards
  • Parallelised, distributed, optimised
  • Before: sample, do machine learning, build predictive models, score larger data set. Now: just analyse all data and run models? Exploding sample size
  • Correlation vs causality. A correlation does not necessarily explain the cause
  • New methods from research community: deep learning, move beyond flat files to more complex data
  • Past and present. Before: white-coat PhDs with expensive tools. Now: data scientists with open source tools
  • Who are data scientists? Need to understand statistics, machine learning, databases, data mining, how to query, order, visualisation,…
  • Communication skills. Understand the domain.
  • Ask the right questions that will bring the value to the business
  • Intellectual curiosity, intuition, communication and engagement, presentational skills, creativity, business savvy. Interact with business analysts.
  • Data preparation, understanding, modelling
  • Need to code, create equations
  • Most successful data scientists have substantial, deep expertise in at least one aspect of data science: statistics, machine learning, Big Data, business communication
  • Data science is inherently collaborative and creative
  • Curriculum topics: Data manipulation at scale, Analytics, Communicating results
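
The correlation vs causality point from the notes above is easier to see with numbers, so here is a small sketch using only the Python standard library. The scenario is made up by me and is not from the course: a hidden variable (`temperature`) drives two others (`ice_cream` and `sunburn`), which then correlate strongly even though neither causes the other. The helpers `pearson` and `simulate`, and all the numbers, are assumptions for illustration only.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=10_000, seed=42):
    """Correlation between two variables that share a hidden common cause."""
    rng = random.Random(seed)
    # Hidden common cause (made-up values): daily temperature.
    temperature = [rng.gauss(20, 5) for _ in range(n)]
    # Both observed variables depend on temperature plus independent noise;
    # neither one depends on the other.
    ice_cream = [t + rng.gauss(0, 1) for t in temperature]
    sunburn = [t + rng.gauss(0, 1) for t in temperature]
    return pearson(ice_cream, sunburn)


if __name__ == "__main__":
    print(f"correlation: {simulate():.2f}")
```

With these made-up parameters the correlation should come out well above 0.9, yet banning ice cream would obviously not prevent sunburn. The data alone cannot tell you that; you need to understand the domain.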