My learnings about Big Data
Sharing some interesting things I've learned recently
Hi folks! Yesterday I joined a Coursera course on Big Data. It is taught by the University of California, San Diego. I thought it would be a good idea to write a blog post about what I learn, so you can get an idea of what to expect if you decide to enrol as well.
I will keep posting notes here while doing the course, so don't expect a very organised article until the end!
Week 1: Welcome to Big Data
- There's a huge need for data scientists
- 90% of the world's data has been created in the last two years!
- In 2009 we had 0.8 ZB (zettabytes) of data, and by 2020 we will have an estimated 20 ZB
- It's growing very rapidly: that is a 25-fold increase in 11 years
- The majority of Big Data out there is unstructured
- It is generated from everywhere around us (mobiles, GPS, etc)
- Companies need to capture information about their products, services, customers, pricing, segmentation, social networks, etc, to gain insight from the data
- Gather, store and manipulate large amounts of data at the right speed and at the right time
- There's a lot of untapped value in Big Data
- Predictive and deep analytics
- Come up with answers that improve ROI, or understand our customers: learn their habits and predict their future behaviour
- Functional requirements: collection, integration, organisation, analysis (statistical, summary, predictive, ...), management, action and decisions
- Big Data stack: analyse then offer some focussed services on top of those analytics
- Tools that provide fast, scalable access to the data then push that to the analytics stack
- Many different areas of booming new technologies. Crowded and diverse space
- Marketing companies are at the forefront
- Needs: real-time, scalable, high-performance analytics on large datasets
- Bring storage capacity and computational capacity together: Hadoop
- Apache Hadoop: open source, low cost, reliable, scalable, distributed computing. From a single server to thousands of machines
- Fault-tolerant, flexible environment (structured or unstructured data)
- Lower layer: Hadoop Distributed File System (HDFS)
- Middle layer: Hadoop MapReduce, a model for large-scale data processing
- Top layer: we can have software like Pig, Hive, Mahout, etc to manipulate the data through the MapReduce processes
- Minimize data movement
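- This is how MapReduce works: a map step turns the input records into key-value pairs, the framework groups the pairs by key, and a reduce step aggregates the values of each group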
- We will learn how to submit MapReduce jobs
- In Hadoop 2.0 we have YARN, a resource manager that lets processing frameworks other than MapReduce run on the same cluster
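To make the MapReduce model from these notes more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and it does not come from the course: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are my own assumptions, and everything runs in a single process instead of being distributed over a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data moves fast"]  # made-up sample input
    print(word_count(docs))
    # prints {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

In a real cluster the map calls would run on the machines that already hold each block of the input, which is where the "minimize data movement" idea comes in. Only the much smaller key-value pairs travel over the network during the shuffle.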
Week 2: Why Big Data?
- Computers are no longer deterministic machines, and they are often not physically available to us
- Bring technologies together to find meaning in large, fast-moving, uncertain data
- Before: relational databases. Now: clickstream
- Machine data is very fast
- Streaming data. IoT. Very fast too
- Before: structured datasets. Now: raw, complex, unstructured
- Going beyond the data warehouse. SQL? HBase, Hive,...
- Expanded 'views' of data: behavioural, social media. The challenge: integration
- Find meaning in the chaos: integration, transformation, load
- Analytics: simple, advanced, statistical
- Predictive dashboards
- Parallelised, distributed, optimised
- Before: sample, do machine learning, build predictive models, score larger data set. Now: just analyse all data and run models? Exploding sample size
- Correlation vs causality: a correlation between two variables does not necessarily mean that one causes the other
- New methods from the research community: deep learning, moving beyond flat files to more complex data
- Past and present. Before: white-coat PhDs with expensive tools. Now: data scientists with open source tools
- Who are data scientists? They need to understand statistics, machine learning, databases and data mining, and know how to query, sort and visualise data, ...
- Communication skills. Understand the domain.
- Ask the right questions that will bring value to the business
- Intellectual curiosity, intuition, communication and engagement, presentational skills, creativity, and business savvy. Interact with business analysts.
- Data preparation, understanding, modelling
- Need to code, create equations
- Most successful data scientists have substantial, deep expertise in at least one aspect of data science: statistics, machine learning, Big Data or business communication
- Data science is inherently collaborative and creative
- Curriculum topics: Data manipulation at scale, Analytics, Communicating results
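The correlation vs causality point in these notes is easier to see with numbers. Below is a small sketch, assuming a toy scenario of my own (it is not from the course): temperature drives both ice cream sales and sunburn cases, so the two end up strongly correlated even though neither causes the other. All the figures are invented for illustration, and `pearson` is a hand-written helper, not a library function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: the hidden common cause is the daily temperature.
temperature = [15, 18, 21, 24, 27, 30, 33]
# Both series depend on temperature plus a small fixed "noise" term.
ice_cream_sales = [36, 39, 47, 55, 58, 66, 70]
sunburn_cases = [4, 6.5, 7.5, 8.5, 11, 12, 14]

if __name__ == "__main__":
    r = pearson(ice_cream_sales, sunburn_cases)
    # r is close to 1 although ice cream does not cause sunburn.
    print(f"ice cream vs sunburn: r = {r:.2f}")
```

A model trained only on the two correlated series would predict well, but it would not explain anything. Finding the common cause takes domain knowledge, which is one reason the notes stress understanding the domain.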