Big data involves collecting data from remote sensing devices and networks, Internet-powered data streams, systems, devices, and many other sources, which together produce massively heterogeneous and continuous data streams. Designing solutions that effectively store, index, and query these data sources poses big challenges. Big Data properties are commonly referred to as the 6Vs: volume, velocity, variety, veracity, variability, and value.
I have been asked numerous times how a developer can get into the Big Data development space. Although there is no single right answer, I have laid out an approach you can consider taking. Depending on when you read this post, many Big Data players are moving toward solutions that are very easy for developers to use, built around SQL-like languages.
I would encourage developers to start with Hive, move on to Pig, and then to Scala or Python.
Hive — Using a query language based on SQL (HiveQL), you write SQL-style queries that Hive transforms into MapReduce jobs. This is a good transition if you are coming from the relational database world.
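To make that concrete, here is a minimal HiveQL sketch. The `page_views` table and its columns are invented purely for illustration; the point is that the query looks like ordinary SQL while Hive compiles it into MapReduce jobs behind the scenes.

```sql
-- Hypothetical table of raw web logs (table and column names are illustrative)
CREATE TABLE page_views (
  user_id STRING,
  url     STRING,
  view_ts TIMESTAMP
);

-- A familiar SQL-style aggregation; Hive turns this into MapReduce work for you
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

If you can write this in a relational database, you can write it in Hive.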
Pig — A procedural dataflow language (Pig Latin) that compiles into MapReduce jobs and can be extended with UDFs written in Java or Python. A great candidate for simple data analysis tasks.
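Here is the same top-URLs task sketched in Pig Latin, to show the procedural, step-by-step style. The input path and schema are assumptions for the example, not a real dataset.

```pig
-- Load a hypothetical tab-delimited log file (path and schema are illustrative)
views = LOAD '/data/page_views.tsv'
        USING PigStorage('\t')
        AS (user_id:chararray, url:chararray);

-- Unlike SQL's single declarative statement, Pig builds the result step by step
grouped = GROUP views BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(views) AS n;
top10   = ORDER counts BY n DESC;
result  = LIMIT top10 10;
DUMP result;
```

Each line names an intermediate relation, which makes it easy to inspect and debug one stage of the dataflow at a time.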
Scala — A full general-purpose programming language for Big Data developers. A more complex language to learn, but very powerful.
Start by playing around with basic ETL tasks using Hive or Pig, then move on to Scala. Python is also a strong candidate; with the many Python libraries available for data analysis and ML, you can't go wrong with it.
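As a feel for what a basic ETL task looks like in Python, here is a toy extract-transform-load pipeline using only the standard library. The log format, field names, and helper functions are all invented for illustration; a real pipeline would read from files or a stream and load into a database.

```python
from collections import Counter

# Toy extract-transform-load: parse raw log lines, filter, then aggregate.
# The log format here (user, url, HTTP status) is invented for illustration.
raw_logs = [
    "alice /home 200",
    "bob /home 200",
    "alice /cart 404",
    "carol /home 200",
]

def parse(line):
    """Extract: split a raw line into (user, url, status)."""
    user, url, status = line.split()
    return user, url, int(status)

def transform(records):
    """Transform: keep only successful (HTTP 200) page views."""
    return [(user, url) for user, url, status in records if status == 200]

def load(pairs):
    """Load: aggregate page views per URL (a stand-in for a real data sink)."""
    return Counter(url for _, url in pairs)

counts = load(transform(parse(line) for line in raw_logs))
print(counts["/home"])  # three successful views of /home
```

The same extract/transform/load shape carries over directly when you swap the standard library for pandas or PySpark on bigger data.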
For a quick learning road map, follow the bottom-up approach shown in the diagram below.