We talked about how data science is about the exploration of data, and we have tools readily available that let us jump into it head-on.
One of the technologies we use is called Spark, a development platform and computational engine for cluster computing. We run analyses using Spark. Within Spark, there is also a capability that lets us do something new with data: evaluating each step of an analysis interactively, as we type it. For example, if you enter a mathematical expression, say 7 + 8, instead of having to write a program and run it as you typically would with other software, Spark immediately calculates the answer and displays it on the screen. Imagine building a complicated mathematical model where each portion is computed automatically! It saves a lot of time and allows you to catch any errors as you go.
This capability within Spark is called a REPL (Read-Evaluate-Print Loop). For the geeks in the house, it is what we call an interactive shell.
Traditional programs require you to write code and formulas, execute them, and then review the output. With a REPL, every command you enter is processed immediately, which delivers value faster, and the interactivity makes it more fun to work with.
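The loop described above can be sketched in just a few lines. This minimal Python example (the function name `repl_step` is ours for illustration, not part of Spark) shows the core idea: each expression is read, evaluated, and its result printed right away, with no separate compile-and-run cycle.

```python
# A minimal sketch of a Read-Evaluate-Print Loop, assuming the input
# is a plain arithmetic expression. Real shells like Spark's add
# history, error recovery, and distributed execution on top of this.

def repl_step(expression: str) -> str:
    """Read one expression, evaluate it, and return its printed form."""
    # Evaluate with no built-in functions exposed, to keep the
    # sketch restricted to simple expressions like "7 + 8".
    result = eval(expression, {"__builtins__": {}})
    return str(result)

# Each input is processed the moment it is entered:
print(repl_step("7 + 8"))   # displays 15 immediately
```

A full interactive shell simply wraps a call like this in a loop that keeps reading new input until you quit, which is exactly what the "Loop" in REPL refers to.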
Tools like this have made data exploration easier because they help you discover interesting nuggets of information immediately. This is very helpful when you're not sure what you're looking for, as often happens in exploration.
It lets you try things out, since it gives you the output right away. You can check whether the result is usable, adjust the parameters, and run it again until you get what you're looking for, eventually turning the code and formulas you tested along the way into a finished program.
It makes data scientists like us very happy because it saves time and makes our work more efficient!