Data Analytics: Keeping it Clean with a Nod to History
Lessons from the Past
In the Internet’s infancy, Unix shell commands were very terse, such as cp and so on. There was a good reason for this. The poor programmers had to work on so-called ‘teletypes’ (also known as teletypewriters) and it took physical exertion to press the keys down! To exacerbate matters, the devices operated very slowly. For example, the ASR-33 teletype had an input or output rate of only ten characters a second. With this in mind, shell commands and editor commands (such as those of Ken Thompson’s ed) were very much to the point.
However, this is not to say that the Unix shell, as it evolved in the 1970s and onward, was feeble. It was actually a rich environment, since developers could combine the terse shell commands with pipes (the vertical | symbol), which chain commands together, and with redirections such as < and >, which read input from files or send output to files.
For example, this command:
ls -al | grep ^d | head -5 > /tmp/headdir.txt
This command runs ls -al on the left to list files and directories, then pipes the output of that command (see the left-most | pipe symbol) in as the input to the grep command to the first pipe’s right. In the grep pattern ^d, the ^ character is regular expression notation meaning the start of the line. Thus, since the long listing of a directory starts with the letter d, this stage picks up directories only. The second pipe sends the output to the head command, whose -5 parameter keeps only the first five lines. Finally, the top five directories are redirected, using the > symbol, to the file named on the right.
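The pipeline described above can be tried without touching real files. The following is a minimal sketch, assuming a throwaway directory made with mktemp; the names workdir, alpha, beta, gamma and notes.txt are hypothetical and exist only for this illustration.

```shell
# Hypothetical scratch directory, created only for this illustration.
workdir=$(mktemp -d)
mkdir "$workdir/alpha" "$workdir/beta" "$workdir/gamma"
touch "$workdir/notes.txt"

# Same shape as the pipeline in the text: list, keep the lines that start
# with d, keep the first five, and redirect the result to a file.
# (head -n 5 is the portable spelling of head -5.)
ls -al "$workdir" | grep '^d' | head -n 5 > "$workdir/headdir.txt"

# Count the captured lines. The entries "." and ".." are directories too,
# so they are counted alongside alpha, beta and gamma.
dir_lines=$(wc -l < "$workdir/headdir.txt" | tr -d ' ')
echo "$dir_lines"
```

Note that the plain file notes.txt never reaches the output: its listing line starts with a dash rather than d, so grep discards it.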
If we had to describe this programming environment, we could use words like ‘minimalistic’, ‘flexible’ and ‘functional’.
The Unix shell and protocols such as “FTP” (File Transfer Protocol), “rsh” (remote shell) and, more recently, “ssh” (Secure Shell) and “HTTP” (Hypertext Transfer Protocol, the famous Web protocol that gained traction in the 1990s) grew in importance as the early Internet added compute nodes and interconnections. However, there was friction between the scientific community that used the nodes for research and nascent commercial interests. In fact, the National Science Foundation (NSF) enforced an “Acceptable Use Policy” (AUP) on its nationwide backbone, NSFNET, which came online in the mid-1980s, to ban activities not in support of research or education. So, for a while (the NSFNET backbone was defunded in 1995), relatively pure research and education projects flowed over the various interconnections. The scientific community used the flexible and minimalistic shell environment to piece together a wide array of software, for fields ranging from astrophysics to chemistry to biology and all areas in between.
Movement Towards Free and Open Standards
As software and hardware became more sophisticated, research communities and their laboratories could accomplish more and more with computing power. However, in the laboratory too there was friction between scientific inquiry and commercial motives. A very well-known example came when a printer vendor no longer supplied its source code to Richard Stallman’s MIT lab. This meant Stallman and his peers could no longer modify the printer software to do what they needed. This motivated Stallman to launch an initiative of worldwide importance, the Free Software movement.
On the project page for Stallman’s GNU operating system (launched in 1983), we read these important principles about free software:
“Free software” means software that respects users’ freedom and community. Roughly, it means that the users have the freedom to run, copy, distribute, study, change and improve the software. Thus, “free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer”.
In a very real sense, scientific research has been greatly aided by both the free software movement and the minimalistic but flexible shell environment.
Notable early free and open-source successes include Linux (actually only placed under the GNU General Public License, or GPL, in 1992), the Debian Linux distribution (1993), which explicitly committed itself to Free Software Foundation (FSF) principles, and the Apache HTTP server (released under its own Apache license rather than the GPL). The MySQL database (GPL) and the PHP Web scripting language (under its own PHP license) have since been added to this list.
The Enterprise, Big Data, and Potential for Nasty Gaffes and Pitfalls
In any large enterprise, there are also conflicts between scientific inquiry and commercial interests. In the Big Data space, commercial interests can lead the unwary into very expensive pitfalls. Several major ones come to mind.
The first major pitfall is faulty modeling. For example, a vendor might craft “business objects”, which are the vendor’s model of an enterprise business domain. The objects may be inaccurate or incomplete right off the bat, or drift away from actual conditions over time through lack of maintenance. ‘Business object’ and ‘business intelligence’ terminology in vendor advertising and promotion sounds positive, but is often an actual source of worry because the vendor modeling does not stand up over time. The artificial and unnecessary vendor object layer brings another headache: proprietary and often inefficient access techniques to read and write these objects. The headache grows as the inefficiencies multiply with the number of objects. In addition, some vendor interfaces are only good for pulling a few “objects” at a time, which is entirely unsuitable for Big Data solutions. To state a clear guideline: the job of creating and using enterprise ontologies (key snapshots of business domain realities) should be left to the enterprise and not outsourced at any step.
Vendor-Specific Standards
Another common pitfall is a vendor trying to foist a non-standard solution for data storage and retrieval on the enterprise, exploiting the fact that the Big Data space is not yet mature and has not settled on clear standards in some regards. For example, a vendor might make use of a “Big Data cluster” with a custom data format but provide only limited, strange and most definitely suboptimal ways to access the data.
‘Proprietary’ Black Box Approaches
A third, related pitfall concerns the preparation of the data as it flows from the source to the Big Data repository. Vendors with persuasive presentations might propose a clever analytics platform that requires them to take proprietary control of the Extraction/Transformation/Load (ETL) steps. This is a very unpleasant path for the organization to follow, because the ETL steps are pivotal: they transform source data feeds into the Big Data structure used for further manipulation. If the vendor owns the ETL steps, the vendor dictates the final data set used for inquiry and, in effect, the business model (ontology). Again, the organization has handed over the keys to the castle inappropriately, and it becomes hamstrung in any attempt to learn ‘bedrock truths’. In this situation, the only viable solution we have seen is to back out what the vendor did and restart, which is a total failure of technology investment.
To step back and look at corporate technology investment history, commercial banks and investment banks in particular have been victimized by odd vendor hardware and software recommendations for decades. Bizarre and obsolete hardware and byzantine software that is almost impossible to navigate are still to be found in the decaying, shrinking power niches of a large company’s backwaters. The reasons are many, for example: lots of disposable cash, lack of oversight, lack of planning, and lack of due diligence.
The pitfalls discussed and the corporate history paint an unpleasant picture of blunder after blunder, coupled with vendors all angling for a deviation from the optimum that creates a power niche for themselves in the firm in this new Big Data space. However, the correct approach is surprisingly simple if we pay attention to the historical lessons we talked about at the beginning of this essay. Big Data analysis, even in a commercial setting, should center on the principles of scientific inquiry and the freedom to represent and query the data as one sees fit (taking guidance from the FSF discussed above). The flexible and minimalistic shell environment is still a powerful metaphor in Big Data computing, and the modest pipe symbol (|) joining commands together is a metaphor for the need to interoperate. Thus, systems in the Big Data space should have frictionless boundaries (well, as frictionless as possible) and adhere to well-known standards, so the scientist can form the right questions and get the right answers without unnecessary obstacles and angst.
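To make the pipe metaphor concrete, here is a minimal sketch of ETL steps that stay in the organization’s hands, each one a small stage joined to the next by a pipe. Everything in it is an assumption made for illustration: the feed layout (id,region,amount), the file names, and the function names extract, transform and load do not come from any particular product.

```shell
export LC_ALL=C
work=$(mktemp -d)

# Hypothetical source feed (id,region,amount); the second data row lacks an amount.
cat > "$work/feed.csv" <<'EOF'
id,region,amount
1,east,10
2,west,
3,east,5
4,west,7
EOF

# Extract: read the raw feed, skipping the header row.
extract() { tail -n +2 "$1"; }

# Transform: drop rows with no amount and upper-case the region name.
transform() { awk -F, '$3 != "" {print toupper($2) "," $3}'; }

# Load: total the amounts per region and write a plain CSV file.
load() { awk -F, '{sum[$1] += $2} END {for (r in sum) print r "," sum[r]}' | sort > "$1"; }

extract "$work/feed.csv" | transform | load "$work/totals.csv"
cat "$work/totals.csv"
```

Because every stage reads and writes plain text, any one of them can be inspected, tested or replaced on its own, which is exactly the kind of boundary a proprietary ETL layer takes away.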
The Old is New Again
If we look at the curious present-day situation, we have some current technologies that look rather old. We have “Pig” scripts, which look like amalgamations of regular expressions that have been in use for decades. We have classic remote shell invocations, and we have a renaissance of non-relational data structures with the concomitant need to index them and optimize them for access. What is new is the relatively cheap cluster-computing environment, and the challenge is how to organize and query effectively the vast quantities of data floating among all these nodes.
However, in many ways it still looks like the 1980s, with remote shell invocations tossing queries over a wall to the cluster nodes. We have the recent “Map/Reduce” technique (this one also has “old” characteristics), which farms requests out to the nodes and then collates data back from them, but we have flux in the area of Big Data structure and access optimization. There are several promising Apache open-source initiatives in this space, for example Apache Accumulo and Apache Cassandra. Vendors that make use of strong open-source initiatives usually cause fewer headaches at the boundary between applications.
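The “old” character of Map/Reduce shows up if we imitate it with the shell tools discussed earlier. What follows is a single-machine analogy only, not how any real cluster framework is implemented: the “nodes” are ordinary background jobs, and the shard files and the map_shard function are hypothetical names chosen for this sketch.

```shell
export LC_ALL=C
work=$(mktemp -d)

# Two hypothetical data shards, standing in for data held on two cluster nodes.
printf 'to be or not to be\n' > "$work/shard1.txt"
printf 'to err is human\n'    > "$work/shard2.txt"

# Map: each "node" emits one "word count" line per distinct word in its shard.
map_shard() {
    tr -s ' ' '\n' < "$1" | sort | uniq -c | awk '{print $2, $1}'
}

# Farm the work out, one background job per shard, then wait for all of them.
for shard in "$work"/shard*.txt; do
    map_shard "$shard" > "$shard.counts" &
done
wait

# Reduce: collate the partial counts and sum them per word.
result=$(cat "$work"/*.counts |
    awk '{total[$1] += $2} END {for (w in total) print w, total[w]}' | sort)
echo "$result"
```

Running it prints one line per word; the line for “to” reads “to 3”, combining two occurrences from the first shard with one from the second.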
Thus, as new technologies roll down the chute and mature, the selection and the implementation of technologies in the Big Data space should hold true to the principles of scientific inquiry that researchers have enjoyed since the advent of the Internet and, before that, the advent of the Unix shell. We will always see aspects of the “old” mixed in with the new; that is a technological truism. Everything new is built on the shoulders of something old. The solutions that stand the test of time are flexible, nimble, minimalistic, and avoid proprietary pitfalls. The ‘bigness’ of Big Data merely means mistakes are more expensive to reverse and redress. Corporate decision makers are well advised to pay attention to the strongest and fittest open-source players in the Big Data market and to ask “how do they play with others?” when determining a Big Data technology stack.
This might sound simple, but so often the keys to the castle are given away when the vendor takes control of one or more of the steps leading to the Big Data repository, be it ETL, modeling, or proprietary storage and retrieval. Corporate managers and their anointed vendors rise and fall, new princes are crowned and more money is wasted. Then, bedrock truth is either obfuscated or simply not determinable. Writing on military history, Barbara Tuchman defines folly as the pursuit by government of policies contrary to its own interests, despite the availability of feasible alternatives. Paying attention to what constitutes folly in corporate vendor selection helps us avoid mistakes.
In the end, the overarching principle is simple. ‘Data science’ as it pertains to Big Data is really all about the science and giving the researchers clean, explainable, flexible tools to work on clean and explainable data.