Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information. Although the name Data Science seems to connect most strongly with areas such as databases and computer science, many different kinds of skills – including non-mathematical skills – are needed.
For some, the term “Data Science” evokes images of statisticians in white lab coats staring fixedly at blinking computer screens filled with scrolling numbers. Nothing could be further from the truth. First of all, statisticians do not wear lab coats: this fashion statement is reserved for biologists, doctors, and others who have to keep their clothes clean in environments filled with unusual fluids. Second, much of the data in the world is non-numeric and unstructured. In this context, unstructured means that the data are not arranged in neat rows and columns. Think of a web page full of photographs and short messages among friends: very few numbers to work with there. While it is certainly true that companies, schools, and governments use plenty of numeric information – sales of products, grade point averages, and tax assessments are a few examples – there is lots of other information in the world that mathematicians and statisticians look at and cringe. So, while it is always useful to have great math skills, there is much to be accomplished in the world of data science for those of us who are presently more comfortable working with words, lists, photographs, sounds, and other kinds of information.
In addition, data science is much more than simply analyzing data. There are many people who enjoy analyzing data and who could happily spend all day looking at histograms and averages, but for those who prefer other activities, data science offers a range of roles and requires a range of skills. Let’s consider this idea by thinking about some of the data involved in buying a box of cereal.
Whatever your cereal preferences – fruity, chocolaty, fibrous, or nutty – you prepare for the purchase by writing “cereal” on your grocery list. Already your planned purchase is a piece of data, albeit a pencil scribble on the back of an envelope that only you can read. When you get to the grocery store, you use your data as a reminder to grab that jumbo box of FruityChocoBoms off the shelf and put it in your cart. At the checkout line the cashier scans the barcode on your box and the cash register logs the price. Back in the warehouse, a computer tells the stock manager that it is time to request another order from the distributor, as your purchase was one of the last boxes in the store. You also have a coupon for your big box and the cashier scans that, giving you a predetermined discount. At the end of the week, a report of all the scanned manufacturer coupons gets uploaded to the cereal company so that they can issue a reimbursement to the grocery store for all of the coupon discounts it has handed out to customers. Finally, at the end of the month, a store manager looks at a colorful collection of pie charts showing all of the different kinds of cereal that were sold, and on the basis of strong sales of fruity cereals, decides to offer more varieties of these on the store’s limited shelf space next month.
So the small piece of information that began as a scribble on your grocery list ended up in many different places, but most notably on the desk of a manager as an aid to decision making. On the trip from your pencil to the manager’s desk, the data went through many transformations. In addition to the computers where the data may have stopped by or stayed on for the long term, lots of other pieces of hardware – such as the barcode scanner – were involved in collecting, manipulating, transmitting, and storing the data. In addition, many different pieces of software were used to organize, aggregate, visualize, and present the data. Finally, many different “human systems” were involved in working with the data. People decided which systems to buy and install, who should get access to what kinds of data, and what would happen to the data after its immediate purpose was fulfilled. The personnel of the grocery chain and its partners made a thousand other detailed decisions and negotiations before the scenario described above could become reality.
Obviously data scientists are not involved in all of these steps. Data scientists don’t design and build computers or barcode readers, for instance. So where would the data scientists play the most valuable role? Generally speaking, data scientists play the most active roles in the four A’s of data: data architecture, data acquisition, data analysis, and data archiving. Using our cereal example, let’s look at them one by one. First, with respect to architecture, it was important in the design of the “point of sale” system (what retailers call their cash registers and related gear) to think through in advance how different people would make use of the data coming through the system. The system architect, for example, had a keen appreciation that both the stock manager and the store manager would need to use the data scanned at the registers, albeit for somewhat different purposes. A data scientist would help the system architect by providing input on how the data would need to be routed and organized to support the analysis, visualization, and presentation of the data to the appropriate people.
Next, acquisition focuses on how the data are collected, and, importantly, how the data are represented prior to analysis and presentation. For example, each barcode represents a number that, by itself, is not very descriptive of the product it represents. At what point after the barcode scanner does its job should the number be associated with a text description of the product or its price or its net weight or its packaging type? Different barcodes are used for the same product. When should we make note that purchase X and purchase Y are the same product, just in different packages? Representing, transforming, grouping, and linking the data are all tasks that need to occur before the data can be profitably analyzed, and these are all tasks in which the data scientist is actively involved.
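The linking and grouping tasks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how any real point-of-sale system works: the barcodes, product identifiers, descriptions, and prices are invented, and the names `PRODUCT_CATALOG`, `link_scans`, and `units_by_product` are hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup table: barcode number -> product details.
# Two different barcodes share one product_id because they are the
# same cereal in different packages. All values are made up.
PRODUCT_CATALOG = {
    "0001": {"product_id": "fruity-choco",
             "description": "FruityChocoBoms, jumbo box", "price": 4.99},
    "0002": {"product_id": "fruity-choco",
             "description": "FruityChocoBoms, single-serve cup", "price": 1.49},
    "0003": {"product_id": "bran",
             "description": "Bran flakes, regular box", "price": 3.29},
}

# Placeholder details for a barcode that is not in the catalog.
UNKNOWN = {"product_id": "unknown",
           "description": "unrecognized barcode", "price": None}


def link_scans(scans):
    """Attach catalog details to each raw barcode number."""
    linked = []
    for barcode in scans:
        details = PRODUCT_CATALOG.get(barcode, UNKNOWN)
        linked.append({"barcode": barcode, **details})
    return linked


def units_by_product(linked):
    """Count units sold per product, ignoring package type."""
    counts = defaultdict(int)
    for record in linked:
        counts[record["product_id"]] += 1
    return dict(counts)


if __name__ == "__main__":
    scans = ["0001", "0002", "0003", "0001"]
    for record in link_scans(scans):
        print(record)
    print(units_by_product(link_scans(scans)))
```

Grouping on `product_id` rather than on the barcode is what lets purchase X and purchase Y count as the same product even though they came in different packages.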
The analysis phase is where data scientists are most heavily involved. In this context we are using analysis to include summarization of the data, using portions of data (samples) to make inferences about the larger context, and visualization of the data by presenting it in tables, graphs, and even animations. Although there are many technical, mathematical, and statistical aspects to these activities, keep in mind that the ultimate audience for data analysis is always a person or people. These people are the “data users” and fulfilling their needs is the primary job of a data scientist. This point highlights the need for excellent communication skills in data science. The most sophisticated statistical analysis ever developed will be useless unless the results can be effectively communicated to the data user.
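As a rough illustration of the first two activities, the Python sketch below summarizes a set of values and then uses a random sample to estimate the mean of the larger set. It uses only the standard library. The sales figures are invented, and the function names `summarize` and `estimate_mean_from_sample` are hypothetical.

```python
import random
import statistics


def summarize(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def estimate_mean_from_sample(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns the sample mean and its standard error, a rough measure
    of how far the estimate may be from the true mean.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    sample = rng.sample(population, sample_size)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / sample_size ** 0.5
    return sample_mean, standard_error


if __name__ == "__main__":
    # Made-up daily counts of cereal boxes sold over 30 days.
    daily_sales = [12, 15, 9, 20, 18, 11, 14, 16, 13, 17,
                   10, 19, 15, 12, 14, 21, 13, 16, 11, 18,
                   15, 14, 12, 17, 19, 10, 16, 13, 15, 14]
    print(summarize(daily_sales))
    print(estimate_mean_from_sample(daily_sales, sample_size=10))
```

Numbers like these still need to be translated into terms the data user cares about, which is the communication point made above.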
Finally, the data scientist must become involved in the archiving of the data. Preservation of collected data in a form that makes it highly reusable – what you might think of as “data curation” – is a difficult challenge because it is so hard to anticipate all of the future uses of the data. For example, when the developers of Twitter were working on how to store tweets, they probably never anticipated that tweets would be used to pinpoint earthquakes and tsunamis, but they had enough foresight to realize that “geocodes” – data that show the geographical location from which a tweet was sent – could be a useful element to store with the data.
Taken together, these roles call for a range of skills:
- Learning the application domain – The data scientist must quickly learn how the data will be used in a particular context.
- Communicating with data users – A data scientist must possess strong skills for learning the needs and preferences of users. Translating back and forth between the technical terms of computing and statistics and the vocabulary of the application domain is a critical skill.
- Seeing the big picture of a complex system – After developing an understanding of the application domain, the data scientist must imagine how data will move around among all of the relevant systems and people.
- Knowing how data can be represented – Data scientists must have a clear understanding of how data can be stored and linked, as well as of “metadata”.
- Data transformation and analysis – When data become available for the use of decision makers, data scientists must know how to transform, summarize, and make inferences from the data. As noted above, being able to communicate the results of analysis to users is also a critical skill here.