Describe the challenges of the current analytical architecture for Data Scientists. What are the key skill sets and behavioral characteristics of a Data Scientist? In which phase would you expect to spend the most time, and in which the least?

Data Scientist



Data scientists are rapidly becoming major contributors in today's businesses; however, many challenges and obstacles still exist. Unlike Business Intelligence (BI), which reports on historical data, data science (also known as predictive analytics and data mining) involves mining big data, providing what-if analysis, predicting trends and forecasts, and answering why things are happening. Because of the fundamental differences between the two, BI technologies and architectures will not work properly on big data. Big data is defined as a large amount of data that requires new technologies and architectures so that it becomes possible to extract value from it through capture and analysis (Katal et al. 2013).

Companies store their data in an enterprise data warehouse (EDW), which is managed and protected centrally by IT. However, the data in an EDW must be well structured and normalized with appropriate data type definitions (Dietrich et al. 2016), which requires lengthy data preprocessing and cleansing. To gain the flexibility data scientists need, local data marts may emerge to bypass these lengthy processes, resulting in systems that are not centrally managed and whose security is compromised.
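The kind of cleansing an EDW demands can be sketched in a few lines. The schema rules below (field names, date format, rejection policy) are hypothetical, purely to illustrate why raw records need preprocessing before they meet the warehouse's type definitions:

```python
from datetime import datetime

def cleanse(records):
    """Coerce raw records into a consistent, typed structure,
    dropping any row that cannot satisfy the (hypothetical) schema."""
    cleaned = []
    for rec in records:
        try:
            cleaned.append({
                "customer_id": int(rec["customer_id"]),
                # normalize stray whitespace, enforce one date format
                "signup_date": datetime.strptime(
                    rec["signup_date"].strip(), "%Y-%m-%d").date(),
                "amount": round(float(rec["amount"]), 2),
            })
        except (KeyError, ValueError):
            continue  # reject records that violate the schema
    return cleaned

raw = [
    {"customer_id": "101", "signup_date": " 2016-03-01", "amount": "19.90"},
    {"customer_id": "bad", "signup_date": "2016-03-02", "amount": "5.00"},
]
print(cleanse(raw))  # only the first record survives
```

Multiplied across millions of rows and many source systems, this per-record validation is what makes EDW loading so lengthy, and why analysts are tempted to bypass it.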

Big data can be characterized by volume, velocity, and variety (Dietrich et al. 2016). Volume refers to the size of the data, velocity to the rapid rate of data creation, and variety to the different data structures that exist within the data. As the name suggests, big data is big: think petabytes and exabytes. Because of its size, speed of creation, and varied structures, retrieving relevant information from it is time-consuming, making high-quality, high-value data hard to reach and leverage.

To maximize business impact, “time-to-insight” must be short; however, a typical data science architecture today does not promote prompt data availability. Data mining activities usually take place only after prioritized, core processes have been completed. In addition, data moves in batches from the EDW to local analytical tools, causing further delays in availability. Because of the fast-moving nature of big data, in the worst case the analysis results may be outdated by the time they reach end users.

Very often, because analytic projects are not centrally managed, they tend to become non-standard initiatives that are not aligned with business goals, reducing their impact on the business.

The variety of data structures in big data increases the effort required to mine it. Moreover, most of this data is unstructured, such as data from social media, or quasi-structured, with inconsistent data values and formats. Such data is difficult to mine and requires new tools to process it effectively.

To optimize the analysis process, analytic sandboxes are needed, because these datasets of varying structure will not work well within a traditional EDW. An analytic sandbox is a workspace owned by the data analyst rather than the database administrator. It enables flexible, high-performance analysis in a non-production environment, using data gathered from multiple sources, and supports large-scale analytical data experiments (Dietrich et al. 2016). A sandbox can contain a mixture of data types, such as raw data, aggregated data, unstructured data, and more. Because of data transformation, an analytic sandbox can expand to five to ten times the original size of the datasets.
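A toy sketch of why a sandbox grows: alongside the raw data it holds the derived, transformed copies built during analysis. The workspace below and its contents are entirely hypothetical, just to show raw, aggregated, and unstructured data coexisting:

```python
from collections import defaultdict

# Hypothetical sandbox: one analyst-owned workspace holding raw
# data and its derived copies side by side.
sandbox = {}

# Raw transactional data pulled from a source system (made-up values).
sandbox["raw_sales"] = [
    {"region": "east", "amount": 120.0},
    {"region": "east", "amount": 80.0},
    {"region": "west", "amount": 200.0},
]

# Derived, aggregated view built inside the sandbox for analysis;
# it duplicates information already present in the raw table.
totals = defaultdict(float)
for row in sandbox["raw_sales"]:
    totals[row["region"]] += row["amount"]
sandbox["sales_by_region"] = dict(totals)

# Unstructured notes can sit alongside the structured tables.
sandbox["analyst_notes"] = "east spike may relate to a promotion"

print(sandbox["sales_by_region"])
```

Each transformation adds another copy of (some of) the data, which is how the workspace swells to several times the size of its inputs.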

Per Dietrich et al. (2016), there are five key skill sets and behavioral characteristics of a data scientist:

  • Quantitative: math and statistics skills.

  • Technical aptitude: computer programming, engineering, machine learning and database administration skills.

  • Skeptical mind-set and critical thinking: can apply critical thinking to examine their work from all angles.

  • Curious and creative: find creative ways to solve problems.

  • Communicative and collaborative: able to articulate business value clearly and work collaboratively with others.

Data scientists are expected to spend most of their time in data preparation. This includes exploring, preprocessing, and conditioning data, as well as preparing an analytic sandbox and performing Extract, Transform, Load, Transform (ETLT) from the EDW to the sandbox. This ensures there is enough good-quality data to start building the model. As Lane (n.d.) pointed out, a small sample is much more likely to produce extreme values, skewing the results and reducing the accuracy of the analysis; therefore, a large sample of quality data is important for accuracy.
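Lane's point about small samples can be demonstrated with a quick simulation. The population parameters, threshold, and trial counts below are arbitrary choices for illustration: small samples produce "extreme" means far more often than large ones drawn from the same population.

```python
import random
import statistics

random.seed(42)
# Synthetic population: mean ~100, standard deviation ~15.
population = [random.gauss(100, 15) for _ in range(100_000)]
mu = statistics.mean(population)

def extreme_rate(sample_size, trials=2000, threshold=10):
    """Fraction of sample means landing more than `threshold`
    away from the population mean (illustrative parameters)."""
    hits = 0
    for _ in range(trials):
        m = statistics.mean(random.sample(population, sample_size))
        if abs(m - mu) > threshold:
            hits += 1
    return hits / trials

small = extreme_rate(5)    # tiny samples: extreme means are common
large = extreme_rate(500)  # large samples: extreme means are rare
print(small, large)
```

The small-sample rate comes out far higher, which is exactly why data preparation aims to assemble a large volume of quality data before modeling begins.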

On the other hand, they are expected to spend the least time processing the data: with advanced big data tools such as Hadoop and MongoDB, data can be processed with relative ease. For example, The New York Times converted its entire public archive from 1851 to 1922 into 11 million PDF files in 24 hours using Hadoop.
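The programming model that lets Hadoop parallelize work like the Times conversion can be sketched in plain Python. This is only a single-process illustration of the map/shuffle/reduce idea (here, a word count over made-up documents), not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (word, 1) pair for every word in one document
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data needs new tools", "new tools for big data"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
print(reduce_phase(pairs))
```

Because each document is mapped independently, a framework like Hadoop can fan the map phase out across thousands of machines, which is what made processing millions of archive pages in a day feasible.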
