
Running Head: DATA PROCESSING

Data processing using Spark

Student Name

Instructor Name

Course

Date

Background

Data processing is the collection and translation of data into usable information, usually performed by a team of data science specialists. It is crucial that data be processed correctly so that the quality of the output is not negatively affected. This matters because companies stand to learn a great deal from processing their data and from deciding what to do with the processed results. Data processing typically starts with data in its raw form, which is then converted into a more readable format, giving it the context needed for it to be interpreted by computers and used by the workers in an organization.

What occurs in data processing using Spark?

Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. A Spark application runs as a set of independent processes on a cluster, coordinated by the SparkSession object in the driver program ("Spark 101: What Is It, What It Does, and Why It Matters | MapR", 2019). The cluster manager assigns one task per partition to the workers; each task applies its unit of work to the dataset within its partition and outputs a new dataset partition. The results are then saved to disk or sent back to the driver. Spark can also run on Kubernetes, the open-source system for automating deployment of containerized applications, on Apache Hadoop YARN, and on Apache Mesos, a general cluster manager that can also run Hadoop applications ("Spark 101: What Is It, What It Does, and Why It Matters | MapR", 2019).
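To make this execution model concrete, the following is a minimal PySpark sketch, assuming a working Spark installation; the input and output paths are hypothetical. The SparkSession created in the driver coordinates the job, the input is split into partitions, and one task per partition transforms its slice before the result is written back to distributed storage.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Each partition of the input file becomes one task on a worker; the
    # tasks run in parallel, each producing a partition of the result.
    lines = spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # The result can be saved to distributed storage or sent back to the driver.
    counts.saveAsTextFile("hdfs:///data/word_counts")  # hypothetical path
    spark.stop()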

Spark Standalone, the cluster manager included with Spark, can also handle many petabytes of data distributed over clusters of thousands of physical or virtual servers ("Spark 101: What Is It, What It Does, and Why It Matters | MapR", 2019). Spark typically works with distributed data stores such as Hadoop's HDFS, MapR XD, and Amazon's S3, and with popular NoSQL databases such as Apache HBase, Apache Cassandra, MongoDB, and MapR Database. Spark supports languages such as Python, R, Scala, and Java, and it works with distributed messaging stores such as MapR Event Store and Apache Kafka. Its typical uses include stream processing, interactive analytics, and machine learning ("Spark 101: What Is It, What It Does, and Why It Matters | MapR", 2019).
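As a brief sketch of this storage flexibility, the DataFrame read below targets Amazon S3 simply by using an s3a:// URI; the bucket, file, and column names are hypothetical, and the example assumes the S3 connector (hadoop-aws) is available on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StorageExample").getOrCreate()

    # The same read API works against HDFS, S3, and other distributed
    # stores; only the URI scheme changes.
    df = spark.read.csv("s3a://my-bucket/events.csv", header=True, inferSchema=True)
    df.groupBy("event_type").count().show()  # "event_type" is a hypothetical column

    spark.stop()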

Problems associated with data processing using Spark

Despite Spark being a great framework for building applications that process data, several problems are associated with it ("Apache Spark Architecture, Use Cases, and Issues", 2019). One is poor documentation: good walkthroughs in the literature are vital for bringing new users up to speed ("Challenges with processing data in real-time using conventional Big Data solutions," 2019). Memory is another important issue that needs attention: although Spark is built to process massive chunks of data and usually works well in normal usage, memory consumption still has to be measured and monitored. Tricky deployment is a further problem encountered when shipping an application. Building the dependencies can cause hiccups, and if they are deployed incorrectly the application may work only in standalone mode, resulting in classpath exceptions when run on a cluster ("Challenges with processing data in real-time using conventional Big Data solutions," 2019).
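Because memory has to be measured and tuned explicitly, the sketch below shows one way executor memory settings can be supplied when the session is built; the values are illustrative, not recommendations.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("TunedJob")
             # Heap available to each executor JVM.
             .config("spark.executor.memory", "4g")
             # Fraction of the heap shared by execution and storage; the
             # remainder is left for user objects and internal metadata.
             .config("spark.memory.fraction", "0.6")
             .getOrCreate())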

Problems with processing data using Spark become noticeable when the Spark executors are not in a position to handle massive data, which can result from a number of cores and a degree of parallelism that are unable to cope with a large amount of data ("ODI Summit 2019 – The ODI", 2019). I also noticed the problem of improper functioning of memory, resulting from a lack of space to perform system operations and garbage collection in the Spark executor instances. The performance of the application is likewise a clear indication that there is a problem somewhere in the system: low memory leads to hiccups and to hanging during the operation of the application ("ODI Summit 2019 – The ODI", 2019).
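One way to spot and correct the parallelism problem described above is to inspect and adjust the partition count, as in this sketch; the path and partition number are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ParallelismCheck").getOrCreate()

    df = spark.read.parquet("hdfs:///data/large-table/")  # hypothetical path

    # Too few partitions means each core must process an oversized slice.
    print("partitions:", df.rdd.getNumPartitions())

    # Repartitioning spreads the work across more tasks so the available
    # executor cores are actually used.
    df = df.repartition(200)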

The extent of the problem

There are various approaches by which systems deal with real-time data before it is stored in a database ("5 reasons why Spark Streaming's batch processing of data streams is not stream processing", 2019). For instance, Apache Storm and Apache Spark take different approaches to processing a data stream. Apache Spark has emerged as the de facto framework in big data analytics (Pointer, 2019). Various aspects have shaped a mixed picture of processing data using Spark. One of them is the small-files problem associated with Hadoop's HDFS, which can leave you with an RDD of millions of tiny partitions (Pointer, 2019).
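A common mitigation for the small-files problem, sketched below with hypothetical paths, is to compact the data: coalesce merges the many tiny partitions into a manageable number before writing the output.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CompactSmallFiles").getOrCreate()

    # Reading a directory of many tiny files yields a DataFrame with as many
    # tiny partitions; coalesce merges them without a full shuffle.
    df = spark.read.json("hdfs:///logs/many-small-files/")  # hypothetical path
    df.coalesce(16).write.mode("overwrite").parquet("hdfs:///logs/compacted/")

    spark.stop()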

Spark Streaming is another problem area when processing data using Spark (Pointer, 2019). This extension to the Spark API is the whale that turns the majority of developers into Ahab: it only works well when the block interval is tuned so that the pipeline keeps flowing, and standing up a streaming job with Spark that runs at scale, 24/7, is a different matter entirely. Memory problems are a further calamity experienced in data processing with Spark; the issues here are myriad rather than a single perennial one, and they tend to surface over time. Switching from Spark standalone to YARN or Mesos can give you whiplash, since the move changes a whole set of defaults and demands knowledge of arcane configuration options. Random crazy errors are another problem, causing unnecessary stoppage of the application, with logs full of entries pointing to compression and decompression on the shuffle stages. I tracked some of these issues down to the interaction of Spark's networking and transport system (Pointer, 2019).
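For reference, a minimal Spark Streaming job looks like the sketch below; the socket source and five-second batch interval are illustrative. The batch interval is the knob discussed above: choosing it poorly is what lets micro-batches arrive faster than they drain.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamSketch")

    # Micro-batches are formed every 5 seconds; this interval paces the
    # whole pipeline.
    ssc = StreamingContext(sc, 5)

    lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()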

The need for changes in processing data using Spark

For quality output from data processing, some changes are vital for the best results. These changes need to focus on storage and on exposing two aspects of the data transparently ("Apache Spark - Deep Dive into Storage Formats," 2019). The first is the data schema: what fields exist in each data record, how many there are, and what the data types of those fields are. The second is the usage pattern: understanding what kind of operations the application tries to perform and on which fields of the data record they are done ("Apache Spark - Deep Dive into Storage Formats," 2019).
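A sketch of the first aspect follows: declaring the schema explicitly so Spark knows up front how many fields each record has and their types. The field names and path are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("SchemaSketch").getOrCreate()

    # The schema documents how many fields each record has and their types,
    # and spares Spark an inference pass over the data.
    schema = StructType([
        StructField("user_id", IntegerType(), nullable=False),
        StructField("event", StringType(), nullable=True),
    ])

    df = spark.read.schema(schema).json("hdfs:///events/")  # hypothetical path
    df.printSchema()

    spark.stop()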

References

Apache Spark Architecture, Use Cases, and Issues. (2019). Retrieved from https://blog.panoply.io/apache-spark.-promises-and-challenges

5 reasons why Spark Streaming's batch processing of data streams is not stream processing. (2019). Retrieved from https://sqlstream.com/5-reasons-why-spark-streamings-batch-processing-of-data-streams-is-not-stream-processing/

Apache Spark - Deep Dive into Storage Formats. (2019). Retrieved from https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html

ODI Summit 2019 – The ODI. (2019). Retrieved from https://theodi.org/event/odi-summit-2019/?gclid=EAIaIQobChMIp_r5vZqq4wIVxp3tCh1HiwADEAAYAiAAEgIwxfD_BwE

Challenges with processing data in real-time using conventional Big Data solutions. (2019). Retrieved from https://codelook.com/challenges-with-processing-data-in-real-time-using-conventional-big-data-solutions-bb602b33da0c