Data Science Tools

Updated: 03 September 2023

Cognitive Class Labs

This course uses the lab environment at Cognitive Class Labs (CCL)

CCL also stores a reference to the Apache Spark context in the sc variable

Jupyter Notebooks

Open-source notebooks for data exploration and visualization; they let you interact with data programmatically and view output and documentation in the same place

Jupyter Lab is an environment for organizing Jupyter Notebooks

Jupyter Notebooks can be written in a variety of languages; CCL has Python, Scala, and R built in, and kernels are available for other languages

Scrolling to the bottom of the main page after opening the Jupyter Lab tool shows the available tutorials

In Jupyter Notebooks on CCL, a reference to the Apache Spark context is stored in the sc variable
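
As a minimal sketch of how the sc variable might be used from a Python notebook cell, the helper below distributes a list, squares each element, and counts the even results. The function name count_even_squares is hypothetical and only for illustration; it assumes sc is the Spark context that CCL has already created, so nothing is imported or configured here.

```python
def count_even_squares(sc, numbers):
    """Square each number with Spark and count how many squares are even.

    `sc` is assumed to be the Spark context that CCL predefines.
    """
    rdd = sc.parallelize(numbers)        # distribute the local collection as an RDD
    squares = rdd.map(lambda n: n * n)   # transformation, evaluated lazily
    evens = squares.filter(lambda n: n % 2 == 0)
    return evens.count()                 # action: triggers the actual computation
```

In a notebook cell this could be called as count_even_squares(sc, range(1, 11)), which should return 5 because only the even inputs produce even squares.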

Zeppelin Notebooks

Open-source, multi-purpose notebooks built on the Zeppelin interpreter concept, which allows any language or data processing backend to be plugged into Zeppelin

If we look at the main page when opening the Zeppelin Notebooks tool we can see the available tutorials

Zeppelin Interpreters allow Zeppelin to use different languages for interacting with data

The Zeppelin Context allows us to exchange objects between different languages

Zeppelin also has a built-in integration with Spark, thereby allowing you to run Spark code
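
The object exchange described above can be sketched from the Python side as follows. This assumes the code runs in a Zeppelin Python or PySpark paragraph, where the Zeppelin Context is exposed as the variable z with put and get methods; the function name share_row_count and the key row_count are hypothetical names chosen for this example.

```python
def share_row_count(z, rows):
    """Publish a simple value through the Zeppelin Context and read it back.

    `z` is assumed to be the Zeppelin Context object that Zeppelin
    provides inside a paragraph.
    """
    z.put("row_count", len(rows))   # store the value under a shared key
    return z.get("row_count")       # any paragraph can read it with the same key
```

A paragraph written in another language, such as Scala, could then read the same value with z.get("row_count"). Simple values such as numbers and strings are the safest things to share this way.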

In Zeppelin on CCL, a reference to the Apache Spark context is stored in the sc variable

RStudio IDE

An open-source online version of RStudio that lets you write code, manage packages, use the console, view data, and see visualizations

RStudio IDE also has a built-in integration with Spark, thereby allowing you to run Spark code

In RStudio on CCL, a reference to the Apache Spark context is stored in the sc variable

Seahorse

A way to define data science pipelines without needing to write any code, using a visual approach to programming

Seahorse still allows us to write our own Python transformations when needed

Seahorse uses Apache Spark as its backend and provides a variety of machine learning and data operations; we can also explore these with Jupyter Notebooks and deploy a pipeline as a Spark application
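
A custom Python transformation of this kind might look like the sketch below. It assumes that Seahorse expects a function named transform that receives a Spark DataFrame and returns a Spark DataFrame; that convention is an assumption here and should be checked against the Seahorse documentation. The cleaning steps themselves are only an illustration.

```python
def transform(dataframe):
    """Return a cleaned copy of the incoming Spark DataFrame.

    Assumption: Seahorse calls this function with the DataFrame that
    flows into the transformation node and uses the returned DataFrame
    as the node's output.
    """
    deduplicated = dataframe.dropDuplicates()  # remove exact duplicate rows
    return deduplicated.dropna()               # drop rows containing missing values
```

Both dropDuplicates and dropna are standard PySpark DataFrame methods, so the same function can be tried out in a Jupyter notebook before being pasted into Seahorse.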

The CCL Seahorse instance is not to be used for production

OpenRefine

OpenRefine was developed for dealing with messy data

Previously known as Google Refine, it is a tool for transforming and preprocessing data, as well as performing other data refinery processes