Data Science Tools
Updated: 03 September 2023
Based on this Cognitive Class Course
Cognitive Class Labs
This course uses the lab environment at Cognitive Class Labs (CCL)
CCL also stores a reference to the Apache Spark context in the sc variable
Jupyter Notebooks
Open source notebooks for data exploration and visualization that let you interact with data programmatically and view output and documentation in the same place
Jupyter Lab is an environment that organizes Jupyter Notebooks
Jupyter Notebooks can be written in a variety of languages; CCL has Python, Scala, and R built in, and kernels are available for other languages
If we scroll to the bottom of the main page when opening the Jupyter Lab tool we can see the available tutorials
CCL also stores a reference to the Apache Spark context in the sc variable
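As a rough illustration of what the predefined sc variable is for, the sketch below distributes a small list and sums its squares. The function name sum_of_squares and the sample numbers are made up for this example, and the final guard assumes the notebook kernel has already injected sc; outside such an environment the guarded lines are simply skipped.

```python
def sum_of_squares(sc, numbers):
    """Distribute `numbers` with the given Spark context and sum their squares."""
    rdd = sc.parallelize(numbers)          # turn a local collection into an RDD
    return rdd.map(lambda x: x * x).sum()  # square each element, then add them up


# Assumption: in a CCL notebook `sc` is already defined by the environment.
# Elsewhere the name does not exist and this block does nothing.
if "sc" in globals():
    print(sum_of_squares(sc, range(1, 6)))
```

Passing sc in as an argument rather than reading the global inside the function keeps the helper usable with any Spark context, not only the one CCL provides.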
Zeppelin Notebooks
Open source multi-purpose notebooks built on the Zeppelin interpreter concept, which allows any language and data processing backend to be plugged into Zeppelin
If we look at the main page when opening the Zeppelin Notebooks tool we can see the available tutorials
Zeppelin Interpreters allow Zeppelin to use different languages for interacting with data
Zeppelin Context allows us to exchange objects between different languages
Zeppelin also has a built-in integration with Spark, which allows you to run Spark code
CCL also stores a reference to the Apache Spark context in the sc variable
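The object exchange through Zeppelin Context mentioned above might look roughly like this from a Python paragraph. It is a sketch that assumes the interpreter injects the context as z with put and get methods; the helper names publish and fetch and the key row_limit are hypothetical, and which languages can actually see the shared value depends on how the interpreters are configured.

```python
def publish(z, name, value):
    """Store `value` under `name` so paragraphs in other languages can read it."""
    z.put(name, value)


def fetch(z, name):
    """Read back an object that some paragraph stored under `name`."""
    return z.get(name)


# Assumption: inside a Zeppelin paragraph the context is available as `z`.
# Outside Zeppelin the name does not exist and this block does nothing.
if "z" in globals():
    publish(z, "row_limit", 100)
    print(fetch(z, "row_limit"))
```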
RStudio IDE
An open source online version of RStudio that lets you write code, manage packages, use the console, view data, and see visualizations
RStudio IDE also has a built-in integration with Spark, which allows you to run Spark code
CCL also stores a reference to the Apache Spark context in the sc variable
Seahorse
A way to define data science pipelines without writing any code, using a visual approach to programming
Seahorse still allows us to write our own Python transformations when needed
Seahorse uses Apache Spark as a backend and provides a variety of machine learning and data operations; we can also explore these with Jupyter Notebooks and deploy as a Spark application
The CCL Seahorse instance is not to be used for production
OpenRefine
OpenRefine was developed for dealing with messy data
Previously known as Google Refine, it is a tool for transforming and preprocessing data as well as performing data refinery processes
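To give a feel for the kind of cleanup this involves, here is a small plain-Python sketch that groups differently spelled variants of the same value. It is loosely modeled on key-collision (fingerprint) clustering and is only an approximation for illustration, not OpenRefine's own implementation; the function names and sample values are made up.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a messy string to a normalized key so that variants collide."""
    # Fold accented characters to plain ASCII where possible
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Trim, lowercase, and drop ASCII punctuation
    cleaned = ascii_text.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate tokens so word order and repeats do not matter
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    cities = ["New York", "new york ", "York, New", "Boston"]
    print(cluster(cities))
```

Each group returned by cluster is a candidate set of values to merge into one canonical spelling; a person would normally review the groups before merging.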