An Introduction to Object Oriented Data Science in Python

A lot of focus in the data science community is on reducing the complexity and time involved in data gathering, cleaning, and organization. This article discusses how object oriented design techniques from software engineering can be used to reduce coding overhead and create robust, reusable data acquisition and cleaning systems. I’ll provide an overview of object oriented design and walk through an example of using these techniques to get and clean data from a web API in Python. You can find the Jupyter Notebook for this post on GitHub.

Object Oriented Design

Much of modern software engineering leverages the principles of Object Oriented Design (OOD), also called object oriented programming (OOP), to create codebases that are easy to scale, test, and maintain. As the name suggests, this programming paradigm is centered on thinking of code in terms of objects. An object encapsulates data, attributes, and methods relating to a specific entity. For instance, a Cat object might have the methods play, eat, and scratch. An attribute of Cat could be name, and the value of the name attribute is the data for a particular Cat object.
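As a minimal sketch of the Cat example, the class below bundles one attribute with the three methods named above. The method bodies and the name “Whiskers” are made up purely for illustration.

```python
class Cat:
    """A toy object bundling data (its name) with behavior (its methods)."""

    def __init__(self, name):
        # 'name' is an attribute; the value passed in is this object's data.
        self.name = name

    def play(self):
        return f"{self.name} is playing"

    def eat(self):
        return f"{self.name} is eating"

    def scratch(self):
        return f"{self.name} is scratching"


# Each instance carries its own data but shares the same methods.
whiskers = Cat("Whiskers")
print(whiskers.play())
```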

If you are unfamiliar with object oriented design in Python, you can read about it in the Python Tutorial. As we walk through the code examples I’ll keep things at a high level. If you are interested in the specifics of Python class design (for example, “what is this ‘self’ parameter?”), check out the tutorial for more information.

Working with Classes

Let’s consider creating an object to encapsulate a data source. Typical activities for interacting with a data source include extracting the data from its source, perhaps a database or API, followed by cleaning and formatting the data for use in an analysis of some kind. Looking at this from an object oriented point of view, we could create an object Data with methods ‘extract’ and ‘clean’ and a DataFrame attribute for the cleaned and formatted dataset. For this example we will use the Recreation Information Database (RIDB) API, a service for finding recreation opportunities in the US. Here is an example of such a Data class, RidbData, using the RIDB:
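A minimal sketch of such a class might look like the following. It assumes the requests and pandas libraries, a JSON response whose records sit under the ‘RECDATA’ key, and latitude / longitude columns named FacilityLatitude and FacilityLongitude. Those column names and the URL in the closing comment are assumptions to check against the RIDB documentation, not something this post confirms.

```python
import numpy as np
import pandas as pd
import requests


class RidbData:
    """Encapsulates one RIDB endpoint: how to fetch it and how to clean it."""

    def __init__(self, name, endpoint, url_params):
        self.name = name              # label used to tell objects apart
        self.endpoint = endpoint      # full URL of the endpoint to query
        self.url_params = url_params  # dict of query parameters, e.g. the API key
        self.df = pd.DataFrame()      # filled in by extract()

    def extract(self):
        """Query the endpoint and load the JSON records into self.df."""
        response = requests.get(self.endpoint, params=self.url_params)
        response.raise_for_status()
        self.df = pd.DataFrame(response.json()["RECDATA"])

    def clean(self):
        """Replace empty strings with NaN and drop rows without a location."""
        self.df = self.df.replace("", np.nan)
        # Assumed column names for the facilities response.
        self.df = self.df.dropna(subset=["FacilityLatitude", "FacilityLongitude"])


# Example use (needs a real API key, so it is left commented out;
# the URL and the 'apikey' parameter name are assumptions):
# facilities = RidbData("facilities",
#                       "https://ridb.recreation.gov/api/v1/facilities",
#                       {"apikey": "YOUR_API_KEY"})
# facilities.extract()
# facilities.clean()
```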

To set up the object we need a constructor method, __init__, to set the attributes of our object and perform any initialization routines. RIDB has a variety of endpoints, so when creating the RidbData object we specify which endpoint to query in the ‘endpoint’ parameter, and pass any URL parameters we need, such as our RIDB API key, in the ‘url_params’ parameter. The name attribute will help us identify this object later on when we are working with multiple RidbData objects.

In our ‘extract’ method we query the endpoint and load the JSON response into the DataFrame attribute ‘df’. Our ‘clean’ method inserts NaN in place of empty strings and drops any entries that don’t have latitude / longitude values, since we aren’t interested in facilities that don’t have a location.

So at this point you may be wondering why we wrote all that code instead of simply writing a function:
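The function alternative could be as small as the sketch below. The name get_ridb_data is hypothetical, and the code assumes the requests and pandas libraries and the ‘RECDATA’ response key.

```python
import pandas as pd
import requests


def get_ridb_data(endpoint, url_params):
    """Fetch one RIDB endpoint and return its records as a DataFrame."""
    response = requests.get(endpoint, params=url_params)
    response.raise_for_status()
    return pd.DataFrame(response.json()["RECDATA"])
```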

I’m glad you asked!

If all the RIDB endpoints had the same response and endpoint configuration, a function like this would work fine. The time to consider using object oriented techniques is when you find yourself writing a lot of specialized functions and ‘if’ statements to make small tweaks to your code for special cases. For example, when we get data from the facilities endpoint we want to drop any records that do not have latitude and longitude, whereas when we query the media endpoint we decide to capture only the image data:
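Two such specialized functions might look like the sketch below. The name get_ridb_facilities is hypothetical, and the column names FacilityLatitude, FacilityLongitude, and MediaType, along with the “Image” value, are assumptions about the RIDB response.

```python
import numpy as np
import pandas as pd
import requests


def get_ridb_facilities(endpoint, url_params):
    """Facilities: keep only records that have a latitude and longitude."""
    response = requests.get(endpoint, params=url_params)
    response.raise_for_status()
    df = pd.DataFrame(response.json()["RECDATA"])
    df = df.replace("", np.nan)
    return df.dropna(subset=["FacilityLatitude", "FacilityLongitude"])


def get_ridb_facility_media(endpoint, url_params):
    """Media: keep only the image records."""
    response = requests.get(endpoint, params=url_params)
    response.raise_for_status()
    df = pd.DataFrame(response.json()["RECDATA"])
    return df[df["MediaType"] == "Image"]
```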

Notice that we have some lines that are the same between the two functions:

Another best practice in programming is called the DRY principle: Don’t Repeat Yourself. You may be saying “Pfft! It’s just three lines!” But what if RIDB changes their response format from ‘RECDATA’ to ‘RECDAT’? Then you have to track down every instance of ‘RECDATA’ and replace it. Also, consider that three lines is 75% of the code for the ‘get_ridb_facility_media’ function.

Let’s look at querying these two endpoints using the RidbData object we created above. For the facilities endpoint, we are pretty much ready to go. Just plug in our endpoint and the object methods will take care of the rest:

We can then use this object to get and clean data for any similar endpoint, such as campsites:

Now, if the RIDB API changes its ‘RECDATA’ record name, we just have to update the code in one place: the RidbData class. Using one class to generate multiple instances, as in the example above, is a hallmark of the OOD paradigm.

Extending Classes

One of the principles of OOD is the open/closed principle: classes should be closed for modification, but open for extension. This means that once a class is complete, tested, and verified to be working as expected, we want to set it aside as done. Any time you touch a piece of code you create the possibility of new bugs. By following the open/closed principle we reduce the likelihood of bugs, and we can also assure others that a class is safe to extend and use, since it hasn’t been modified.

So how do we modify the functionality of an existing class? In our example, we have a media endpoint that requires a different clean method than the one we coded in the RidbData class. We can extend the RidbData class and provide a new clean method, while inheriting the functionality of the constructor and extract methods.

When we create the new class, RidbMediaData, we pass the RidbData class to the class definition. This indicates that RidbMediaData is a derived class of the base class RidbData. RidbData would also be called the superclass of RidbMediaData. Mostly this is to inform you of the language used around this construct; the takeaway is that RidbMediaData inherits the methods and attributes of the RidbData class, so it doesn’t have to implement the __init__ constructor method or the extract method. It gets those implementations from RidbData. The only things you need to implement in a derived class are the methods or attributes that differ from those of the base class.
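A sketch of the derived class follows. The base class is repeated in compact form only so the snippet runs on its own, and the MediaType column and its “Image” value are assumptions about the media response.

```python
import numpy as np
import pandas as pd
import requests


class RidbData:
    """Base class, repeated compactly so this snippet is self-contained."""

    def __init__(self, name, endpoint, url_params):
        self.name = name
        self.endpoint = endpoint
        self.url_params = url_params
        self.df = pd.DataFrame()

    def extract(self):
        response = requests.get(self.endpoint, params=self.url_params)
        response.raise_for_status()
        self.df = pd.DataFrame(response.json()["RECDATA"])

    def clean(self):
        self.df = self.df.replace("", np.nan)
        self.df = self.df.dropna(subset=["FacilityLatitude", "FacilityLongitude"])


class RidbMediaData(RidbData):
    """Media endpoint: inherits __init__ and extract, overrides clean."""

    def clean(self):
        # Keep only image records ('MediaType' is an assumed column name).
        self.df = self.df[self.df["MediaType"] == "Image"]
```

Because only clean is redefined, calling extract on a RidbMediaData object runs the base class implementation unchanged.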

Using our new class to query the media data endpoint:

Hmm. Well, that’s pretty good, but what if we wanted to get the media for all facilities? Should we create a new object for every facility? That doesn’t seem like a good use of resources. Instead, we can also provide a new implementation of the extract method in the RidbMediaData class:

To use extract to get media from several facilities at once, we have to change what we pass for the constructor parameters endpoint and url_params. For the endpoint, we pass the facilities endpoint, and the extract method appends the facilityID and “/media” for each facility. The url_params parameter becomes a DataFrame of API key / facilityID pairs, one row per facility.
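One way this overridden extract could look is sketched below. The base class is abbreviated to its constructor so the snippet runs on its own. The DataFrame column names ‘apikey’ and ‘facilityID’, the ‘apikey’ query parameter, and the MediaType column are assumptions made for illustration.

```python
import pandas as pd
import requests


class RidbData:
    """Base class, abbreviated to its constructor for this sketch."""

    def __init__(self, name, endpoint, url_params):
        self.name = name
        self.endpoint = endpoint
        self.url_params = url_params
        self.df = pd.DataFrame()


class RidbMediaData(RidbData):
    """Fetches media for many facilities with a single object."""

    def extract(self):
        """Query <endpoint>/<facilityID>/media once per row of url_params."""
        frames = []
        for row in self.url_params.itertuples(index=False):
            url = f"{self.endpoint}/{row.facilityID}/media"
            response = requests.get(url, params={"apikey": row.apikey})
            response.raise_for_status()
            frames.append(pd.DataFrame(response.json()["RECDATA"]))
        # Stack the per-facility results into one DataFrame.
        self.df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

    def clean(self):
        # Keep only image records ('MediaType' is an assumed column name).
        self.df = self.df[self.df["MediaType"] == "Image"]
```

The constructor is still inherited untouched; only the meaning of the values passed to it changes.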

Putting it all together

We now have two objects we can use to extract and clean data from the RIDB API. In addition to the benefit of reduced repeated code through inheritance, we also have a uniform interface for all of these data sources. We can make use of this to create a two line data extraction pipeline! First let’s set up our objects and endpoints.

Now the really neat part: because our objects all have the same interface, we can extract and clean the data for all of them in two lines:
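The idea can be sketched as a small helper whose loop body is those two lines. The name run_pipeline and the sources argument are hypothetical; the only assumption about each object is that it exposes extract() and clean().

```python
def run_pipeline(sources):
    """Extract and then clean every source in turn.

    Works with any objects that provide extract() and clean(),
    whatever their concrete class.
    """
    for source in sources:
        source.extract()
        source.clean()
    return sources
```

Nothing in the helper checks which class it is handling, so a new kind of data source can join the pipeline as long as it keeps the same two methods.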

You can now examine the cleaned data for each object using the ‘df’ attribute.

Give it a go for yourself! You can find the Jupyter Notebook for this post on Github.

Summary

We’ve seen some ways OOD paradigms can give us scalable, shareable code for data analysis. Some key takeaways:

  • Through inheritance, different objects can share the same code, reducing the likelihood that bugs will be introduced through changes in functionality.
  • By encapsulating the data with the associated methods in an object, we provide an implicit guarantee of the data shape and the manipulations it has undergone. This is especially important for keeping track of data manipulation during feature development.
  • Creating uniform interfaces through OOD principles can help us streamline downstream operations.

As you’re writing code, keep an eye out for these “code smells” to identify potential opportunities for OOD to help:

  1. Repeating yourself with slight tweaks to accommodate differences between data sources
    • Just different URLs, database connection strings or file names? A well-parameterized function will probably suit you just fine.
    • Finding yourself writing a lot of ‘if’ statements in your extraction code? It is probably time to refactor and consider using classes.
  2. Finding similar functions across vertical stacks
    • We looked across the process used for working with different data sources and identified similar functions for cleaning and acquiring data. If you can identify similarities like this in your code, OOD may help.

I hope you’ve found this introduction helpful for thinking about how you can organize your data analysis code to be more efficient and robust. There are many other ways OOD can be leveraged for data science work, including using abstract base classes for interface definition, writing robust web scrapers through inheritance, and streamlining machine learning prototyping through encapsulating feature development.
