Data Science Practice During Enterprise Cloud Migration
Updated: Dec 28, 2022
Well-established businesses rely on rich data assets, infrastructure, and time-tested processes to stay ahead of the competition. These three dimensions have also taken center stage in deriving value from data science models. With recent advancements in cloud-based services, there is a compelling need to transform all three dimensions to stay ahead. In a typical enterprise there are numerous operational applications, and not all of them can be migrated to the cloud simultaneously. On the other hand, migrating analytical data assets to the cloud spreads the impact to all operational systems that feed or rely on them, and hence is not preferred either. The resulting gradual migration to the cloud often leads to hybrid scenarios. At a high level, data science projects deal with multiple types of workloads, such as Exploratory Data Analysis, Feature Engineering & Stores, Model Training, Validation, Deployment and Monitoring. Enterprises tend to stay in a hybrid state for several years, depending on the complexity of their application landscape, skill set gaps, etc. While in the hybrid state, options exist for migrating data science workloads in a systematic manner that keeps the chaos managed. This systematic plan leverages concepts such as ‘judicious abstractions’ so that data science teams can reasonably meet their expectations while technology platform teams deliver robust platform capabilities. There are critical considerations in supporting self-service use of sensitive data and processes on cloud native platforms, where vulnerability management tends to be complex. This paper discusses adoption plans for the data science workloads and provides an abstraction example for beginning the Model Training and Validation workload on the cloud. Python-based abstraction code is described.
A perspective of hybrid environments
Well-established businesses rely on rich data assets, infrastructure, and time-tested processes to stay ahead of the competition. Teams have created these data assets through multiple processes, morphing the data from operational applications over long periods of time, using a multitude of toolsets, software libraries and platforms. These three dimensions have also taken center stage in deriving value from data science models. With recent advancements in cloud-based services, there is a compelling need to transform all three dimensions to stay ahead. While many organizations have succeeded in transforming purpose-built operational applications into cloud native solutions, enterprise or analytical data assets and their strategic uses have posed a different set of challenges.
In a typical enterprise there are numerous operational applications, and not all of them can be migrated to the cloud simultaneously. On the other hand, migrating analytical data assets to the cloud spreads the impact to all operational systems that feed or rely on them, and hence is not preferred either. The resulting gradual migration to the cloud often leads to scenarios where management of analytical data assets spans both on-premise and cloud native technologies. For instance, loads into on-premise data warehouse applications from a cloud native application now need to account for platform changes, data transfer mechanisms and costs, etc.
When we add self-service management of data assets to such a hybrid scenario, it is not uncommon for both business and technology teams to struggle with a mismatch between expectations and capabilities.
Data science workloads
At a high level, data science projects deal with the following types of workloads. The term workload has varying definitions; we use it to bundle the three dimensions focused on delivering outcomes: data assets, infrastructure and processes.
EXPLORATORY DATA ANALYSIS (EDA): This workload involves accessing multiple data sources, combining and transforming measures, creating visual aids and determining or finalizing the features associated with the hypotheses in scope. These features often do not exist in an easily accessible or readily available form. As the name suggests, this is an exploratory workload, where not all requirements can be defined at the onset of the project and/or hypothesis formulation. For example, the need for graph structures or graph databases may not be apparent initially, but may prove to be highly effective eventually.
FEATURE ENGINEERING: This workload involves formalizing and, in some instances, hardening the creation of features identified and earmarked at the EDA stage. Enterprises can glean tremendous value from feature engineering because well-defined features, when shared, can reduce future EDA workloads and project timelines. So, feature engineering workloads typically involve creating and persisting features that adhere to data governance policies. For example, a set of features that describes the frequency, recency and revenue associated with customers’ shopping, when engineered in a robust manner, can save hundreds of hours of development time and minimize inconsistencies. Subsequent EDA effort can derive additional features, which may even be outputs of complex processes, such as customer satisfaction scores.
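The frequency, recency and revenue features mentioned above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the Purchase record, the rfm_features function and the sample data are hypothetical names introduced here, not part of any enterprise feature catalog.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Purchase:
    """Hypothetical purchase record used only for this illustration."""
    customer_id: str
    purchased_on: date
    amount: float


def rfm_features(purchases, as_of):
    """Return {customer_id: {"recency_days", "frequency", "revenue"}}."""
    features = {}
    for p in purchases:
        if p.purchased_on > as_of:
            continue  # ignore records dated after the snapshot date
        days_ago = (as_of - p.purchased_on).days
        row = features.setdefault(
            p.customer_id,
            {"recency_days": days_ago, "frequency": 0, "revenue": 0.0},
        )
        row["recency_days"] = min(row["recency_days"], days_ago)
        row["frequency"] += 1
        row["revenue"] += p.amount
    return features


sample = [
    Purchase("c1", date(2022, 12, 1), 40.0),
    Purchase("c1", date(2022, 12, 20), 60.0),
    Purchase("c2", date(2022, 11, 15), 25.0),
]
features = rfm_features(sample, as_of=date(2022, 12, 28))
```

Once such a definition is agreed on and shared, every project that needs these three measures computes them the same way, which is where the consistency benefit comes from.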
FEATURE STORES: Feature stores are a key part of a data science project. At the very least, they are expected to bring consistency and a shorter time to market. To meet these objectives, feature stores may need to keep features current at the time granularity that matters for each of them. For example, the centrality of a customer’s social connectedness, or features from time-series data such as the spectral centroid and kurtosis of the absolute Fourier transform spectrum, need to be managed in an automated manner so that data scientists can focus on the new features they need to define. Typically, feature store workloads include functions such as search, extend, correlate and filter. Feature store workloads also involve data management that goes beyond traditional relational datastores: a feature may be composite or formulated as rich structured data, including graphs and tensors. In addition, features that are used in training are often also needed when models are operationalized, which means the FEATURE ENGINEERING and FEATURE STORES workloads need to support higher-tier SLAs.
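To make one of the time-series features above concrete, the following sketch computes a spectral centroid, that is, the magnitude-weighted mean frequency of a signal. It uses a naive discrete Fourier transform from the standard library so that it stays self-contained; a real feature pipeline would more likely rely on a numerical library. The function names and the test tone are illustrative assumptions.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Magnitudes of the DFT for bins 0..N//2 (naive O(N^2) transform)."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(acc))
    return magnitudes


def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the signal, in Hz."""
    magnitudes = magnitude_spectrum(signal)
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # a silent signal has no meaningful centroid
    n = len(signal)
    weighted = sum((k * sample_rate / n) * m for k, m in enumerate(magnitudes))
    return weighted / total


# A pure 5 Hz tone sampled at 64 Hz for one second.
tone = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
centroid = spectral_centroid(tone, sample_rate=64)
```

For a pure tone the centroid coincides with the tone's frequency, which makes the feature easy to sanity-check before it is automated in a feature store.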
MODEL TRAINING & VALIDATION: This set of workloads, at the heart of a data science project, tends to take various shapes. There are numerous types of machine learning algorithms, available under many frameworks and languages. Data scientists often deal with hyperparameter tuning processes, which resemble solving an optimization problem from an artistic starting point. The workloads pertaining to model training and validation place heavy demands on platform capabilities, such as scalability to train on large data or feature sets, flexibility to incorporate relevant data science algorithm frameworks, custom containerization to stay compliant, seamless CI/CD capabilities to progress the project, etc. To support this workload, enterprises need to enable governed and yet flexible self-service use of data, infrastructure and processes.
MODEL DEPLOYMENT: This workload may span the entire breadth of an enterprise application landscape. Data science models may be deployed as standalone offline batch-oriented applications, standalone online real-time applications, embedded services of other applications, champion-challenger configurations of these deployments, etc. Models may source their inputs using multiple mechanisms, such as files, application data streams, other data science model outcomes, etc. This paper does not discuss this workload in detail.
MODEL ACCURACY MONITORING: This workload has unique characteristics: model performance is measured against baselines newly established at the training phase as well as baselines established by other data science projects. In essence, in a steady-state scenario, enterprises may have a critical set of baseline measures already established using on-premise capabilities. In addition to baselines, model prediction effectiveness is measured against outcomes established by the business.
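A minimal sketch of comparing an observed metric against several baselines follows. The function name, the baseline names, the metric values and the tolerance are all hypothetical; the point is only that baselines from the training phase and from earlier (possibly on-premise) projects can be checked through the same call.

```python
def baselines_not_met(observed, baselines, tolerance=0.02):
    """Return the names of baselines the observed metric falls short of.

    A baseline counts as missed when the observed value is more than
    `tolerance` below it.
    """
    return sorted(
        name
        for name, value in baselines.items()
        if observed < value - tolerance
    )


missed = baselines_not_met(
    observed=0.81,
    baselines={"training_auc": 0.84, "legacy_on_premise_auc": 0.80},
)
```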
Challenges for data science workloads in hybrid environments
Several aspects constrain data scientists in building the components characterized in the previous section in a hybrid environment. Because of the hybrid nature of data, platforms and toolset availability, data science teams struggle to meet business demands despite valiant efforts from technology teams during cloud adoption. In some large enterprises, the hybrid situation tends to span multiple years, resulting in poor and/or risky adoption of cloud native capabilities. It is worthwhile to reiterate that most of the workloads defined earlier have been delivered on-premise through self-service platforms and toolsets, which makes cloud adoption more complex.
Platforms and toolsets challenge
Enterprises that do not simply lift and shift on-premise platforms and toolsets to cloud native environments need to account for migration efforts, costs and impacts, the learning curve associated with different platforms and toolsets, etc. This is the general case rather than the exception: simply redeploying on-premise platforms and toolsets in the cloud is not a strategy many enterprises use. So, during hybrid situations data science teams tend to rely on the platforms and toolsets they are comfortable with. For example, if an on-premise feature engineering workload has been built using ETL platforms like Informatica, Qlik, Alteryx, Ab Initio, etc., there may be constraints in migrating it to cloud native platforms with ease. One such question is how to effectively commingle workloads when a new data science model deployment needs features that are already in use along with new features.
Data exposure risk and compliance challenge
Since infrastructure is provisioned on demand using configuration software, care must be exercised so that multiple invocations of such on-demand infrastructure are consistent and compliant. In many cases, access controls remove the flexibility data science teams may have relied on to incorporate components such as libraries in a self-service manner. This limitation will continue to have a negative impact if platform teams do not provide means to deal with ongoing vulnerability management, compatibility issues, etc., that stem from day-to-day tasks.
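One way a platform team could keep repeated on-demand provisioning consistent is to validate every request against a single policy before anything is created. The sketch below is an assumption-laden illustration: the InfraRequest shape, the approved instance types and the required tags are hypothetical policy values, not AWS requirements.

```python
from dataclasses import dataclass, field

# Hypothetical enterprise policy values, for illustration only.
APPROVED_INSTANCE_TYPES = {"ml.m5.large", "ml.m5.xlarge"}
REQUIRED_TAGS = {"owner", "project"}


@dataclass(frozen=True)
class InfraRequest:
    instance_type: str
    encrypted_volume: bool
    tags: dict = field(default_factory=dict)


def compliance_violations(request):
    """Return a list of human-readable policy violations (empty if compliant)."""
    issues = []
    if request.instance_type not in APPROVED_INSTANCE_TYPES:
        issues.append(f"instance type {request.instance_type} is not approved")
    if not request.encrypted_volume:
        issues.append("storage volume must be encrypted")
    missing = REQUIRED_TAGS - set(request.tags)
    if missing:
        issues.append("missing tags: " + ", ".join(sorted(missing)))
    return issues


ok_request = InfraRequest("ml.m5.large", True, {"owner": "alice", "project": "churn"})
bad_request = InfraRequest("ml.p3.16xlarge", False, {"owner": "bob"})
```

Because every invocation passes through the same check, two data scientists asking for the same kind of environment get the same answer, regardless of who asks or when.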
Options to adopt cloud native workloads in hybrid environments
If hybrid environments are expected to be in place for several years, a few options may be considered for delivering data science projects effectively. The following table provides a summary discussion for each workload we reviewed earlier. The starting point for a data science project may fall anywhere in the spectrum supported by the hybrid state for each workload. For example, the Feature Engineering workload may have wider latitude in the spectrum. When the majority of the inputs to feature engineering code are on-premise, it is worthwhile to use on-premise capabilities for the Feature Engineering workload and yet persist features in cloud native platforms. As input sources migrate to the cloud, the Feature Engineering workload can migrate too. Since feature engineering code can be modularized, this workload can easily span the spectrum.
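The modularity argument can be sketched as follows: the feature logic is kept free of any storage concern, and the destination is a small object with a write method. Everything here is a hypothetical illustration; InMemorySink merely stands in for an on-premise or cloud native store, and a cloud-backed sink would be assumed to expose the same write(name, payload) method.

```python
import csv
import io
from collections import defaultdict


def total_spend_features(rows):
    """Pure feature logic: total spend per customer, sorted by customer id."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["amount"]
    return sorted(totals.items())


def to_csv(features):
    """Serialize (customer_id, total_spend) pairs to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["customer_id", "total_spend"])
    writer.writerows(features)
    return buffer.getvalue()


class InMemorySink:
    """Stand-in store; a cloud sink would expose the same write() method."""

    def __init__(self):
        self.objects = {}

    def write(self, name, payload):
        self.objects[name] = payload


def engineer_and_persist(rows, sink, name):
    """Run the feature logic wherever the inputs live, then persist via the sink."""
    sink.write(name, to_csv(total_spend_features(rows)))


sink = InMemorySink()
engineer_and_persist(
    [
        {"customer_id": "c1", "amount": 10.0},
        {"customer_id": "c2", "amount": 5.0},
        {"customer_id": "c1", "amount": 20.0},
    ],
    sink,
    "spend.csv",
)
```

With this split, moving the workload along the hybrid spectrum means swapping the sink (and later the input reader) while the feature definitions stay untouched.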
Strategies to begin on Cloud Native platforms - Abstractions
This section uses AWS SageMaker to illustrate a few options for beginning the Model Training and Validation workload with cloud native capabilities.
It is common knowledge that platforms, toolsets, libraries, etc., tend to evolve and enhance the value they provide. So, one of the decisions enterprises must make is how tight a coupling with these platforms they are willing to accept. Some platforms or toolsets are not candidates for this consideration. For example, GUI capabilities, such as Athena, Glue, the Data Wrangler interface that SageMaker Studio provides, etc., are versatile and unique. Since cloud platforms rely on open source machine learning packages and add orchestration and deployment services around them, attention must be paid to the learning curve needed to use them safely.
It is not uncommon for enterprises to look at abstractions that provide manageable decoupling and simplified usage scenarios. But first, let’s look at the challenges enterprises tend to consider before deciding.
What if abstractions are not an option?
Abstractions may not be appealing to some enterprises. The typical challenges they face depend on their maturity levels. Each challenge listed below has potential mechanisms to mitigate risks. These mechanisms are illustrated through an AWS SageMaker example.
Abstractions are not expected to provide a complete coverage of the entire suite of cloud native interfaces. So, they tend to begin with more common uses and expand as needed. The common uses are often directly visible in many enterprises and since many are intentional about cloud adoption, it is quite common for them to have planning sessions. The following table provides a few high-level characteristics of abstractions that are closely related to scope and coverage of abstractions.
Some of the pain points data scientists face in adopting cloud native platforms include infrastructure selection, access controls for cloud resources, vulnerability management, etc. These stem from the fact that, on-premise, technology teams have taken care of such concerns for them while enabling satisfactory levels of self-service use. Platforms like Qlik, Informatica, Alteryx, etc., not only provide sophisticated ETL capabilities, but also enable Python wrappers and/or virtual environments, with which data scientists are able to knit together data pipelines and model training, validation and deployment components for their data science driven solutions. These platforms also provide good orchestration capabilities, and many data scientists are able to automate and support production-grade applications, all in a self-service mode.
So, abstractions can be one of the techniques that technology platform teams leverage to continue providing a rich self-service experience in a hybrid environment. Appropriately designed abstractions may continue to add value even when the hybrid environment is no longer applicable. The following table provides a few high-level characteristics of abstractions that are closely related to providing a good experience for data scientists.
Example: Abstracted Model Training and Validation workload using AWS SageMaker
In a scenario where data science teams have both power users and novice users of cloud native capabilities, enterprises have flexibility in the starting scope of abstractions. Keep in mind that one of the key principles of abstractions is to allow power users to pass direct cloud native interface parameters through the abstractions. Governance of such use could be entrusted to data science teams, as they are ultimately accountable for the delivery of data science driven solutions. In real-life scenarios, many data science teams resort to such an approach on-premise; with the rapid adoption of cloud native capabilities, such an approach may be risky. So, abstractions are expected to provide the necessary guard rails and yet allow for prototypical use.
This section provides an example of an abstraction that allows data scientists to train and deploy using the XGBoost algorithm on AWS SageMaker. A Python notebook is used for this illustration, and the development code can run in multiple environments.
Development host – AWS SageMaker Studio
SageMaker Studio supports a GUI environment and provides managed notebook instances that can be used as a development environment. These instances must have all the development dependencies incorporated. Many enterprises follow robust change control procedures, and hence planning is required if a model development dependency requires changes to such notebook instances that the data scientists can’t manage in a self-service manner. For example, installing some packages may not be possible because internet connectivity may be limited.
Also, we need to plan for the administration tasks needed to provision domains and manage SageMaker Studio users, which are distinct from account login and IAM roles. The following schematic from the AWS SageMaker Studio Blog provides a good overview.
Development host - Local Environment on-premise servers
On the other hand, if data movement impacts are non-existent or minimal, which may be the case in hybrid environments, the development notebook can be run on-premise in local mode while the Model Training and Validation workload actually runs on AWS SageMaker. This option may provide some flexibility with respect to data scientists’ self-service. However, some deployments need to be planned in advance. For example, if the data science model training workload (not the development task) needs custom-configured containers rather than the containers AWS SageMaker provides, platform teams need to get ahead of this need. In some cases, this could be limiting.
AWS SageMaker abstraction sample
The library aplatformincubate is the main abstraction layer that is defined and maintained, typically by a technology platform team.
It is feasible to define an aplatformhardened library that has been hardened for higher SLA purposes. With such a capability maturity model, it is conceivable for a data science project to use both the incubate and hardened versions: it uses the hardened version of a service where one exists and falls back to an incubate-level service otherwise, knowing its limitations. Some power-user data scientists can themselves help with incubate-level abstractions.
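The choice between the two maturity levels can be sketched as a simple lookup. The service names and the two registries are hypothetical; in practice they would correspond to whatever the aplatformhardened and aplatformincubate libraries actually expose.

```python
# Hypothetical service catalogs for the two maturity levels.
HARDENED_SERVICES = {"xgboost_training"}
INCUBATE_SERVICES = {"xgboost_training", "huggingface_training"}


def resolve_service_level(service):
    """Prefer the hardened implementation; fall back to incubate if needed."""
    if service in HARDENED_SERVICES:
        return "hardened"
    if service in INCUBATE_SERVICES:
        return "incubate"
    raise LookupError(f"no abstraction available for {service!r}")


levels = {
    name: resolve_service_level(name)
    for name in ("xgboost_training", "huggingface_training")
}
```

Surfacing the resolved level to the caller lets a project record which of its steps still depend on incubate-level services.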
AWS SageMaker Environment Reference
This is a mandatory component irrespective of the development host. Keep in mind that in a fully on-premise environment, data scientists have access to consolidated environments where they develop, train and validate their data science models. Often, this is a limiting factor because scaling options are nonexistent. On the other hand, in hybrid or cloud native environments data scientists have multiple options. For example, they may develop their models in a reasonably sized development environment, even sharing it with other team members, and, based on the size of the training step, choose high-powered instances for scalability, thereby minimizing total costs for a data science project. To get this flexibility, the SageMaker environment reference is critical.
In the example, mySensorPlatform provides the attributes AWS SageMaker requires and makes them available for subsequent tasks. Each user gets their own platform environment and pays only for the resources they need, as enabled by AWS. The code fragment shows an example role, but in real life the role can be based on input parameters supplied as users obtain their platform references.
Since the platform supports multiple users, and each user may run simultaneous training sessions, the abstraction enables several platform references and uses AWS tags, etc., to distinguish them for various administrative tasks. In addition, enterprises may collect useful metadata as platforms are used, from either explicit or implicit inputs.
It is critical for abstraction APIs to provide the necessary hooks for power users to go beyond the abstraction scope in a governed manner. The SageMakerRole variable in the code fragment is needed precisely for such purposes. Which hooks are made available depends on the enterprise’s overall policies.
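A platform reference of the kind described above might look like the following sketch. Only the names mySensorPlatform and SageMakerRole come from the text; the PlatformReference class, the getPlatformReference function, the tag keys and the role ARN are hypothetical placeholders. The tag layout follows the Key/Value list format that AWS APIs accept for tags.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlatformReference:
    """Hypothetical per-user handle on the SageMaker environment."""
    user: str
    region: str
    SageMakerRole: str  # exposed as a hook for power users
    reference_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])

    def tags(self):
        """Tags that let administrators tell platform references apart."""
        return [
            {"Key": "platform-user", "Value": self.user},
            {"Key": "platform-reference", "Value": self.reference_id},
        ]


def getPlatformReference(user, region, role_arn):
    """In real life the role would be derived from the caller's inputs."""
    return PlatformReference(user=user, region=region, SageMakerRole=role_arn)


# Placeholder ARN for illustration; not a real role.
EXAMPLE_ROLE = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"
mySensorPlatform = getPlatformReference("alice", "us-east-1", EXAMPLE_ROLE)
anotherPlatform = getPlatformReference("alice", "us-east-1", EXAMPLE_ROLE)
```

Each call yields a distinct reference, so the same user can hold several references for simultaneous training sessions and still be tracked individually.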
Training environment setup
Keep in mind that an environment could be used for multiple purposes, such as training, batch inferencing etc.
The flexibility of cloud native platforms is that one can tailor on-demand infrastructure that is fine-tuned for the purpose at hand. AWS SageMaker provides such a capability through pre-built containers, held in a repository, that are focused on algorithms such as XGBoost and can be hosted on numerous hardware instance types. Some data science teams do find it challenging to figure out which hardware instances are appropriate. Abstractions can come in handy in such situations.
From a cost management perspective, one may use a smaller configuration to build and test training job code before actually performing the training. The API mySensorPlatform.getSmallXGBoostEnvironment(region) does precisely that.
The API mySensorPlatform.getSmallXGBoostEnvironment(region) distinguishes the infrastructure needs of XGBoost from those of HuggingFace or frameworks like TensorFlow, and also fixes the size of the instance to be used. While this example explicitly names a small instance, one can define an API that takes implicit inputs and determines the instance size.
This also provides an example of incubate vs hardened levels of abstraction. For enterprises that are getting started, one can simply use judgement based allocation of Small, Medium, Large and Huge classes and enable data scientists to make quick progress. Subsequently the technology platform team may extend the abstraction to deduce appropriate instance configuration based on input factors, such as algorithm type, training data size, etc.
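The two maturity levels described above can be sketched side by side. The function name getSmallXGBoostEnvironment comes from the text (shown here as a plain function rather than a method); the TrainingEnvironment class, the mapping from size class to instance type, and the data-size thresholds are illustrative assumptions that a platform team would replace with its own judgement.

```python
from dataclasses import dataclass

# Illustrative judgement-based mapping; the choice of types is an assumption.
INSTANCE_BY_SIZE = {
    "Small": "ml.m5.large",
    "Medium": "ml.m5.xlarge",
    "Large": "ml.m5.4xlarge",
    "Huge": "ml.m5.12xlarge",
}


@dataclass(frozen=True)
class TrainingEnvironment:
    algorithm: str
    region: str
    instance_type: str
    instance_count: int = 1


def getSmallXGBoostEnvironment(region):
    """Incubate level: a fixed small instance for building and testing jobs."""
    return TrainingEnvironment("xgboost", region, INSTANCE_BY_SIZE["Small"])


def size_class_for(training_data_gb):
    """Later extension: deduce the size class from an input factor.

    The thresholds are arbitrary placeholders.
    """
    if training_data_gb < 1:
        return "Small"
    if training_data_gb < 10:
        return "Medium"
    if training_data_gb < 100:
        return "Large"
    return "Huge"


def getXGBoostEnvironment(region, training_data_gb):
    """Pick the instance type from the training data size."""
    size = size_class_for(training_data_gb)
    return TrainingEnvironment("xgboost", region, INSTANCE_BY_SIZE[size])


small_env = getSmallXGBoostEnvironment("us-east-1")
sized_env = getXGBoostEnvironment("us-east-1", training_data_gb=25)
```

A fuller abstraction would also resolve the algorithm container image for the region; that lookup is left out of this sketch.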
Data preparation – Keep it outside of Training and Validation workloads
Though not necessarily part of the model training workload, training data needs to be organized in specific formats if we are going to leverage built-in algorithms. In some enterprises, the data preparation pipelines may involve complex transformations, and it may be prudent to keep them outside of the core Model Training and Validation workloads. For example, data pipelines may not need the characteristics of AWS ml-instances. AWS ml-instances are more expensive (the exact amount depends on a few factors), and hence it is prudent to rely on other options in the AWS ecosystem.
So, in our hybrid situation example, the notebook may be hosted on an EC2 instance or even on a local machine. If one uses a local machine, egress costs will need to be factored in.
As stated earlier, not all cloud native interfaces need to be abstracted. In particular, data preparation code may take various forms and, in many instances, may sit outside of the Model Training and Validation workload. So, the following code fragment directly uses AWS boto3 libraries and S3 APIs.
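A minimal sketch of such data preparation code follows, under a few stated assumptions. The built-in XGBoost algorithm accepts CSV training data with the label in the first column and no header row, so the helper writes exactly that. The upload helper calls put_object, which is the method a boto3 S3 client provides; to keep the sketch runnable without AWS credentials it is exercised here with a small recording stand-in. The bucket name, object key, column names and helper names are hypothetical.

```python
import csv
import io


def to_xgboost_csv(rows, label_key, feature_keys):
    """CSV text with the label first and no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[label_key]] + [row[k] for k in feature_keys])
    return buffer.getvalue()


def upload_training_data(s3_client, csv_text, bucket, key):
    """Upload the CSV and return its S3 URI for the training job input."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=csv_text.encode("utf-8"))
    return f"s3://{bucket}/{key}"


class RecordingS3Client:
    """Stand-in that records calls so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)


rows = [
    {"label": 1, "x1": 0.5, "x2": 2},
    {"label": 0, "x1": 1.5, "x2": 3},
]
csv_text = to_xgboost_csv(rows, "label", ["x1", "x2"])
client = RecordingS3Client()  # with real access: boto3.client("s3")
uri = upload_training_data(client, csv_text, "example-bucket", "xgboost/train.csv")
```

Passing the client in as a parameter keeps the preparation code identical whether the notebook runs on an EC2 instance or on a local machine.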
Training sessions setup
In a typical data science project, model training happens several times, and some runs may even execute simultaneously with a few variations (I need the results NOW!) in hyperparameters, inputs, etc. So, the ability to create multiple training sessions that are bounded and distinguishable is required. It is also quite common for data scientists to schedule a few variations of training and choose the ones that are helpful for their purpose. In a typical on-premise environment for Model Training and Validation workloads, data scientists may not have the option to run several experiments simultaneously. Rather than serializing these steps, in cloud native environments they may run a few training jobs simultaneously and then get back to analyzing the outcomes. So, the following code fragment demonstrates the ability to create multiple training sessions and associate the platform details with them.
So, the myTrainingSession named train1 is distinct and bounded.
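The idea of distinct, bounded sessions can be sketched as follows. The names myTrainingSession and train1 come from the text; the SessionFactory and TrainingSession classes, the platform identifier and the job-name layout are hypothetical. Session names are restricted to letters, digits and hyphens so that they can be embedded safely in training job names.

```python
import re
from dataclasses import dataclass

_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


@dataclass(frozen=True)
class TrainingSession:
    name: str
    platform_id: str

    def job_name(self, suffix):
        """Job names carry the platform and session, keeping runs traceable."""
        return f"{self.platform_id}-{self.name}-{suffix}"


class SessionFactory:
    """Creates sessions tied to one platform reference and rejects duplicates."""

    def __init__(self, platform_id):
        self.platform_id = platform_id
        self._sessions = {}

    def create(self, name):
        if not _NAME_PATTERN.match(name):
            raise ValueError(f"invalid session name: {name!r}")
        if name in self._sessions:
            raise ValueError(f"session {name!r} already exists")
        session = TrainingSession(name, self.platform_id)
        self._sessions[name] = session
        return session


factory = SessionFactory("sensor-a1b2")
myTrainingSession = factory.create("train1")
variation = factory.create("train2")  # a second, simultaneous variation
```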
Model Training Workload
The Model Training workload is accomplished by creating an AWS SageMaker training job and submitting it for execution on the platform. The API myTrainingSession.submitTraining(job_name=' ', container=' ') accomplishes precisely this. It simplifies many of the input options; for example, hyperparameters are supplied by the API and can be overridden. A well-designed API should require mandatory inputs to be provided, whereas this example only illustrates the concept.
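The behavior described above (supplied hyperparameters that can be overridden, plus mandatory inputs) can be sketched as a function that assembles the job request. This is a hypothetical illustration: it is written as a plain function rather than a session method, it only builds a request dictionary shaped like a subset of the SageMaker CreateTrainingJob input instead of calling AWS, and the default hyperparameter values and the image placeholder are assumptions.

```python
# Assumed defaults an abstraction might supply for XGBoost.
DEFAULT_XGBOOST_HYPERPARAMETERS = {
    "objective": "binary:logistic",
    "num_round": 100,
    "max_depth": 5,
    "eta": 0.2,
}


def submitTraining(job_name, container, hyperparameters=None):
    """Build a training job request, enforcing the mandatory inputs."""
    if not job_name or not job_name.strip():
        raise ValueError("job_name is mandatory")
    if not container or not container.strip():
        raise ValueError("container is mandatory")
    # Caller-supplied values win over the defaults.
    merged = {**DEFAULT_XGBOOST_HYPERPARAMETERS, **(hyperparameters or {})}
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": container,
            "TrainingInputMode": "File",
        },
        # The API expects hyperparameter values as strings.
        "HyperParameters": {k: str(v) for k, v in merged.items()},
    }


request = submitTraining(
    job_name="train1-001",
    container="<xgboost-image-uri>",  # placeholder, not a real image URI
    hyperparameters={"max_depth": 8},
)
```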
The Python notebook referenced in this paper allows data scientists to prepare and complete a training session on AWS SageMaker without directly using AWS SageMaker APIs. Instead, the AWS SageMaker APIs are encapsulated based on sample scenarios. In real-life scenarios, the abstractions are expected to be more sophisticated. As SageMaker evolves, platform teams can incorporate new capabilities into the abstractions while maintaining backwards compatibility.
Though not explicitly discussed, most APIs defined as part of such abstractions should allow expert data scientists to pass underlying AWS SageMaker API parameters. In such scenarios, platform engineers have to choose which of the parameters supplied by the abstractions can be overruled.
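One possible way to govern such pass-through is a list of locked parameters that the abstraction always controls, with everything else open to override. The sketch below is an assumption about how a platform team might do this; the field names are taken from the SageMaker CreateTrainingJob request, while the choice of which ones to lock, the function name and the sample values are hypothetical.

```python
# Hypothetical policy: parameters the abstraction never lets callers overrule.
LOCKED_PARAMETERS = {"RoleArn", "VpcConfig", "EnableNetworkIsolation"}


def apply_overrides(base_request, overrides):
    """Merge power-user overrides into the request, rejecting locked ones."""
    blocked = sorted(LOCKED_PARAMETERS & set(overrides))
    if blocked:
        raise PermissionError("cannot override: " + ", ".join(blocked))
    return {**base_request, **overrides}


base_request = {
    "RoleArn": "arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
extended = apply_overrides(
    base_request,
    {"StoppingCondition": {"MaxRuntimeInSeconds": 7200}},
)
```

Keeping the locked list in one place makes the platform team's choices explicit and easy to revisit as policies change.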
Choices, Choices & Choices!