Training Data Preparation by Marco Interactive

Date

May 2020

Client

Genentech

Genentech operates drug manufacturing facilities globally, where precision is paramount. These facilities represent substantial investments in cutting-edge equipment, highly trained personnel, and rigorous environmental controls, all with the overarching goal of consistently producing precise drug formulations. Even the slightest alterations in environmental factors, processes, or formulation can introduce significant deviations in the final product, potentially resulting in multimillion-dollar losses per production run.

Problem Statement

Genentech faced a costly challenge involving unexplained losses in various manufacturing processes across multiple sites. Their objective was to gather comprehensive time series data from every machine and process, including data on plant environmental conditions. This data would then be fed into a machine learning model designed to pinpoint the exact source of these losses. The challenge extended to ensuring that both the model and its training data could be consistently replicated, considering the diverse data collection methods employed by Genentech. Furthermore, this complex task had to be coordinated across four separate teams, each operating independently with distinct security models and data access protocols.

Solution

I designed and implemented a data pipeline that aggregated information from the various sources and consolidated it in Google Cloud Bigtable. This raw data was the foundation for the refined datasets used in model training and direct prediction analysis. Some data was extracted automatically from machine logs; other information required manual processing, which introduced the potential for human error. To address this, I wrote quality control scripts to validate all input data sources systematically. I also established a standardized schema for the training dataset across the diverse systems, ensuring data consistency and governance. This common data framework gave the team a shared vocabulary and allowed us to refine the model for identifying anomalous conditions.
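
As a rough illustration of what a quality control check against a standardized schema can look like, here is a minimal Python sketch. The field names, the schema, and the function names are all hypothetical; the actual schema and scripts used on the project are not described here.

```python
from datetime import datetime

# Hypothetical schema: field name -> accepted Python type(s).
# The real field names are not given in this case study.
SCHEMA = {
    "site": str,
    "machine_id": str,
    "batch_id": str,
    "timestamp": str,          # ISO 8601 text, parsed below
    "metric": str,
    "value": (int, float),
}


def validate_record(record):
    """Return a list of problems found in one record (empty list means clean)."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in record or record[field] in ("", None):
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} has wrong type")
    ts = record.get("timestamp")
    if isinstance(ts, str) and ts:
        try:
            datetime.fromisoformat(ts)
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems


def split_clean_and_rejected(records):
    """Separate records that pass every check from those needing review."""
    clean, rejected = [], []
    for rec in records:
        problems = validate_record(rec)
        if problems:
            rejected.append((rec, problems))
        else:
            clean.append(rec)
    return clean, rejected
```

Routing rejected records to a review list, rather than silently dropping them, is one way to catch manual-entry mistakes before they reach the training set.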

Challenges

Aligning the data accurately was a challenge in a complex environment where different equipment and providers tracked and categorized data in different ways. Each device had its own lexicon and data grouping method for what was essentially the same thing: a batch run of data. Bringing these diverse datasets into agreement took considerable effort.

I bridged these gaps by collaborating closely with the Genentech scientists responsible for the plant. Together, we created a unified time-series format for the data and established a common baseline dictionary of identifiers and definitions. This ensured data consistency and streamlined the analysis, making it more efficient and reliable. Our shared goal was to improve the accuracy and effectiveness of the model, ultimately benefiting Genentech's manufacturing processes.
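
To make the idea of a common dictionary and a unified time-series format concrete, here is a small Python sketch. The vendor names, field names, and the assumption that timestamps without a timezone are UTC are all illustrative, not details from the project.

```python
from datetime import datetime, timezone

# Hypothetical dictionary: vendor-specific field names -> common names.
# Vendor names and fields are illustrative, not from the project.
FIELD_ALIASES = {
    "vendor_a": {"LotNo": "batch_id", "TempC": "temperature_c", "ts": "timestamp"},
    "vendor_b": {"run_id": "batch_id", "temp": "temperature_c", "epoch_s": "timestamp"},
}


def to_utc_iso(value):
    """Normalize epoch seconds or ISO 8601 text to an ISO 8601 UTC string."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt.astimezone(timezone.utc).isoformat()


def normalize(source, raw):
    """Rename one raw reading's fields to the common names and unify its timestamp."""
    aliases = FIELD_ALIASES[source]
    row = {aliases[k]: v for k, v in raw.items() if k in aliases}
    row["timestamp"] = to_utc_iso(row["timestamp"])
    row["source"] = source
    return row
```

With a mapping like this, two devices that describe the same batch run in different terms produce rows with identical keys and comparable timestamps, which is what lets them be joined on batch and time.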

Tools

SQL
Draw.io
Python
Google Cloud
Bigtable
