Follow me for insights on real-world ML 🚀
If this thread helped, drop a like 👍 or repost 🔁
Why use the CPO Pattern?
✅ Scalability: Add new models & components easily.
✅ Maintainability: Elements are decoupled, making it easy to change, test, and debug individual parts without breaking the entire system.
✅ Reusability: Reuse components across different ML projects.
3️⃣ Orchestrators 🤖
The orchestrator is responsible for managing the pipeline execution.
This allows us to easily switch components (e.g., swap XGBoost for RandomForest).
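A minimal sketch of what an orchestrator could look like; the Orchestrator class and its execute() method are illustrative names, not code from this thread:

```python
# Hypothetical orchestrator: runs any pipeline that exposes a run() method.
class Orchestrator:
    def __init__(self, pipeline):
        self.pipeline = pipeline  # any object with a run() method

    def execute(self):
        # One central place for cross-cutting concerns: logging, retries, timing.
        print(f"Starting {type(self.pipeline).__name__}")
        result = self.pipeline.run()
        print("Pipeline finished")
        return result

# Swapping XGBoost for RandomForest only changes the trainer component
# passed into the pipeline; the orchestrator itself stays untouched.
# Orchestrator(some_training_pipeline).execute()
```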
2️⃣ Pipelines 🛤️
A TrainingPipeline ties together components into a single workflow.
Pipelines ensure a clear flow of data, from raw input to final evaluation.
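A minimal sketch of how a TrainingPipeline might tie the components together; the four component parameters and their method names (load, transform, fit, score) are illustrative assumptions:

```python
# Hypothetical pipeline: wires components into one raw-input-to-evaluation flow.
class TrainingPipeline:
    def __init__(self, importer, preprocessor, trainer, evaluator):
        self.importer = importer
        self.preprocessor = preprocessor
        self.trainer = trainer
        self.evaluator = evaluator

    def run(self):
        raw = self.importer.load()                # raw input
        X, y = self.preprocessor.transform(raw)   # clean features + target
        model = self.trainer.fit(X, y)            # trained model
        return self.evaluator.score(model, X, y)  # final evaluation
```

Because run() only talks to component interfaces, a trainer wrapping XGBoost can be swapped for one wrapping RandomForest without touching the pipeline.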
1️⃣ Components 📦
Each key function (data import, preprocessing, training, evaluation) is wrapped in its own class.
This makes it reusable, testable, and modular.
Example: a DataImporter class to load data!
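Here's a minimal sketch of such a component, assuming a CSV data source; the path and interface are illustrative:

```python
import pandas as pd

# Hypothetical component: one class, one responsibility (loading data).
class DataImporter:
    def __init__(self, path: str):
        self.path = path

    def load(self) -> pd.DataFrame:
        # Swap this for a database, API or parquet reader without touching
        # the rest of the pipeline.
        return pd.read_csv(self.path)

# importer = DataImporter("data/train.csv")  # illustrative path
# df = importer.load()
```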
🚀 Want to write production-grade ML code that's clean, modular & maintainable?
Here's a simple CPO (Components, Pipelines, Orchestrators) pattern example to structure your code.
Let's break this down with a training pipeline example 👇
#MachineLearning #DataScience #MLOps
DataFramed, Super Data Science, The Data Scientist Show, Ken's Nearest Neighbours
19.02.2025 15:34
Follow me for insights on real-world ML 🚀
If this thread helped, drop a like 👍 or repost 🔁
EDA isn't just a formality. It's the difference between a strong ML model and garbage results.
✅ Understand your data
✅ Spot issues before modeling
✅ Find strong predictors
These steps serve as a starting point. Analyse further depending on the needs of your project.
💡 Why?
🔍 Detect multicollinearity (too many correlated features = model confusion)
🔍 Uncover hidden patterns
🔍 Reduce dimensionality if needed (PCA can help)
👉 Example: If two features are 99% correlated, you probably don't need both in your model.
4️⃣ Multivariate Analysis 📊 (Relationships Between Features)
Time to go deeper and check how features interact with each other.
🔥 Correlation heatmaps (sns.heatmap(df.corr(), annot=True))
🔥 Pairplots (sns.pairplot(df))
🔥 PCA (if needed) (PCA().fit_transform(df))
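Those three checks as one runnable snippet; the toy DataFrame and its column names are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative numeric data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 29, 40],
    "income": [30, 80, 50, 95, 42, 70],
    "spend": [28, 78, 49, 96, 40, 72],  # deliberately ~collinear with income
})

# Correlation heatmap: spot multicollinearity at a glance.
sns.heatmap(df.corr(), annot=True)
plt.show()

# Pairplot: eyeball pairwise relationships and distributions.
sns.pairplot(df)
plt.show()

# PCA on standardised features: how much variance do 2 components keep?
pca = PCA(n_components=2)
reduced = pca.fit_transform(StandardScaler().fit_transform(df))
print(pca.explained_variance_ratio_)
```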
💡 Why?
🔍 Find strong predictors
🔍 Identify non-linear relationships (a cue to choose a non-linear model)
🔍 Detect leakage (some features might be too correlated with the target!)
👉 Example: A high correlation might mean a strong predictor… or data leakage. Always check!
3️⃣ Bivariate Analysis 📊
See how features relate to the target variable.
🔍 Numerical:
Correlation heatmaps (df.corr())
Scatter plots
🔍 Categorical:
Box plots (sns.boxplot(x='category', y='target', data=df))
Grouped means (df.groupby('category')['target'].mean())
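The same checks as a runnable snippet; the toy data and the 'category'/'target' column names are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data: one numeric feature, one categorical feature, one target.
df = pd.DataFrame({
    "feature": [1.0, 2.1, 2.9, 4.2, 5.1, 6.0],
    "category": ["a", "a", "b", "b", "c", "c"],
    "target": [1.1, 2.0, 3.2, 4.1, 5.3, 5.9],
})

# Numerical vs target: correlation and a scatter plot.
print(df[["feature", "target"]].corr())
df.plot.scatter(x="feature", y="target")
plt.show()

# Categorical vs target: distribution per category, then group means.
sns.boxplot(x="category", y="target", data=df)
plt.show()
print(df.groupby("category")["target"].mean())
```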
💡 Why?
🔍 Detect skewness (maybe you need log transformations)
🔍 Spot outliers
🔍 Identify imbalanced categories
👉 Example: If a feature is heavily skewed, your linear models might struggle. Fix it early!
2️⃣ Univariate Analysis 📊
Let's get to know each feature individually.
🔍 What to check?
🔍 Numerical: Histograms, box plots (df['feature'].hist())
🔍 Categorical: Value counts, bar plots (df['feature'].value_counts())
🔍 Outliers: Box plots (sns.boxplot(x=df['feature']))
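A runnable version of these checks; the toy data and the 'feature'/'colour' columns are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative data with a skewed numeric feature and a categorical one.
df = pd.DataFrame({
    "feature": [1, 1, 2, 2, 3, 3, 4, 50],   # 50 is an obvious outlier
    "colour": ["red", "red", "red", "blue", "blue", "green", "green", "green"],
})

# Numerical: histogram shows the distribution (and the skew).
df["feature"].hist()
plt.show()

# Categorical: value counts reveal imbalanced categories.
print(df["colour"].value_counts())

# Outliers: a box plot makes the extreme value jump out.
sns.boxplot(x=df["feature"])
plt.show()
```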
💡 Why? This helps spot early issues (wrong dtypes, missing data, duplicates, incorrect values) before they ruin your model.
👉 Example: Checking for missing values can reveal whether you need imputation or feature removal.
1️⃣ Understanding Data Structure & Metadata 🗂️
Before diving in, know your data.
🔍 What to check?
✅ Dataset shape (df.shape)
✅ Data types (df.dtypes)
✅ Descriptive statistics (df.describe())
✅ Missing values (df.isnull().sum())
✅ Unique values & duplicates
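All five checks in one runnable snippet; the tiny DataFrame stands in for your own dataset:

```python
import pandas as pd

# Illustrative dataset; swap in your own.
df = pd.DataFrame({
    "age": [23, 45, None, 52],
    "city": ["Leeds", "York", "Leeds", "Leeds"],
})

print(df.shape)               # rows x columns
print(df.dtypes)              # data types per column
print(df.describe())          # descriptive stats for numeric columns
print(df.isnull().sum())      # missing values per column
print(df.nunique())           # unique values per column
print(df.duplicated().sum())  # duplicate rows
```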
🚀 Many newbies jump straight to feature engineering and training ML models... then find their models underperform.
Why? They skip Exploratory Data Analysis (EDA).
EDA is a key step for both model performance and explainability.
Here's how to start 👇
#DataAnalytics #100DaysOfML #MachineLearning
Should you separate your SQL servers into different resource groups? It depends on your specific use case. If you have a small number of SQL databases and they are tightly integrated with your app, then keeping them in the same resource group as your app might be the way to go.
10.02.2025 00:07
It's recommended to keep Data Lakes and Blob Storage in a different resource group from your application or project resource group. This limits the attack surface for potential security issues and eases resource & cost management.
10.02.2025 00:07
It's a good idea to keep data associated with different projects in dedicated Blob Storage containers.
10.02.2025 00:06
Create dedicated resource groups for each large project or application.
10.02.2025 00:06
On cloud platforms there are a number of different ways to store your data, and it can often be confusing to know how best to organise it.
Here are a couple of tips for working with Blob Storage, Data Lakes and SQL Servers:
#azure #dataengineering
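As a rough sketch of the layout these tips describe, using the Azure SDK for Python (all resource names, the subscription ID and the storage account URL are illustrative placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.storage.blob import BlobServiceClient

credential = DefaultAzureCredential()
subscription_id = "<your-subscription-id>"  # placeholder

# One dedicated resource group per large project or application...
rg_client = ResourceManagementClient(credential, subscription_id)
rg_client.resource_groups.create_or_update("rg-projectx-app", {"location": "westeurope"})

# ...and a separate one for its data, to contain security issues and ease cost tracking.
rg_client.resource_groups.create_or_update("rg-projectx-data", {"location": "westeurope"})

# One Blob container per project inside the data resource group's storage account.
blob_service = BlobServiceClient(
    account_url="https://projectxdata.blob.core.windows.net",  # illustrative account
    credential=credential,
)
blob_service.create_container("projectx-raw")
```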
Personally, I'm not a fan of building things with certain frameworks or patterns just because company X or developer Y does it that way. I prefer to treat them as ideas to try, see the pros and cons, and then use them where they make sense rather than following blindly.
10.02.2025 00:04
The more I watch or read about successful people, the more I see that:
- They build things they want to use
- They are smart contrarians
- They don't assume something will or won't work; they experiment and adjust.
Also helps if you have rich people in your network 😬