Data Organization
Visualization
When the data is uploaded, the catalog is processed by hundreds of computer vision and NLP models, which project the products onto a 2D canvas. This lets you see how your data is spread across the canvas and quickly spot similarities and differences between products based on their relative distances from each other. To avoid clutter, only a sample of the data points is visualized at a time.
Using pre-labeled data or starting with a default taxonomy lets you begin organization from a semi-organized state.
Organization Interaction
Labeling data points
Data organization starts with labeling a few data points. You can label a data point in two ways:
- Drag and drop the product into its class at the bottom of the screen.
- Hover over the product and choose a label from the dropdown.
Labeled data point - Assigning a class to a data point marks it as a ‘labeled data point’. Once a data point is labeled, it disappears from the working area. To view labeled data points, click the class to open all data points labeled as that class.
Predictions and Clusters
Once a few products have been labeled, the system infers the intent and tries to predict the classes of the remaining products by grouping similar ones into clusters.
A cluster is a group of similar products predicted by the system. The cluster boundary represents a prediction confidence of 80%.
Predicted data point - A data point that falls within a cluster is a predicted data point. Data points predicted with a confidence over 80% fall inside the cluster.
Data points with a lower prediction confidence fall outside the cluster. These are considered unlabeled and unpredicted data, and are referred to as outliers.
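The 80% boundary described above can be sketched as a simple threshold check. This is only an illustration: the `Prediction` type, its field names, and the `split_by_confidence` helper are hypothetical, and treating a confidence of exactly 80% as an outlier is an assumption based on the phrase "over 80%".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Cluster boundary from the text above; the name is hypothetical.
CONFIDENCE_THRESHOLD = 0.80


@dataclass
class Prediction:
    product_id: str
    predicted_class: Optional[str]  # None when the system has no prediction
    confidence: float               # expressed as a fraction between 0.0 and 1.0


def split_by_confidence(
    predictions: List[Prediction],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (inside a cluster, outliers)."""
    inside, outliers = [], []
    for p in predictions:
        # Assumption: "over 80%" is a strict comparison.
        if p.predicted_class is not None and p.confidence > threshold:
            inside.append(p)
        else:
            outliers.append(p)
    return inside, outliers
```

With this sketch, a product predicted as "Dresses" at 0.93 lands inside the cluster, while one predicted at 0.80 or with no prediction at all is returned as an outlier.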
You can continue to drag and drop within the clusters to convert predicted data points into labeled data points. Labeled data points disappear from the cluster and the working area.
Confirm prediction - Drag a data point from within a cluster and drop it within the same cluster. This confirms the system’s prediction and marks the data point as ‘labeled’.
Correct mispredictions - Drag a data point from one cluster and drop it inside another cluster. This corrects the misprediction and labels the data point with the new cluster’s class.
Label unpredicted data - Drag a data point from outside the clusters and drop it inside a cluster. This is the same as dropping the data point into the class at the bottom of the screen: it labels the data point with that cluster’s class.
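The three drag-and-drop outcomes above differ only in where the data point came from. A minimal sketch of that decision, using a hypothetical `resolve_drop` function and hypothetical action names (none of these are part of the product):

```python
from typing import Dict, Optional


def resolve_drop(source_cluster: Optional[str], target_cluster: str) -> Dict[str, str]:
    """Classify a drop onto a cluster.

    source_cluster is the class of the cluster the data point was dragged
    from, or None if it was an outlier outside every cluster.
    """
    if source_cluster is None:
        action = "label_unpredicted"
    elif source_cluster == target_cluster:
        action = "confirm_prediction"
    else:
        action = "correct_misprediction"
    # In all three cases the data point ends up labeled as the target class.
    return {"action": action, "label": target_cluster}
```

For example, `resolve_drop("Tops", "Dresses")` is a correction and `resolve_drop(None, "Dresses")` labels an outlier; both leave the data point labeled as "Dresses".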
Labeled data points are not considered for predictions again. These are anchored to the class assigned by the user. Only unlabeled data points (with or without predictions) are considered for predictions in the next refresh.
Labeling data points in bulk
Another powerful option is to organize data in bulk.
- Here you can scan through a grid of similar products and quickly select and label tens of data points at a time.
- You can sort data by prediction confidence and filter it by user-labeled or system-predicted data. Predicted data points have a grey tag, while labeled data points have a blue tag.
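The sorting and filtering options above amount to the following operations on the grid. This is a sketch only: the `DataPoint` fields and the `"labeled"` / `"predicted"` status values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataPoint:
    product_id: str
    assigned_class: str
    confidence: float  # prediction confidence as a fraction
    status: str        # assumed values: "labeled" (blue tag) or "predicted" (grey tag)


def filter_by_status(points: List[DataPoint], status: str) -> List[DataPoint]:
    """Keep only user-labeled or only system-predicted data points."""
    return [p for p in points if p.status == status]


def sort_by_confidence(points: List[DataPoint], ascending: bool = True) -> List[DataPoint]:
    """Order data points by prediction confidence."""
    return sorted(points, key=lambda p: p.confidence, reverse=not ascending)
```

Filtering to predicted data and sorting in ascending order puts the least certain predictions first, which is one reasonable way to decide what to review in bulk.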
While the AI classifies most of the data points correctly, the feedback you give by moving outliers into clusters helps the system refine its predictions. With a few iterations of predictions and feedback, the accuracy of the AI improves and the organization gets closer to completion.
The bulk page also offers icon enlargement to size icons up or down as required.
- Switching working levels - Once completion reaches 99-100% at the category level, you can switch to organizing a different level of the taxonomy. The organization steps are repeated in much the same way as at the category level.
Switching taxonomy levels
Once organization at the current level is complete, it is recommended to switch to a child taxonomy level or a peer attribute to continue organization.
Note:
Completion at the current organization level is determined by:
- A completion rate of 100% or in the high 90s. Some attributes can be difficult to train the system on; in such cases, use your discretion to stop labeling at a lower completion rate.
- An average confidence score over 80%. This indicates that the system has classified most of the data in the validation set correctly.
- No outliers, or very few outliers.
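The three completion criteria above can be combined into a single check. In this sketch the function name and its parameters are hypothetical, and the default cut-offs for "high 90s" (0.97) and "very few outliers" (0) are assumed values that you would tune for your own catalog.

```python
def level_is_complete(
    completion_rate: float,
    average_confidence: float,
    outlier_count: int,
    min_completion: float = 0.97,  # assumed reading of "high 90s"
    min_confidence: float = 0.80,  # "over 80%" from the criteria above
    max_outliers: int = 0,         # raise this to tolerate "very few" outliers
) -> bool:
    """Return True when all three completion criteria are met.

    completion_rate and average_confidence are fractions between 0.0 and 1.0.
    """
    return (
        completion_rate >= min_completion
        and average_confidence > min_confidence
        and outlier_count <= max_outliers
    )
```

For an attribute that is hard to train on, passing a lower `min_completion` mirrors the discretion described in the note.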
To switch the taxonomy level, click the taxonomy level indicator to open the taxonomy screen, and select the taxonomy level to switch to. When switching to a deeper level in the taxonomy, only the data points that have been labeled as the parent class will show up for organization.
In this video, we look at how to switch the taxonomy level from a parent to a child node once the category has been organized. If the user switches the taxonomy level to the ‘Sleeve Length’ attribute under the ‘Dress’ class, only those data points labeled or predicted as ‘Dresses’ will show up for organizing by ‘Sleeve Length’.
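The effect of switching to a deeper level can be pictured as a filter on the parent class. The sketch below follows the ‘Dresses’ example and includes both labeled and predicted data points; the dictionary keys and the `points_for_child_level` helper are hypothetical, and letting a user label take precedence over a prediction is an assumption.

```python
from typing import Dict, List, Optional


def points_for_child_level(
    points: List[Dict[str, Optional[str]]],
    parent_class: str,
) -> List[Dict[str, Optional[str]]]:
    """Select the data points that show up when organizing under parent_class.

    Each point is a dict with 'id', 'label' (user label or None) and
    'prediction' (system prediction or None).
    """
    selected = []
    for p in points:
        # Assumption: a user label overrides the system prediction.
        assigned = p["label"] if p["label"] is not None else p["prediction"]
        if assigned == parent_class:
            selected.append(p)
    return selected
```

Calling it with `parent_class="Dresses"` returns only the products that would be available for organizing by ‘Sleeve Length’.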
Menubar/Toolbar
- Zoom tool offers Zoom In, Zoom Out, Rectangular area selection, and Zoom Reset.
- Lasso Selection helps select all the outliers and opens them for labeling in the bulk edit page.
- Image Size enlarges or shrinks the icons on the 2D canvas for better visibility.
- Reset makes the system unlearn what it learned from previous interactions, so that labeling starts again from scratch.
Metrics
A number of metrics provide insight into the data organization.
Let’s start with some basic counts of data points. Each of these counts is available at the class level (when hovering over the class at the bottom of the screen) and at the current taxonomy level (when hovering over the completion rate indicator).
System predicted data - The number of data points that the system could predict with a confidence higher than 80%.
User labeled data - The number of data points labeled by the user in either single or bulk edit mode. Predicted data points, once labeled by the user, move from the predicted count to the labeled count. For more granular information, user labeled data is broken down into accepted and corrected data points. These ratios give a sense of the AI’s understanding and correctness in organizing the data.
User Accepted - Data points that the AI predicted correctly and the user accepted.
User Corrected - Data points that were mispredicted, or had no prediction, and that the user labeled.
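The relationship between these counts can be sketched as follows. The record layout is hypothetical: `label` is the user's label (or None), and `prediction` is the class the system had predicted with enough confidence at the time, or None if the data point was an outlier.

```python
from typing import Dict, List, Optional


def count_metrics(points: List[Dict[str, Optional[str]]]) -> Dict[str, int]:
    """Count predicted, labeled, accepted and corrected data points."""
    counts = {
        "system_predicted": 0,
        "user_labeled": 0,
        "user_accepted": 0,
        "user_corrected": 0,
    }
    for p in points:
        if p["label"] is not None:
            # A labeled point no longer counts as predicted.
            counts["user_labeled"] += 1
            if p["prediction"] == p["label"]:
                counts["user_accepted"] += 1
            else:
                # Mispredicted, or no prediction at all.
                counts["user_corrected"] += 1
        elif p["prediction"] is not None:
            counts["system_predicted"] += 1
    return counts
```

In this sketch `user_accepted + user_corrected` always equals `user_labeled`, which matches the breakdown described above.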
Completion rate
The completion rate is the percentage of data points that have a label or a prediction, out of the total data points available for classification at this level of the taxonomy. It can also be viewed as a measure of the ratio of data points that fall within clusters to the outliers. This helps you decide when to stop working at the current taxonomy level and switch to another.
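As a worked example of the definition above, the completion rate can be computed from the labeled and predicted counts. The function and parameter names are hypothetical, and returning 0 for an empty level is an assumption.

```python
def completion_rate(labeled: int, predicted: int, total: int) -> float:
    """Percentage of data points that have either a label or a prediction."""
    if total <= 0:
        # Assumption: an empty level is reported as 0% complete.
        return 0.0
    return 100.0 * (labeled + predicted) / total
```

For instance, with 120 labeled and 860 predicted data points out of 1,000, the completion rate is 98%, leaving 20 outliers.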
Average confidence score
As the user organizes the data, a portion of the user labeled data is held out from the training set and used as a validation set, to give a sense of how well the AI understands the user interactions. The average confidence score is a measure of the system’s performance on the validation set.
Note:
This number is not very meaningful for the first few iterations of predictions, since it is computed on a very small validation set. It becomes more meaningful a few rounds into the interactions.
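The hold-out idea above can be illustrated with a small sketch. How the product actually selects the validation set and computes the score is not specified here, so everything below is an assumption: a random 20% hold-out, and a score defined as the mean confidence over the validation set. The function names are hypothetical.

```python
import random
from typing import Iterable, List, Optional, Tuple


def hold_out_validation(
    labeled_ids: Iterable[int],
    fraction: float = 0.2,  # assumed hold-out share
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Randomly split labeled ids into (training, validation)."""
    ids = list(labeled_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * fraction)
    return ids[n_val:], ids[:n_val]


def average_confidence(confidences: List[float]) -> Optional[float]:
    """Mean confidence on the validation set, or None if the set is empty."""
    if not confidences:
        return None
    return sum(confidences) / len(confidences)
```

With only four labeled data points and a 20% hold-out, the validation set in this sketch is empty, which is consistent with the note that the score carries little meaning in the first few iterations.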