Data Engineer Interviews: Design Data Model for Cloud Storage Platform
When did cloud storage become a thing ?
The cloud storage has substantially changed the way we live and work over the past decade more than any other technology. According to IBM , The Cloud was first used on mainframe computers in the 1950s, despite many believing it is a new technology. During the 1960s, it evolved and was used by governments to construct reliable communications networks. In the 1970s VM’s took technology to the next level.
Let’s start identifying the Success and Improvement Metrics of a cloud storage platform.
Product Sense:
Success Metrics
- New Users vs Existing Users, active/inactive/returned
- Logins / Usage Daily/Monthly/etc
- Upload Success Rates
- Time to execute each step of customer journey (login → file uploaded successfully)
- Number of files uploaded / download — bandwidth
- File sizes, aging, types (image, pdf, etc)
- Content Sharing metrics (file level, folder level)
Improvement Metrics
- Device type / region distribution of users (mobile/desktop/tab) → to drive adoption on specific platforms
- Growth Accounting (new/inactive/active/churn/return/etc)
- Bandwidth consumption by activity (up/download)
- Success/failure rate for each step of user journey (up/down load funnel) w\time spent
- Flow analysis … reduce # of actions required to complete user journey
Data Modeling: Four-Step Dimensional Design Process By Kimball Group
- Select the Business Process
- Declare the Grain (“How do you describe a single row in the fact table?”)
- Identify Dimensions (“How do business people describe the data resulting from the business process measurement events?”)“who, what, where, when, why, and how” associated with the event.
- Identify Facts
Solution
Business Process : The cloud storage company would like to design a data model to capture all critical data elements and answer the following questions.
Track how many files are Shared and % of Shared Files in a week.
what file type is shared more frequently ?
How many files had more than one owner at a given time?
Total File Actions by Content Categories.
Grain : Individual File on each Transaction Level
Dimensions : Date , Users/Customers, Drivers , Plans/Subscriptions , File Types.
Facts : Files , Actions/Events , Shared Files
Derived Facts: These are recommend to be stored physically in the table.In this case study, the total amount is straightforward, but storing it means its computing consistently in the ETL process, to eliminate the possibility of user calculation errors. Certain Derived Facts can be added in the view to minimize Space or can be added in any BI tool.
Section 1: Dimensions
Date Dimension:
All Data models always need an explicit date dimension table. There are many date components that are not supported by the SQL date function, including week numbers, fiscal years, seasons, holidays, workdays and weekends. Rather than attempting to determine these calculations in a query, its would be easy to store in the date dimension table.
Users Dimension:
In this area , we will cover the users who are using the Cloud service. We will use this information in the center of the model. The users table contains a list of all relevant users ( Current and New Users). For each user we will store their Unique ID , Name ( First and Last Name) , Email , Address , Phone Numbers , joined date , Current Plan ID ( To determine which subscription they are currently using) , Is Active ( To determine if user is still active or Opted out of service) , Payment ID and default deviceID.
Note that User Login Details are not mentioned in the users table, we can certainly maintain different tables to track the each login made by the user( Login Start Date , Login End Date) based on his permissions if any.
Also Note that if user have more than one address or phone numbers , we can add communications( With Type ( Phone/Email/Address) and Preferences tables.
Plans Dimension:
In this area , we will cover all the monthly/yearly membership plans that user can opt for on the platform. For each plan , there is a unique ID , Plan Name , No of Screens Limit , Base Rate , Plan Start Date , Is Active columns.
Devices Dimension:
This table stores all the devices that are compatible to use the platform. All devices have unique ID ,type of device( mobile , Tablet , Desktop) and Is Active Flag to indicate the status of the device.
Note that User can have more than one device active at a time. you can also maintain a table at a grain of UserID and DeviceID to support it.
Section 2: Facts
Files Fact:
In this area, we will cover all the files uploaded by the user by accessing the cloud storage platform. Every record in this table has a unique file id, user id, device id , IsFileShared (Y/N) , Upload Start and End timestamps , File Type (Video, Image etc), File Size , Status of the File ( 0 : Cancelled , 1: In Progress 2: Completed).
Note that User and File ID are unique in the above fact table. But if the same file is shared with multiple users , we store that data in different table.
Shared Files Fact:
In this area, we will cover all the shares done by the user on each file by accessing the cloud storage platform. Every record in this table has file id, user id, shared user id , device id , Shared Date , IsFileDownloadable(Y/N), Sharing type ( 0 : Read , 1: Read/Edit).
Actions/Events Fact:
This table covers all the actions done by the user. The data is stored as rows for each event done by user. Every record has unique Action ID, File ID , User ID, Action Type ( Upload, shared, download etc).
Payments Fact:
This table stores all the payments made by the user. Every payments includes a Payment ID , User id ( who made the payment) , Date of Payment , Payment Method , Total Amount , Transaction Id and Status of Payment.