Friday 12 October 2018

System Design - 'Here Now' service


Aim: 


Design a location service where we can look up a user’s current location and also find all users at a given location. Gaining insights about user location helps with better recommendations and predictive analysis.

Requirements:


1. Users should be able to check in to a location.
2. Given a location (latitude/longitude), we should be able to find all users at that location.
3. The system should provide a real-time search experience with very low latency.
4. The service should support a heavy search load.
5. The system should be durable and highly available.

Scaling Estimation:


1. 200 Million check-in places
2. 100 Million active users
3. 20 Million daily check-ins
4. 30% growth every year

Database Schema:


A. Details for every check-in point – Location Table (Stored in MySQL):
1. LUID – Location Unique ID
2. Name - Location Name
3. Longitude
4. Latitude
5. Category – Breakfast/Lunch/Dinner/Coffee & Tea/Nightlife/Things to do
6. Description - Small description of the place


Location Table

Column        Type           Notes
LUID          Int            Primary key
Name          Varchar(32)
Longitude     Varchar(8)
Latitude      Varchar(8)
Category      Varchar(20)
Description   Varchar(300)

B. Details for every user – User Table (Stored in MySQL):
1. UUID – User Unique ID
2. Name – Name of the user
3. Email – email address of the user
4. DateOfBirth – Birth date of the user
5. CreationDate – Date when the user joined 
6. LastLogin – Last login information of the user

User Table

Column         Type           Notes
UUID           Int            Primary key
Name           Varchar(20)
Email          Varchar(32)
DateOfBirth    datetime
CreationDate   datetime
LastLogin      datetime

System APIs:


1. Search users in a given location:
search(api_dev_key, location, radius_filter, max_number_of_results_to_return)
api_dev_key: Mandatory. Developer API key, used for purposes such as throttling access based on the allowed quota.
location: Mandatory. Location where the search is performed.
radius_filter: Optional. Specifies the search radius. The default is 0 meters (only the exact location is considered).
max_number_of_results_to_return: Optional. Specifies the maximum number of results to return. By default, all users are returned.


2. Check in to a location: 
checkin(user_id, location_uid, user_location_latitude, user_location_longitude)
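A minimal, self-contained sketch of these two calls in Python (in-memory only; a real implementation would hit the storage described below, and the meters-per-degree distance math is approximate):

import math, time

_checkins = {}   # user_id -> (location_uid, latitude, longitude, check_in_time)

def checkin(user_id, location_uid, user_location_latitude, user_location_longitude):
    """Record the user's latest check-in, overwriting any previous one."""
    _checkins[user_id] = (location_uid, user_location_latitude,
                          user_location_longitude, time.time())

def search(api_dev_key, location, radius_filter=0, max_number_of_results_to_return=None):
    """Return users whose latest check-in lies within radius_filter meters of location."""
    # api_dev_key would be validated here and used for quota-based throttling.
    lat, lng = location
    results = []
    for user_id, (_luid, ulat, ulng, _ts) in _checkins.items():
        dx = (ulng - lng) * 111_320 * math.cos(math.radians(lat))   # ~meters per degree longitude
        dy = (ulat - lat) * 110_574                                  # ~meters per degree latitude
        if math.hypot(dx, dy) <= radius_filter:
            results.append(user_id)
    if max_number_of_results_to_return is not None:
        results = results[:max_number_of_results_to_return]
    return results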


System Design:



1. MySQL solution: 
The user–location information will be tracked in a ‘user_location_table’ in MySQL. When a user checks in for the first time, we insert a new row with UUID, LUID, and the current timestamp. If the user later checks in to another location, we update the existing row.


user_location_table

Column            Type        Notes
UUID              Int         Primary key
LUID              Int
RecordTimestamp   datetime

A. High-level query to get the location of a particular user: 
select name from location_table where LUID in (select LUID from user_location_table where UUID = <UUID1> and RecordTimestamp > NOW() - INTERVAL 3 HOUR)

B. High-level query to get all users in a particular location: 
select name from user_table where UUID in (select UUID from user_location_table where LUID = <LUID1> and RecordTimestamp > NOW() - INTERVAL 3 HOUR)

NOTE: If the data is huge, this will not return results in near real time.
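A sketch of the check-in write for this MySQL design, assuming the mysql-connector-python client, the table above, and that UUID is the primary key of user_location_table (so a repeat check-in overwrites the previous row):

import mysql.connector   # assumes the mysql-connector-python package

def check_in(conn, uuid, luid):
    """Insert the user's check-in, or overwrite the previous one if the user already has a row."""
    sql = ("INSERT INTO user_location_table (UUID, LUID, RecordTimestamp) "
           "VALUES (%s, %s, NOW()) "
           "ON DUPLICATE KEY UPDATE LUID = VALUES(LUID), RecordTimestamp = NOW()")
    cursor = conn.cursor()
    cursor.execute(sql, (uuid, luid))
    conn.commit()
    cursor.close()

# conn = mysql.connector.connect(host="...", user="...", password="...", database="...")
# check_in(conn, 42, 1001)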


2. Solution based on HBase (Preferred solution):
We can use HBase to store the check-in location of each user. We can use the ‘Time To Live’ (TTL) feature of HBase to delete a user’s check-in 3 hours after it is written. The TTL can be set to 3 hours (10800 seconds).

HBase table creation:
create 'user_check_in_info', {NAME => 'c', TTL => 10800}

Table structure:
Row key: <UUID>
Column qualifier: check_in_location (a constant string)
Column value: <LUID/Location name>

If the same user checks in to a different location within 3 hours of the previous check-in, then we will update the same row with the new location. E.g.:

put 'user_check_in_info', 'UUID1', 'c:check_in_location', 'LUID1/Location name 1'

Now the user is tagged to location ‘LUID1’. If the user checks in to a different location, the row will be updated:

put 'user_check_in_info', 'UUID1', 'c:check_in_location', 'LUID2/Location name 2'

A. To query the location of a particular user: 
This is a simple ‘get’ operation in HBase:

get 'user_check_in_info', 'UUID1'
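The same operations can be issued programmatically; a minimal sketch, assuming an HBase Thrift gateway is running and the happybase Python client is installed (host and values are placeholders):

import happybase

connection = happybase.Connection(host="hbase-thrift-host")   # placeholder host
table = connection.table("user_check_in_info")

# Check the user in (equivalent of the shell 'put'); the 10800-second TTL expires it automatically.
table.put(b"UUID1", {b"c:check_in_location": b"LUID1/Location name 1"})

# Read back the user's current location (equivalent of the shell 'get' above).
print(table.row(b"UUID1").get(b"c:check_in_location"))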

B. To query all users in a given location: 
Since the HBase row key is the UUID, getting all users at a given location requires scanning the entire table and keeping only the rows whose check_in_location matches the required location. This is not optimal: query latency will grow as the data grows. The solution can be improved by caching the data in memory, and the next section discusses one such approach.

Faster retrieval using GRIDs: 
Divide the whole map into small grids. Each grid is responsible for a group of locations that fall within a specified range of latitude and longitude. As a result, we can restrict a query to a few grids, which reduces the amount of data examined. Given a location and the optional radius, we can also find the neighboring grids of interest. Hence we can use this structure not only to get the list of users at a particular location, but also at locations near the given one.

The grids will be used to track the location of each user.


Deciding the size of each grid: 
We could pre-divide the map and assign a fixed range of latitude and longitude to each grid. But this approach runs into trouble when certain areas are crowded (e.g. San Francisco, New York) while others are sparsely populated (e.g. somewhere in Antarctica!).

We can modify the solution to use dynamic grids, where the grid boundaries are decided by the number of check-in places. For example, we create a new grid when the number of check-in places in a grid reaches, say, 500. This way each grid holds a more uniform amount of data. This can be implemented using a quadtree data structure.

Implementation: 
We will build a tree structure in which each node has four children - a QuadTree. Each grid, i.e. each node, holds the details of the users present at the locations it covers. Once the number of check-in places in a node reaches the specified limit (500 in our case), the node is split into four new children and the load is distributed among them. Hence only the leaf nodes hold user information. To find the neighboring grids, we can connect all the leaf nodes with a doubly linked list.
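A compact sketch of such a QuadTree node, assuming each node covers a latitude/longitude bounding box and splits once it holds more than 500 check-in places (leaf linking for neighbor lookups is omitted for brevity):

MAX_PLACES_PER_NODE = 500   # split threshold described above

class QuadTreeNode:
    def __init__(self, lat_min, lat_max, lng_min, lng_max):
        self.bounds = (lat_min, lat_max, lng_min, lng_max)
        self.places = []        # (luid, lat, lng) check-in places held by this leaf
        self.users = {}         # uuid -> (luid, check_in_timestamp) for users in this grid
        self.children = None    # four QuadTreeNode children once the node splits

    def contains(self, lat, lng):
        lat_min, lat_max, lng_min, lng_max = self.bounds
        return lat_min <= lat <= lat_max and lng_min <= lng <= lng_max

    def insert_place(self, luid, lat, lng):
        if self.children is not None:                      # internal node: push down
            return self._child_for(lat, lng).insert_place(luid, lat, lng)
        self.places.append((luid, lat, lng))
        if len(self.places) > MAX_PLACES_PER_NODE:         # too crowded: split into four
            self._split()

    def _child_for(self, lat, lng):
        return next(c for c in self.children if c.contains(lat, lng))

    def _split(self):
        lat_min, lat_max, lng_min, lng_max = self.bounds
        lat_mid, lng_mid = (lat_min + lat_max) / 2, (lng_min + lng_max) / 2
        self.children = [
            QuadTreeNode(lat_min, lat_mid, lng_min, lng_mid),
            QuadTreeNode(lat_min, lat_mid, lng_mid, lng_max),
            QuadTreeNode(lat_mid, lat_max, lng_min, lng_mid),
            QuadTreeNode(lat_mid, lat_max, lng_mid, lng_max),
        ]
        for luid, lat, lng in self.places:                 # redistribute places to the new leaves
            self._child_for(lat, lng).insert_place(luid, lat, lng)
        self.places = []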





How do we rebuild the QuadTree if the system crashes? 
We scan through all the records in the HBase ‘user_check_in_info’ table and, based on the location obtained for each user, decide which grid the user’s information should be kept in.
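A sketch of that rebuild, again assuming the happybase client from the earlier example and a hypothetical lookup_coordinates(luid) helper that reads a location's latitude/longitude from the MySQL location_table:

import happybase   # assumes an HBase Thrift gateway, as in the earlier sketch

def find_leaf(node, lat, lng):
    """Descend the QuadTree to the leaf whose bounding box covers (lat, lng)."""
    while node.children is not None:
        node = node._child_for(lat, lng)
    return node

def rebuild_quadtree(root, host="hbase-thrift-host"):      # placeholder host
    """Repopulate the QuadTree from whatever check-ins survived in HBase."""
    table = happybase.Connection(host=host).table("user_check_in_info")
    for row_key, data in table.scan(include_timestamp=True):
        value, ts_millis = data[b"c:check_in_location"]    # HBase keeps the cell timestamp
        luid = value.decode()
        lat, lng = lookup_coordinates(luid)                 # hypothetical LUID -> (lat, lng) lookup
        leaf = find_leaf(root, lat, lng)
        leaf.users[row_key.decode()] = (luid, ts_millis / 1000.0)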

Memory requirement: For each check-in we store the location ID (8 bytes), latitude (8 bytes), longitude (8 bytes), user ID (8 bytes), username (20 bytes), and check-in timestamp (8 bytes) in the grid. Since we assume 20 million check-ins every day, the total memory requirement will be:

(8 + 8 + 8 + 8 + 20 + 8) * 20M bytes = 60 * 20M bytes = ~1145 MB

We can add a few MBs of memory for the parent-to-child pointers needed to build the QuadTree. In total, let’s assume we need roughly 1.5 GB of memory to hold the entire tree. This can easily fit into the memory of a single machine. However, for better performance and scalability we can shard the data across multiple servers based on a hash function (discussed in the Data Sharding section).

What happens when a user creates a new check-in?

Whenever a new check-in is added by a user, we need to insert it into HBase as well as into the QuadTree. Since the QuadTree is distributed among different servers, we first need to find the grid/server for the new check-in and then add it there.

What happens when a user checks in to location X and then checks in to location Y?

We should track only the latest check-in by a user. We have designed our HBase table so that it keeps only the latest check-in; however, we also need to reflect the change in our QuadTree. Hence, before updating HBase, we first retrieve the user’s previous location (in this case X) with a ‘get’ operation, and then write the new location (in this case Y).

get 'user_check_in_info', 'UUID1'

With the previous location obtained from the HBase table, we modify the QuadTree to remove the user from location X and add the user to location Y.
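Putting the two writes together, a sketch of the check-in path, reusing the happybase table, the QuadTree, and the hypothetical lookup_coordinates/find_leaf helpers from the sketches above:

import time

def handle_check_in(table, root, uuid, new_luid):
    """Move the user's check-in from their previous location to new_luid."""
    row_key = uuid.encode()

    # 1. Fetch the previous check-in (location X), if any, so it can be removed from the QuadTree.
    previous = table.row(row_key).get(b"c:check_in_location")
    if previous is not None:
        old_lat, old_lng = lookup_coordinates(previous.decode())
        find_leaf(root, old_lat, old_lng).users.pop(uuid, None)

    # 2. Write the new check-in (location Y) to HBase; the 3-hour TTL handles expiry there.
    table.put(row_key, {b"c:check_in_location": new_luid.encode()})

    # 3. Add the user to the QuadTree leaf that covers the new location.
    new_lat, new_lng = lookup_coordinates(new_luid)
    find_leaf(root, new_lat, new_lng).users[uuid] = (new_luid, time.time())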

How do we filter out users whose check-in is older than 3 hours?

Since we have set a TTL, data is purged from the HBase table 3 hours after check-in. However, the data will still reside in the QuadTree. Hence, when we retrieve the list of users for a particular location, we need to filter out users whose check-in happened more than 3 hours ago. This can be implemented using the check-in timestamp stored in the QuadTree and the current timestamp. (Rule: current timestamp - check-in timestamp < 3 hours)
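With the check-in timestamps kept in the QuadTree (as in the sketches above), the rule reduces to a small filter:

import time

THREE_HOURS = 3 * 60 * 60   # matches the 10800-second TTL

def active_users_at(leaf, luid, now=None):
    """Users in this leaf whose check-in at luid is less than 3 hours old."""
    now = now if now is not None else time.time()
    return [uuid for uuid, (user_luid, checked_in_at) in leaf.users.items()
            if user_luid == luid and now - checked_in_at < THREE_HOURS]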

Housekeeping: We can run a background chore that periodically checks for expired records (older than 3 hours) and cleans them up, similar to the compaction activity performed by HBase. This is required to reclaim space and maintain good overall performance.

Data Sharding:


user_check_in_info HBase table: We can shard the data based on the hash of the UUID. Ideally the UUID will be a monotonically increasing integer generated by MySQL, so we need to hash it to get a uniform distribution of data across the servers. Based on the hash function, we can also create a pre-split HBase table, which helps with performance and load balancing.

users_in_location QuadTree: One option is to use a hash function to map each location ID to the server that stores that location and the users present there. However, this can cause hot-spotting when a location is crowded with users; for example, during an event at Levi’s Stadium, Santa Clara, there might be a large number of check-ins to the stadium. Hence, partitioning based on the hash of the UUID is a better solution. This comes with the overhead of extra computation when we need to fetch the users at a particular location: since the data is sharded by UUID and can reside on any of the QuadTree servers, we have to send the query for a given location to all the servers. Once each server responds with its set of users, an aggregation service combines the results from all the individual servers and presents the complete output, as sketched below.
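A sketch of that scatter-gather step; each element of quadtree_servers is assumed to be a client object exposing a users_at(luid) call (the call name and transport are stand-ins, not part of the design above):

from concurrent.futures import ThreadPoolExecutor

def users_in_location(quadtree_servers, luid):
    """Fan the location query out to every QuadTree shard and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(quadtree_servers)) as pool:
        partial_results = pool.map(lambda server: server.users_at(luid), quadtree_servers)
    users = set()
    for partial in partial_results:      # aggregation service: union of per-shard user sets
        users.update(partial)
    return users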

Fault Tolerance & Replication:

We would need replicas of these servers so that if the primary dies, a secondary can take over. We will store this data in a distributed file system like HDFS/MapRFS; by default, these file systems have a data replication factor of 3. We can also make use of SSDs for faster I/O.

Disk space required to store data:


For user details: 
8 (UUID) + 20 (Name) + 32 (Email) + 8 (DateOfBirth) + 8 (CreationDate) + 8 (LastLogin) = 84 bytes per user
For 100 million users: 84 * 100M = ~8GB

For location details: 
8 (LUID) + 32 (Name) + 8 (Latitude) + 8 (Longitude) + 20 (Category) + 300 (Description) = 374 bytes per location
For 200 million locations: 374 * 200M = ~70GB

For the user-location tracking table: 
8 (UUID) + 8 (LUID) + 4 (HBase metadata) = 20 bytes per check-in
For 20 million check-ins: 20 * 20M = ~380MB

Hence, we need ~80GB of storage space for the first year. If we design the system for 10 years with year-over-year growth of 30%, we will need ~1110GB of storage space. With a data replication factor of 3, the total storage required will be 1110 * 3 GB = 3330GB.
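A quick back-of-the-envelope check of that projection, compounding the rounded ~80GB first-year figure at 30% for ten years as the text does:

first_year_gb = 80                        # ~8GB users + ~70GB locations + ~0.4GB check-ins, rounded
ten_year_gb = first_year_gb * 1.3 ** 10   # 30% year-over-year growth, compounded over 10 years
replicated_gb = ten_year_gb * 3           # replication factor of 3

print(round(ten_year_gb), round(replicated_gb))   # ~1103 and ~3309, in line with the ~1110GB and 3330GB above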

High Level System Design: