09 Design A URL Shortener

Original source
09

Design A URL Shortener

In this chapter, we will tackle an interesting and classic system design interview question: designing a URL shortening service like tinyurl.

Step 1 - Understand the problem and establish design scope

System design interview questions are intentionally left open-ended. To design a well-crafted system, it is critical to ask clarification questions.

Candidate: Can you give an example of how a URL shortener work?
Interviewer: Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long is the original URL. Your service creates an alias with shorter length: https://tinyurl.com/y7keocwj. If you click the alias, it redirects you to the original URL.

Candidate: What is the traffic volume?
Interviewer: 100 million URLs are generated per day.

Candidate: How long is the shortened URL?
Interviewer: As short as possible.

Candidate: What characters are allowed in the shortened URL?
Interviewer: Shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).

Candidate: Can shortened URLs be deleted or updated?
Interviewer: For simplicity, let us assume shortened URLs cannot be deleted or updated.

Here are the basic use cases:

  1. URL shortening: given a long URL => return a much shorter URL

  2. URL redirecting: given a shorter URL => redirect to the original URL

  3. High availability, scalability, and fault tolerance considerations

Back of the envelope estimation

It is important for you to walk through the assumptions and calculations with your interviewer so that both of you are on the same page.

Step 2 - Propose high-level design and get buy-in

In this section, we discuss the API endpoints, URL redirecting, and URL shortening flows.

API Endpoints

API endpoints facilitate the communication between clients and servers. We will design the APIs REST-style. If you are unfamiliar with restful API, you can consult external materials, such as the one in the reference material [1]. A URL shortener primary needs two API endpoints.

  1. URL shortening. To create a new short URL, a client sends a POST request, which contains one parameter: the original long URL. The API looks like this:

    POST api/v1/data/shorten
    
    - request parameter: {longUrl: longURLString}
    - return: shortURL
    
  2. URL redirecting. To redirect a short URL to the corresponding long URL, a client sends a GET request. The API looks like this:

    GET api/v1/shortUrl
    
    - return: longURL for HTTP redirection
    

URL redirecting

Figure 1 shows what happens when you enter a tinyurl onto the browser. Once the server receives a tinyurl request, it changes the short URL to the long URL with 301 redirect.

Image represents a network request response, specifically showing details of an HTTP GET request and its corresponding response headers.  The top section displays the request details: the URL (`https://tinyurl.com/qtj5opu`), the method (GET), a 301 status code (indicating a redirect), the remote server address (`[2606:4700:10::6814:391e]:443`), and the referrer policy ('no-referrer-when-downgrade'). Below, the 'Response Headers' section lists various headers sent by the server, including `alt-svc` (specifying alternative service protocols), `cache-control` (setting caching directives), `cf-cache-status` (indicating a dynamic response from Cloudflare), `cf-ray` (a Cloudflare request identifier), `content-type` (specifying HTML content with UTF-8 encoding), `date` (timestamp of the response), `expect-ct` (related to certificate transparency), and crucially, `location` (containing the redirect URL: `https://www.amazon.com/dp/B017V4NTFA?pLink=63eaef76-979c-4d&ref=adblp13nvvxx_0_2_im`).  The arrangement is linear, with request details above the response headers, and each header is presented as a key-value pair.  The information flow is unidirectional, from the client making the request to the server providing the response, including the redirect location.
Figure 1

The detailed communication between clients and servers is shown in Figure 2.

Image represents a simplified URL redirection system.  A 'Client' (represented by a smartphone icon) initiates the process by visiting a short URL: `https://tinyurl.com/qtj5opu`. This request is sent to a 'tinyurl server,' which then responds with a 301 status code (indicating a permanent redirect) and the 'location' of the long URL: `https://www.amazon.com/dp/B017V4NTFA?pLink=63eaef76-979c-4d&ref=adblp13nvvxx_0_2_im`.  The client then automatically follows this redirect, sending a new request to the 'Amazon server' to visit the long URL.  The diagram uses arrows to show the flow of requests and responses between the client, the tinyurl server, and the Amazon server.  The text labels clearly indicate the URLs involved and the HTTP status code of the redirect.
Figure 2

One thing worth discussing here is 301 redirect vs 302 redirect.

301 redirect. A 301 redirect shows that the requested URL is “permanently” moved to the long URL. Since it is permanently redirected, the browser caches the response, and subsequent requests for the same URL will not be sent to the URL shortening service. Instead, requests are redirected to the long URL server directly.

302 redirect. A 302 redirect means that the URL is “temporarily” moved to the long URL, meaning that subsequent requests for the same URL will be sent to the URL shortening service first. Then, they are redirected to the long URL server.

Each redirection method has its pros and cons. If the priority is to reduce the server load, using 301 redirect makes sense as only the first request of the same URL is sent to URL shortening servers. However, if analytics is important, 302 redirect is a better choice as it can track click rate and source of the click more easily.

The most intuitive way to implement URL redirecting is to use hash tables. Assuming the hash table stores <shortURL, longURL> pairs, URL redirecting can be implemented by the following:

URL shortening

Let us assume the short URL looks like this: www.tinyurl.com/{hashValue}. To support the URL shortening use case, we must find a hash function fx that maps a long URL to the *hashValue*, as shown in Figure 3.

Image represents a simplified URL shortening system.  A rectangular box labeled 'longURL' containing the text 'longURL' represents the original, lengthy URL.  An arrow points from this box to a second rectangular box containing the mathematical function symbol 'ƒ(x)', representing a hashing function. This function takes the 'longURL' as input.  Below the 'ƒ(x)' box, a downward-pointing arrow labeled 'hash' connects it to a third rectangular box, which displays a shortened URL: 'https://tinyurl.com/qtj5opu'.  The text 'Viewer does not support full SVG' is present below the shortened URL box, indicating a limitation of the image display, not the system itself.  The overall diagram illustrates the process of inputting a long URL, applying a hashing function (ƒ(x)) to generate a short hash, and then using that hash to create a shortened, accessible URL.
Figure 3

The hash function must satisfy the following requirements:

Detailed design for the hash function is discussed in deep dive.

Step 3 - Design deep dive

Up until now, we have discussed the high-level design of URL shortening and URL redirecting. In this section, we dive deep into the following: data model, hash function, URL shortening and URL redirecting.

Data model

In the high-level design, everything is stored in a hash table. This is a good starting point; however, this approach is not feasible for real-world systems as memory resources are limited and expensive. A better option is to store <shortURL, longURL> mapping in a relational database. Figure 4 shows a simple database table design. The simplified version of the table contains 3 columns: id, shortURL, longURL.

Image represents a simplified relational database table schema, likely for a URL shortening service.  The table is named 'url' and contains three columns. The first column, labeled 'PK' (Primary Key), indicates that the 'id' column serves as the unique identifier for each row.  The 'id' column itself presumably holds a unique integer representing each shortened URL entry. The second column, 'shortURL', stores the shortened URL string, and the third column, 'longURL', stores the corresponding original, longer URL string.  No information flows between the columns; rather, each row represents a single mapping between a short URL and its long URL counterpart. The table structure suggests that given a shortURL, one can retrieve the corresponding longURL, and vice-versa, using the 'id' as the primary key for efficient lookups.
Figure 4

Hash function

Hash function is used to hash a long URL to a short URL, also known as hashValue.

Hash value length

The hashValue consists of characters from [0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters. To figure out the length of hashValue, find the smallest n such that 62^n ≥ 365 billion. The system must support up to 365 billion URLs based on the back of the envelope estimation. Table 1 shows the length of hashValue and the corresponding maximal number of URLs it can support.

nMaximal number of URLs
162^1 = 62
262^2 = 3,844
362^3 = 238,328
462^ 4 = 14,776,336
562^5 = 916,132,832
662^6 = 56,800,235,584
762^7 = 3,521,614,606,208 = ~3.5 trillion
862^8 = 218,340,105,584,896

Table 1

When n = 7, 62 ^ n = ~3.5 trillion, 3.5 trillion is more than enough to hold 365 billion URLs, so the length of hashValue is 7.

We will explore two types of hash functions for a URL shortener. The first one is “hash + collision resolution”, and the second one is “base 62 conversion.” Let us look at them one by one.

Hash + collision resolution

To shorten a long URL, we should implement a hash function that hashes a long URL to a 7-character string. A straightforward solution is to use well-known hash functions like CRC32, MD5, or SHA-1. The following table compares the hash results after applying different hash functions on this URL:
https://en.wikipedia.org/wiki/Systems_design

Hash functionHash value (Hexadecimal)
CRC325cb54054
MD55a62509a84df9ee03fe1230b9df8b84e
SHA-10eeae7916c06853901d9ccbefbfcaf4de57ed85b

Table 2

As shown in Table 2, even the shortest hash value (from CRC32) is too long (more than 7 characters). How can we make it shorter?

The first approach is to collect the first 7 characters of a hash value; however, this method can lead to hash collisions. To resolve hash collisions, we can recursively append a new predefined string until no more collision is discovered. This process is explained in Figure 5.

Image represents a flowchart depicting the process of URL shortening.  It begins with a green circle labeled 'start,' which initiates the process.  An arrow leads to a light-blue rectangle labeled 'input: longURL,' representing the input of a long URL. This long URL then flows to a light-blue rectangle labeled 'hash function,' where a hashing algorithm is applied to generate a shortened URL. The output, 'shortURL,' is shown in another light-blue rectangle.  This short URL then enters a diamond-shaped decision node labeled 'exist in DB?', checking if the short URL already exists in a database (DB). If 'yes' (a collision), an arrow points to a light-blue rectangle labeled 'longURL + predefi..,' indicating that the process might involve appending a predefined string to the long URL and repeating the hashing process. If 'no,' an arrow leads to a light-blue rectangle labeled 'save to DB,' where the short URL and its corresponding long URL are stored in the database. Finally, an arrow leads from 'save to DB' to a green circle labeled 'end,' signifying the completion of the process.  The arrows indicate the flow of data and control throughout the system.
Figure 5

This method can eliminate collision; however, it is expensive to query the database to check if a shortURL exists for every request. A technique called bloom filters [2] can improve performance. A bloom filter is a space-efficient probabilistic technique to test if an element is a member of a set. Refer to the reference material [2] for more details.

Base 62 conversion

Base conversion is another approach commonly used for URL shorteners. Base conversion helps to convert the same number between its different number representation systems. Base 62 conversion is used as there are 62 possible characters for hashValue. Let us use an example to explain how the conversion works: convert 1115710 to base 62 representation (1115710 represents 11157 in a base 10 system).

Image represents a diagram illustrating the conversion of a decimal number (11157) into its base-62 representation.  The left side shows a repeated division process.  The number 11157 is repeatedly divided by 62, with the remainders shown on the right (59, 55, and 2).  Each division's quotient becomes the dividend in the next step (11157/62 = 179 remainder 59; 179/62 = 2 remainder 55; 2/62 = 0 remainder 2). The remainders are listed in reverse order to form the base-62 representation ('X T 2'), where 'X' represents 59 and 'T' represents 55. The final remainder is 0, indicating the end of the conversion.  The diagram visually connects each division step with a line, showing the flow of the calculation from the initial decimal number to its base-62 equivalent.
Figure 6

Comparison of the two approaches

Table 3 shows the differences of the two approaches.

Hash + collision resolutionBase 62 conversion
Fixed short URL length.Short URL length is not fixed. It goes up with the ID.
Does not need a unique ID generator.This option depends on a unique ID generator.
Collision is possible and needs to be resolved.Collision is not possible because ID is unique.
It’s not possible to figure out the next available short URL because it doesn’t depend on ID.It is easy to figure out what is the next available short URL if ID increments by 1 for a new entry. This can be a security concern.

Table 3

URL shortening deep dive

As one of the core pieces of the system, we want the URL shortening flow to be logically simple and functional. Base 62 conversion is used in our design. We build the following diagram (Figure 7) to demonstrate the flow.

Image represents a flowchart depicting the process of URL shortening.  It begins with a rectangular box labeled '1. input: longURL,' representing the input of a long URL. An arrow connects this to a diamond-shaped decision box, '2. longURL in DB?', which checks if the long URL already exists in the database (DB).  If 'yes,' an arrow points to a rectangular box, '3. return shortURL,' indicating that the corresponding short URL is returned. If 'no,' an arrow leads to a rectangular box, '4. Generate a ne...', suggesting the generation of a new unique ID. This is followed by a box, '5. Convert ID to...', implying the conversion of this ID into a short URL. Finally, an arrow connects to '6. Save ID, shor...', indicating the saving of both the ID and the short URL into the database.  The entire flow is sequential, with each step dependent on the previous one, culminating in either returning an existing short URL or generating and saving a new one.
Figure 7
  1. longURL is the input.

  2. The system checks if the longURL is in the database.

  3. If it is, it means the longURL was converted to shortURL before. In this case, fetch the shortURL from the database and return it to the client.

  4. If not, the longURL is new. A new unique ID (primary key) Is generated by the unique ID generator.

  5. Convert the ID to shortURL with base 62 conversion.

  6. Create a new database row with the ID, shortURL, and longURL.

To make the flow easier to understand, let us look at a concrete example.

idshortURLlongURL
2009215674938zn9edcuhttps://en.wikipedia.org/wiki/Systems_design

Table 4

The distributed unique ID generator is worth mentioning. Its primary function is to generate globally unique IDs, which are used for creating shortURLs. In a highly distributed environment, implementing a unique ID generator is challenging. Luckily, we have already discussed a few solutions in the “Design A Unique ID Generator in Distributed Systems” chapter. You can refer back to it to refresh your memory.

URL redirecting deep dive

Figure 8 shows the detailed design of the URL redirecting. As there are more reads than writes, <shortURL, longURL> mapping is stored in a cache to improve performance.

Image represents a simplified system architecture diagram illustrating a user's request for a shortened URL and its resolution.  A user (represented by icons of a laptop and a mobile phone) initiates a GET request (1) to the shortened URL 'https://tinyurl.com/zn9edcu'. This request is directed to a load balancer (2), which distributes the traffic across multiple web servers. The web servers then return the long URL 'https://en.wikipedia.org/wiki/Systems_design' (5) to the user.  The web servers are connected to a cache (3) for faster access to frequently requested data and a database (4) for persistent storage of information.  The numbered arrows (1-5) indicate the flow of the request and response, showing the interaction between the user, load balancer, web servers, cache, and database.
Figure 8

The flow of URL redirecting is summarized as follows:

  1. A user clicks a short URL link: https://tinyurl.com/zn9edcu

  2. The load balancer forwards the request to web servers.

  3. If a shortURL is already in the cache, return the longURL directly.

  4. If a shortURL is not in the cache, fetch the longURL from the database. If it is not in the database, it is likely a user entered an invalid shortURL.

  5. The longURL is returned to the user.

Step 4 - Wrap up

In this chapter, we talked about the API design, data model, hash function, URL shortening, and URL redirecting.

If there is extra time at the end of the interview, here are a few additional talking points.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

Reference materials

[1] A RESTful Tutorial: https://www.restapitutorial.com/index.html

[2] Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter