There are many blogs and videos online that walk through answers to LeetCode problems. However, the perspective is almost always the interviewee's, rather than the interviewer's.

In my 25 years in Big Tech, I have conducted several thousand interviews: thousands at Amazon as a bar raiser, hundreds at Microsoft, and nearly a hundred at Google.

Big Tech has three types of interviews for software engineers: [1] coding, [2] system design, and [3] leadership, soft skills, culture, etc. An interview loop blends them together. For example, an SDE-I might have 3 coding, 1 design, and 1 leadership interview, while a principal engineer might have 1 coding, 2 design, and 3 leadership. Personally, I prefer the more open-ended leadership interviews because I feel I learn more about the candidate that way. However, coding interviews are a necessary evil.

Now, a disclaimer: all coding interview questions are terrible, including mine. You have 45 minutes to arrive at a hiring recommendation that could change someone's life. Candidates are nervous: the situation is stressful, they are exhausted, and there is an artificial time constraint. These questions are designed to fit the time frame and elicit specific signals. I tend to prefer simpler questions that lead to high-quality conversations where I can understand someone's thought process. The conversation is much more important to me than the actual code my candidate writes on the whiteboard. What trade-offs exist? How do you balance them? As an interviewer, it's my job to help my candidates have a good experience: to know when to guide, when to probe, and how to calibrate my expectations in an artificial environment.

For coding interviews, I often present variations of the following question, which is loosely based on things I have had to do in real life. Since I have decided to retire this question, I will break it down today. What am I looking for? What do I see candidates do?

Assume we have a website, and we track the pages customers view, for business metrics.

Whenever someone visits the site, we write a record to a log file consisting of a timestamp, a PageID, and a CustomerID. Over time, we end up with a large log file with many entries in that format, and every day produces a new file.
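For concreteness, a few records in one day's file might look something like this (the exact encoding is my assumption; the question only fixes the three fields):

```
# day1.log — timestamp,PageID,CustomerID
1693526400,page_17,customer_42
1693526401,page_03,customer_42
1693526404,page_17,customer_99
```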

Now, given two log files (the log file from day 1 and the log file from day 2), we want to generate a list of "loyal customers" who meet the following criteria: (a) they visited the site on both days, and (b) they visited at least two unique pages.

This question is not particularly difficult, but it does require some thought and an understanding of code complexity and data structures. You can easily get the customers who came on both days, and you can easily get the customers who viewed at least two unique pages, but computing the intersection of these two efficiently requires more work.

I may have asked this question 500 times, which has let me calibrate it well. I find that the hiring decision this question points to aligns with the candidate's final hiring decision across the full loop. I can also scale it up for more senior candidates, which gives it broad applicability, and I can offer plenty of hints along the way for candidates who are struggling.

Ask Clarifying Questions

Great candidates ask clarifying questions before jumping into coding. I want to see some intuition here, because I deliberately phrase the question in an ambiguous way.

“Do you mean two unique pages per day, or two unique pages overall?”

This significantly impacts the solution. I mean “two unique pages overall,” because that's the more interesting question. About half of candidates jump straight into coding without clarifying this, and half of those incorrectly assume I mean “two unique pages per day.” If it's a more junior candidate, I give plenty of hints before they start coding. If it's a more senior candidate, I wait a bit to see whether they will think more deeply about the algorithm.

In real life, engineers deal with ambiguity constantly, and most software problems stem from poorly defined requirements. So I want signals on how a candidate discovers and handles ambiguity.

Another clarifying question that 90% of candidates miss, but which is also important: what about duplicates? Does visiting the same page twice on the same day count? What about visiting the same page on different days? For this question, both are duplicates.

Another important question: what scale are we talking about here? Does the data fit in memory? Can I load the contents of one of these files into memory? Can I load both?

This question is inspired by a real-world system I worked on at Amazon (called ClickStream) that was responsible for tracking user behavior on Amazon.com. We handled events from millions of concurrent customers on a massive 10,000-host MapReduce cluster, with an entire engineering team maintaining and operating it. For the purposes of a 45-minute interview, though, imagine the data fits in memory and the scope is much smaller.

Finally, an important question: do we care more about performance, memory, or maintainability? There's a naive solution that is O(n²) in runtime but O(1) in memory. There's a better solution with an O(n) runtime but O(n) memory. And there's a middle-ground solution that does O(n log n) preprocessing with constant memory, after which the main algorithm is O(n) runtime with O(1) memory. Each has its pros and cons. Some heuristics you might use to improve performance or memory usage can make the code harder to read and evolve later. With more senior candidates, I hope to chat about these trade-offs.

Start with the Naive Solution

Theoretically, you could write a very simple SQL query, and certainly large tech companies have massive data warehouses where you could easily do this. But that's not what a coding interview is after. Since this is not a distributed-systems problem and the data fits in memory, why take on the complexity and dependencies of a database when you can solve it with 20 lines of simple code?

Starting point:
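Here is a minimal sketch of the setup in Java, assuming the CSV-style format above. All the names (LogEntry, readLog) are mine, for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class LoyalCustomers {
    // One record per page view: timestamp, PageID, CustomerID.
    record LogEntry(long timestamp, String pageId, String customerId) {
        static LogEntry parse(String line) {
            String[] parts = line.split(",");
            return new LogEntry(Long.parseLong(parts[0]), parts[1], parts[2]);
        }
    }

    // Load a whole day's log into memory (we established that it fits).
    static List<LogEntry> readLog(Path file) throws IOException {
        List<LogEntry> entries = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            entries.add(LogEntry.parse(line));
        }
        return entries;
    }
}
```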

About 80% of candidates initially go for the naive solution: for each entry in file 1, loop through everything in file 2, looking for entries with the same customer ID and tracking the pages they viewed.
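One compact way to write it, reusing the LogEntry sketch above (this pair test is equivalent to tracking the pages: a matching customer with two different pages proves both conditions at once, coming on both days and seeing at least two unique pages):

```java
// Naive solution: for each day-1 entry, scan every day-2 entry.
static Set<String> loyalCustomersNaive(List<LogEntry> day1, List<LogEntry> day2) {
    Set<String> loyal = new HashSet<>();
    for (LogEntry e1 : day1) {
        for (LogEntry e2 : day2) {
            if (e1.customerId().equals(e2.customerId())
                    && !e1.pageId().equals(e2.pageId())) {
                loyal.add(e1.customerId());
            }
        }
    }
    return loyal;
}
```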

The problem with the naive solution is that its runtime is O(n²).

I don't mind seeing the naive solution first, but I really want my candidates to have the AHA moment that O(n²) is probably never good enough. I hope that moment comes quickly, without hints. Unless there is a memory constraint or some other limitation that forces your hand, don't settle for an O(n²) algorithm.

After the candidate proposes O(n²), I smile politely and wait. I really hope the next words out of their mouth are “...but this is O(n²), so can I do better?” If not, I probe: “What is the runtime of that solution?” In most cases, the candidate has the AHA moment after the hint and moves on to a better solution.

Turning O(n²) into O(n)

At this point, you need to think about what data structures to use and how to store the data.

Poor candidates reach for linked lists or arrays. These have O(n) search, so no matter how hard you try, you end up back at an O(n²) algorithm. You could use a tree, but since search is O(log n), the best you can achieve overall is O(n log n).

Better candidates intuitively know that a map gives them the O(1) lookups they need to convert the O(n²) algorithm into an O(n) one. Great candidates proactively point out the downside: you will use O(n) memory. For a more nuanced take, you can even point out the hidden cost of repeatedly running the hash function on the map, which can be expensive. Overall, a faster runtime at the cost of more memory is a good trade-off here.

If you use a map, what is your key? What is your value? I see a variety of answers here. Some candidates use PageID as the key and CustomerID as the value, but that doesn't help. Then they switch to CustomerID as the key and PageID as the value. That's not quite right either, because it ignores the fact that a customer can visit many pages, not just one. Some candidates intuitively feel they need a set of pages as the value but choose a list instead, which makes my soul sad because it ignores duplicates. This is a great opportunity to explore data-structure knowledge (list versus set) as candidates think about how to handle duplicates.

So, a Map&lt;CustomerID, Set&lt;PageID&gt;&gt; would work. But will you load the contents of both files into one map? Have two maps, one per file? Or can you load only file 1 into the map and process file 2 without storing it?

Great candidates realize immediately that they can choose option 3. That's much less memory and a simpler algorithm. Good candidates get there with a small hint.

The “customers who came on both days” condition is simple: as you read entries from day 2, if the customer is already in the day-1 map, you know they came on both days:
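A sketch of that third option, again building on the earlier LogEntry type (method and variable names are mine):

```java
// O(n) time, O(n) memory: map day 1 into CustomerID -> unique PageIDs,
// then stream day 2 without storing it.
static Set<String> loyalCustomers(List<LogEntry> day1, List<LogEntry> day2) {
    Map<String, Set<String>> day1Pages = new HashMap<>();
    for (LogEntry e : day1) {
        day1Pages.computeIfAbsent(e.customerId(), k -> new HashSet<>())
                 .add(e.pageId());
    }

    Set<String> loyal = new HashSet<>();
    for (LogEntry e : day2) {
        Set<String> pages = day1Pages.get(e.customerId());
        if (pages != null) {
            // Came on both days. The "at least two unique pages" check
            // is the subtle part -- see below.
        }
    }
    return loyal;
}
```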

The “visited at least 2 unique pages” condition tends to be harder to get right, so if candidates get stuck, I throw out a small hint: you have the customer's set of pages from day 1 and the page they just visited on day 2... how do you determine whether that makes at least two unique pages?

Poor candidates loop through the elements of the day-1 set to check whether the day-2 page is in it. That turns the O(n) algorithm back into O(n²). The number of candidates who do this is surprising. Better candidates use .contains(), which is O(1) on a hash set. But the logic still takes care to get right.

The correct intuition is: once you know the customer came on both days, then if they visited at least two pages on day 1, they are loyal no matter which page they visited on day 2. Otherwise, they visited exactly one page on day 1, so the question becomes: is the day-2 page a different page? If so, they are loyal; if not, it's a duplicate, you don't know yet, and you move on.
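In code, that intuition becomes the body of the day-2 loop in the sketch above (same assumed names):

```java
Set<String> pages = day1Pages.get(e.customerId());
if (pages != null) {
    // Two or more day-1 pages: any day-2 visit proves loyalty.
    // Exactly one day-1 page: loyal only if the day-2 page differs.
    if (pages.size() >= 2 || !pages.contains(e.pageId())) {
        loyal.add(e.customerId());
    }
}
```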

You need to pay attention to details here, such as writing “>” where you need “>=”, or dropping the “!” in the second clause. I see these often, and I don't sweat them. Great candidates check their work carefully when they finish the algorithm and quickly catch these mistakes. Good candidates find them after a small hint. This gives me a good signal on debugging skills.

Optimizing the Solution

If candidates get this far and we have some extra time, it's nice to explore how to optimize for speed or memory. Bonus points for discussing the trade-offs between these and future maintainability.

For example, if memory is a constraint, you don't need to keep every day-1 page in the map, just two, because the question asks for “at least two pages.” A set capped at size 2, or even a two-element array, uses less memory than an unbounded set.

Or, once you've determined a customer is loyal, you don't need to waste CPU cycles the next time you encounter that customer on day 2.
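Here is a sketch folding both micro-optimizations into the map solution (names are still illustrative):

```java
static Set<String> loyalCustomersOptimized(List<LogEntry> day1, List<LogEntry> day2) {
    Map<String, Set<String>> day1Pages = new HashMap<>();
    for (LogEntry e : day1) {
        Set<String> pages =
                day1Pages.computeIfAbsent(e.customerId(), k -> new HashSet<>());
        if (pages.size() < 2) {            // "at least two" needs only two
            pages.add(e.pageId());
        }
    }

    Set<String> loyal = new HashSet<>();
    for (LogEntry e : day2) {
        Set<String> pages = day1Pages.get(e.customerId());
        if (pages != null && (pages.size() >= 2 || !pages.contains(e.pageId()))) {
            loyal.add(e.customerId());
            day1Pages.remove(e.customerId()); // already loyal: skip any further
                                              // day-2 entries for this customer
        }
    }
    return loyal;
}
```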

One word about optimization: you could argue that if requirements change later, these optimizations make the algorithm harder to modify. That's a reasonable position. As long as we can have a good conversation about the pros and cons, I'm satisfied; a high-quality discussion on how to balance these decisions is what I'm after. At the same time, being able to optimize an algorithm when needed is a hallmark of a great engineer, and you will need it a time or two in your career.

Another Solution

The vast majority of candidates go for the map approach, but occasionally a candidate goes for a different approach entirely. Maybe 5% at most.

What if you preprocess the files, sorting them by CustomerID and then by PageID?

Preprocessing is a powerful tool in your software engineering arsenal, especially when you are going to perform a whole series of operations on the data. You take the preprocessing hit once, or do it ahead of time, and amortize its cost over everything that follows. Sorting the files can be an O(n log n) operation with constant memory.
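The preprocessing itself might look like this (in-memory for simplicity; a true constant-memory version would sort the files externally):

```java
// Sort each day's entries by CustomerID, then by PageID
// (uses java.util.Comparator).
Comparator<LogEntry> byCustomerThenPage =
        Comparator.comparing(LogEntry::customerId)
                  .thenComparing(LogEntry::pageId);
day1.sort(byCustomerThenPage);
day2.sort(byCustomerThenPage);
```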

If the files are sorted, the problem becomes much easier: it's just a two-pointer algorithm you can run in O(n) with O(1) memory. You advance pointers through file 1 and file 2 until they land on the same CustomerID, which tells you that customer came on both days. Since the secondary sort key is PageID, you can use another two-pointer pass to determine whether there are at least two unique pages. A two-pointer algorithm inside a two-pointer algorithm; this is an interesting problem! I will leave the actual implementation as an exercise for the reader.

In Summary

I hope this small glimpse into coding questions from the interviewer's perspective, and into how I distinguish good, great, and poor candidates, is useful to you. Good luck in your next interview!
