Managing Recurring Problems In Your Startup

Some examples of recurring problems in startups with self-debugging questions to help point you to root cause corrective actions.

A recurring problem in a startup can take several forms.

You see a problem and implement a solution, only to have it return in a few weeks to a few months because you did not correctly diagnose the root causes.
You experience several different aspects of the same root problem; believing they are unconnected, you treat them as a set of unrelated smaller problems instead of aspects of a larger problem that you need to address.
You have the capability to solve a problem but you choose not to for reasons that don’t appear to be rational.

I define a recurring problem in a startup as one that you believed you had solved but that returns a few weeks to a few months later in essentially the same form. These kinds of problems often have both a perceptual and a psychological aspect that can make them harder to identify, diagnose, and solve–or at least manage. Often one or more key assumptions, or at least one critical aspect of how the problem is framed, make it very hard to address root causes.

Example: This is not part of my job

Howard Dernehl told me this story from a startup he had worked at early in his career. About two dozen people in the office were all sharing a small kitchen that kept filling up with dirty cups, glasses, and other dishware. A janitor service was included in the rent, but they were tasked with keeping the work area and bathrooms clean, not the kitchen sink. One morning my friend arrived to find a handmade sign posted next to the sink:

Your Mom Does Not Work Here.

If I have time to clean up after myself–you do as well

Jacob

Jacob was the CEO.

Everyone was working long hours and some may have felt they were too busy to clean. It’s inevitable that once you see one or two dirty cups in the sink, adding yours does not appear to contribute meaningfully to the mess: as Stanislaw Jerzy Lec observed, “No snowflake in an avalanche ever feels responsible.”

A couple of questions to ask yourself as a founder or manager:

Am I setting an example that contributes to this recurring problem?
Can I set a better example for behavior that will minimize or eliminate the problem?
Can I catch others “doing a good job” and praise them for behavior, practices, or results that will improve or prevent the problem?

Example: I Thought We Fixed That

From its founding, Cisco had shipped software updates either in PROM or as downloads from an FTP site. The router motherboard had sockets that allowed existing PROMs to be replaced without unsoldering. In 1992 we were preparing to launch a new line of mid-range routers that used FLASH memory soldered to the motherboard. There were daily builds of the latest software, and as we approached code freeze, some of the images started to fail–and fail for a new and troubling reason: they would not fit in the PROM memory capacity. This led to a fire drill where engineering settled on a compression scheme and a bootstrap image that would save about 40% of the space. Problem solved!

Nine months later, the compressed image would not fit. And we realized that no one had been keeping track of memory usage and the amount of slack we had left (it varied by image and chassis) or plotting when it would run out. We had naively calculated that the prior scheme had lasted six years and figured we had plenty of runway. But code growth is a function of the number of software engineers on staff, and we had something like 200 at that point, compared to perhaps a dozen in the late 80s.

More fixes, and then a few months later, customers who were very concerned about uptime (e.g., large telephone companies) noticed that routers were crashing and rebooting every few days or even more frequently. The combination of larger images and growing routing tables (also stored in memory) meant that main memory would fill up as routes were added, and the machine would crash and reboot.

We had always been concerned with “feeds and speeds” and with the processor speed and backplane bandwidth necessary to support them. But we were not monitoring memory usage. We had added floppy-based distribution to complement PROM and FTP and soon had a fire drill to adopt a multi-floppy compression scheme when the compressed image would not fit on a single floppy. Memory usage was a persistent blind spot because no one was responsible. It was a shared resource that no one owned or managed.

Some questions to ask:

What are our “commons”? What resources do multiple teams rely on but no one manages?
When you solve a problem, do you project the likely lifetime for your solution?
Who owns not just the problem but the category or area it occurred in? Are you measuring your slack or headroom?
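
As a rough illustration of projecting a fix's lifetime and measuring headroom, here is a minimal sketch in Python. It assumes constant linear growth and, following the point above that code growth tracks the number of engineers, scales growth by headcount. The function name, the capacity figures, and the per-engineer growth rate are all hypothetical, chosen only to show the arithmetic.

```python
from typing import Optional


def months_of_runway(capacity_kb: float, used_kb: float,
                     growth_kb_per_month: float) -> Optional[float]:
    """Months until usage reaches capacity, assuming constant linear growth.

    Returns None when usage is flat or shrinking (no projected exhaustion).
    """
    if growth_kb_per_month <= 0:
        return None
    headroom_kb = max(capacity_kb - used_kb, 0.0)
    return headroom_kb / growth_kb_per_month


# Hypothetical numbers, for illustration only.
CAPACITY_KB = 4096.0
USED_KB = 2496.0                # leaves 1600 KB of headroom
KB_PER_ENGINEER_MONTH = 2.0     # assumed growth contributed per engineer

for engineers in (12, 200):
    growth = engineers * KB_PER_ENGINEER_MONTH
    runway = months_of_runway(CAPACITY_KB, USED_KB, growth)
    print(f"{engineers:>3} engineers: about {runway:.1f} months of headroom left")
```

With these made-up inputs the same headroom lasts more than five years for a dozen engineers but only about four months for 200, which is why a lifetime estimate based on the past growth rate can be badly optimistic.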

Example: Margin for Error is Room to Maneuver

No one was happy. Our next-generation system relied on a new backplane, and the connectors were wired incorrectly (the signals on the board were not connected to the correct pins on the connector in the interface specification). Also, some connectors did not fit because the footprint drilled into the board (and in the CAD system) did not match the physical part. As a result, we could not do system-level testing because everything plugged into the backplane. I worked between Christmas and New Year’s Day along with the CAD Librarian and a PCB Layout designer to prevent further schedule slips. To say that no one was happy would be an understatement.

We had several after-action reviews. One was just within the CAD group, where I said, “I may be part of the problem here, so I am going to step out of the room for an hour, and I want Chuck to run the meeting and come up with a list of what we are going to do to prevent this.” Chuck was a senior engineer and well-respected within the group. When I returned, the whiteboard listed ten items we needed to change.

A key one was that we relied on manual verification of footprints against data sheets from the manufacturer. Connector pin counts now numbered in the hundreds and often had irregular numbering and spacing schemes. The librarian said, “I am very careful but between errors in the data sheet, errors that slip through checking, and mistakes on the schematic we need a test vehicle before a new connector or other part with a complex pinout or footprint gets on the critical path.”

I said–perhaps a little too automatically, “But a test board will be a significant extra expense!”

At this, several people in the group pointed out that our current fire drill had been very expensive in both additional spins and time to market. And I realized I had been an idiot (or that my team had helped me reach a creative solution in the sense of Russell Ackoff’s definition: “Creativity is the ability to identify self-imposed constraints, remove them, and explore the consequences of their removal”).

We could use a very simple board to test connectors and other complex or high pin count parts weeks or even months in advance so that errors were flushed out early and we stayed on schedule. I had to mentally reframe “we need to be more careful” into “we need more margin for error as the complexity of the tasks continues to grow.”

Some questions to ask:

Where are you relying on manual verification that you could complement with automated checking?
Where are you seeing intermittent errors that additional focus on the current process does not seem to eliminate? Is it time to redesign the process?
Take an inventory of your current procedures: which assumptions that you made in crafting them are no longer valid?

Example: But Those First Two Sales Were So Easy

Unexpected success and near misses can blind us to the actual process we need to build.

A small startup I advised had a chat box on their website that prospects could use to contact them. One evening a detailed technical question came in; one of the founders was suddenly in an in-depth conversation about some new capabilities they had announced. He could answer the questions, and the prospect asked for a trial license to verify some of the answers. Trials were part of our standard process, so this was no problem. We got a three-license order worth tens of thousands of dollars a week later. It took us a year to sell another license that included the new capability. The original buyer had been a VP of Engineering at a venture-backed company on a tight timetable to roll out their own new capabilities, and we had just what he needed.

We would have other detailed technical chats, but none advanced to a sale. We invested in more and more detailed technical content, not realizing that we needed an executive-level presentation that stressed the business impact of our approach. With the VP of Engineering, we had lucked into someone who could easily translate technical features into business benefits, but it was not a good idea to leave that as an exercise for the reader.

Some questions to ask yourself:

Do you find it difficult to duplicate an early success that seemed so easy?
Do you understand why you succeeded?
Have you mapped what challenges are getting in the way of a repeat?
Do you find yourself wishing for “smarter prospects”? If so, your pitch may be inadequate.

Example: Is It a Harbinger Or an Outlier?

A customer field service technician went to replace a card in a system and received a severe burn on his forearm when it brushed against the power supply. The supply had gotten so hot that the housing was starting to buckle. My friend, a Director of Hardware, was informed of this by a newly hired VP of Engineering, who asked him to investigate. It was a young company that had seen rapid growth, and the VP had ambitions to be the next president, something he had shared in his welcome-aboard lunch with his new staff.

My friend immediately investigated and discovered that this was a second-source power supply installed in about half of recent shipments. He made a few calls and could not find another serious failure, but he still felt they needed to take action. So he went back to the VP and suggested they send customers a notice warning them of the risk, along with a plan for a field upgrade.

The VP disagreed: “I am just getting started here and this is what I will become known for, a field failure in new equipment.”

My friend explained his fundamental fear: “customers will forgive us a failure like this if we tell them about it, but if we hide this and someone else is injured and they find out we knew about it then there will be hell to pay.”

The VP said, “Fine, you handle it, it’s in your area of responsibility.”

Ultimately this led to several changes in testing procedures, the creation of the first field recall notice at the company and related internal methods for effecting it, and upgrades to the part and component tracking for better traceability of what components were installed where.

I count this as a recurring problem prevented. Here are some questions to ask yourself:

Are you counting on a problem being an outlier when it’s more appropriate to treat it as a potential harbinger?
Do you keep track of “near misses” or other situations where, in hindsight, you realize how lucky you were?
Are you willing to invest effort and money to prevent potential problems, or do they need to happen at least once? Twice?
What are you learning from the problems and mistakes of competitors and other firms like yours?
Can you let go of the future you have planned to address the realities of your current situation?

Other Examples: Mistakes to Watch For

Ignoring a small loss or deviation from plan that grows month over month: if a trend emerges, it’s no longer a random variation and must be addressed. Your plans need to take it into account, and mitigation strategies must be started.
Valuing sunk costs: relying on a familiar process and capabilities may blind you to the need for new tools and methods.
Seeing growth shrink your margin for error, making your operation more brittle and more at risk for “bad luck.”
Realizing that you are losing touch with customers, partners, and internal practices. Founders need to delegate, but they must also actively solicit insights from others.
Letting personal conflicts blind you to feedback and complaints from key employees, customers, and partners.
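
The first item above, telling a trend apart from random variation, can be made concrete with a simple rule of thumb. The sketch below flags a shortfall against plan that has grown in each of the last few months. The function name and the three-month threshold are assumptions for illustration, not a statistical test; a noisy metric may need a longer window.

```python
from typing import Sequence


def is_sustained_trend(monthly_shortfalls: Sequence[float],
                       min_months: int = 3) -> bool:
    """True if the shortfall grew in each of the last `min_months` months.

    `monthly_shortfalls` holds plan-minus-actual values, oldest first.
    """
    if min_months < 1 or len(monthly_shortfalls) < min_months + 1:
        return False  # not enough history to judge
    recent = monthly_shortfalls[-(min_months + 1):]
    # Compare each month with the one before it.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical monthly shortfalls versus plan (in thousands of dollars).
print(is_sustained_trend([1.0, 2.0, 3.5, 5.0]))   # grew three months running
print(is_sustained_trend([1.0, 3.0, 2.0, 2.5]))   # bounced around
```

A check like this is only a prompt to look closer; once it fires, the plan and mitigation steps still need a human decision.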

Postscript: Many of These Read Like Big Company Examples. Do They Really Apply to Startups?

Jeff Ballut left the following comment, which I wanted to promote and respond to.

As I read this, I felt confused… I feel that many of these examples are from companies which are far beyond a “startup” level. “Who owns this process”, “VP of Engineering”, etc. In my definition of a startup, the owner wears all those hats, and never has time for reflection. BUT, it did make me think. Maybe my bias and perceptions are blinding me to the real lessons in this story. So I will reflect on this during sleep mode, and hopefully my subconscious will identify ideas I can recall when needed.
Jeff Ballut

All of the examples represent challenges that startups can face in their first few years and can manifest in firms with a handful of team members. “Who owns this process” is “enterprise speak” for two team members, each saying, “I thought this was my responsibility!” or “Who is responsible for this?” If a startup is “one owner who wears all of the hats,” then your definition is limited to one run by a solo entrepreneur. My focus is on teams of two to twenty or so. One of the terrible realizations I have spoken about–see “What is Your Post-Launch Growth Plan”–is that when a small team of bright generalists succeeds, they create the need for structure and specialization.

At a company level, what’s the challenge after launch? The challenge is that you have to build an effort that you can sustain and scale, which means you have to develop things like process and metrics and dashboards.

So…process, metrics, and dashboards…most of you probably have left big companies where you had process dashboards and metrics, and you are thinking “I don’t want any more of that.”

Just as little process as possible is a good thing, but you can’t get away from it completely.

After launch, you grow from a small team of generalists to a larger team of primarily specialists, because with that comes division of labor and the ability to scale. You can’t just keep hiring generalists, utility infielders who are comfortable in the white space on the org chart–an org chart that used to be pretty much entirely white space now has a lot of boxes on it. You’ve got to hire people that are outstanding in that one particular box, in part because they can rely on the folks in the boxes around them to be outstanding as well.

“What is Your Post-Launch Growth Plan”

Many recurring problems stem from a desire to use creative improvisation that consumes a lot of talent and energy and yields inconsistent results. A standard method that builds on your prior improvisations but codifies them allows you to deliver higher quality results for common problems–recurring problems–and focus your creative improvisation where it will yield higher impact.
