"OMG! A heisenbug!" - Explaining to a layman what a heisenbug is

Viewed 727 times

36

A heisenbug is a bug that changes its behavior while it is being studied [1]. The name comes from Heisenberg's finding that even simple "passive observation"* of a quantum process alters the final result.

Typical heisenbugs arise from race conditions, because any kind of measurement you make (such as trace debugging or breakpoints) ends up synchronizing the concurrent processes in one way or another. @André LFS Bacci pointed out that using floating point for monetary values can also cause heisenbugs [2].

*: In quantum physics there are no purely passive observations, but that is a detail.
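
To illustrate the floating-point remark, here is a minimal Python sketch of one way such a bug can seem to vanish when you look at it. The amounts and the two-decimal log format are assumptions made up for this example; they are not taken from reference [2].

```python
from decimal import Decimal

# Hypothetical amounts: two charges that should add up to exactly 0.30.
total = 0.10 + 0.20

# "Observing" the value through a typical two-decimal log line hides the error...
logged = f"{total:.2f}"

# ...while the comparison the program actually performs still fails.
matches_float = (total == 0.30)

# The same amounts stored as exact decimals compare as expected.
matches_decimal = (Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))

print(logged, matches_float, matches_decimal)  # prints: 0.30 False True
```

The log says 0.30 and everything looks fine, yet the equality check fails. The act of inspecting the value (with rounding) shows something different from what the code sees.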


So imagine I’m in a tricky situation, trying to solve a problem in the system. I can even reproduce the bug, but when I try to learn more about it by placing a strategic breakpoint, the bug stops happening. At that moment, I realize:

"OMG! A heisenbug!"

The PHB asks me what’s going on. Support has the customer on the line. I need to report on the progress of the investigation and ask for more time to fix it, because this is not a simple bug but a heisenbug!

How can I explain a heisenbug to my boss and to support?* They are not exactly programming experts; they think that simply writing if (stuff_will_bug()) { dont_do_stuff(); } else { do_stuff(); } magically solves the problem.

*: Preferably explanations that don’t result in my dismissal

  • 7

    Dude, now I’ll stop calling these bugs magic bugs. I really enjoyed this :D

  • @Thiagotiede, there is a whole classification of "quantum bugs". In fact, what people suffer from most where I work are mandelbugs/fractal bugs, where to fix one bug you have to fix 3 more (and so on, recursively, to infinity). The references I put in the question are worth a read.

  • I got interested in the subject. I will read more. Thanks for the tip.

  • Downvoter, could you help me improve this question?

  • 8

    I’m not the downvoter, but I think the problem with your question is that what you’re asking is "How can I explain a heisenbug to the boss [PHB] and to support?" From there, it tends to be more of an opinion-based question than an objective one. Also, there is a law of the universe (I just invented it) that says: "Anyone who ventures to explain complex concepts to the eyes and ears of idiots will only have pain, misery and suffering as a result."

  • @Victorstafusa I would understand the concern about opinion-based/subjective answers if the downvote had come with a close vote for that reason. Btw, I loved this universal maxim.

  • Oxi, my next GitHub label will be "locust from hell", and when asked what it means I’ll post the "heisenbug" link. Now seriously: how do you explain something to someone who clearly has no ability to understand it? I believe the approach should be to take the issue to the top (even if the top is "ignorant"), in order to set standard goals for the analysis and to discuss a standard approach for dealing with the client. In a work team, a one-person decision is not the best way to handle the topic.

  • @Lauromoraes the focus should not be that this is beyond other people’s technical ability, but rather that they are laymen, who would struggle with a purely technical explanation unless the content is "softened" first.

  • 9

    @Victorstafusa, "Anyone who ventures to explain complex concepts to the eyes and ears of idiots will only have pain, misery and suffering as a result": that would be a good answer if it did not conflict with a statement in the question: "Preferably explanations that don’t result in my dismissal" :-D

  • The problem seems to be parallelism and asynchronous processing, right? So you already know what’s going on; you just don’t know how to solve it. What’s the real problem? Contention between processes? Resource allocation? Inconsistent state changes? I’m asking because, like the @RBZ answer, I believe an analogy is the best way to make the problem understood, I just think the examples of analogy.

  • 1

    Downvoter, how can I improve the question? You didn’t vote to close it, so I take it you understood my question and considered it in scope, which is why I read your vote as a sign of poor quality. I am willing to try to improve my text.

  • @Fernandoleal, not every heisenbug comes from parallelism. I caught one today that was related to a coincidence of values, and depending on what I did while investigating, the bug stopped occurring. And, because it is a heisenbug, I don’t know what is happening; I can’t be sure, because every observation changes the state of the system to the point that the specific bug no longer reproduces.

  • Downvoter, could you explain why you voted down? Since you did not vote to close the question, I believe you think it has some problem but is still welcome on the site.

2 answers

26

From what I understand, you need an analogy to make it easier to explain something complex to someone who is a layman in programming (your boss).

So I’ll try to explain your problem in my own words, as if my boss were a layman.

Analogy 1 - Broken Bus

Fact

I own a public transport company.

Problem

I have a bus whose shock absorber keeps breaking.

Debug

The bus driver always travels exactly the same route.

So I’ll ride along with him and see what’s going on.

But, "incredibly", whenever I rode along, the bus didn’t break! Why?

Solution

Growing suspicious, I had the problem monitored in a less invasive way, by planting a "spy" disguised as a passenger, and bingo!

There’s a huge pothole in one of the streets along the route, and the driver doesn’t even swerve to avoid it, except when he is being observed.

In short

The mere fact of being observed temporarily corrected the problem.


Analogy 2 - Where’s my steak?

Fact

I work at XYZ and always bring my lunch in a lunch box, which I leave in the shared refrigerator at work.

Problem

Since I take the second lunch shift, every day a steak goes "missing" from my lunch box.

Debug

Because of the problem, I started walking past the cafeteria during the first lunch shift. And with that, my steak stopped "disappearing". Why?

Solution

There was an employee who, when he went to get his own lunch box, opened other people’s lunch boxes and took the steaks (outside his correct "instruction"). This was discovered by observing and analyzing all the "resources" (employees) until reaching the resource that caused the problem. That resource was not visibly connected to the problem, but an "intervention" factor (me) changed the final result.

In this case we have 3 solutions:

  • Dismiss the employee (eliminate the resource/process)
  • Install cameras (create a complementary feature that filters out the failure)
  • Issue a warning (a correction, possibly only a temporary solution)

In short

Same as Analogy 1, with an example of 3 possible, more concrete solutions.


Analogy 3 - In and out of the Beetle (true story)

Fact

I have a 1947 Beetle!

Problem

Sometimes when I’m driving my Beetle, it suddenly stops working. I pull over and try to start it over and over again, but it just won’t start.

Debug

In the old days, people said that when the Beetle gave you that problem, you should get out of the vehicle for 2 minutes, get back in, turn the key, and it would start.

Sure enough: tried and proven!

Solution

After understanding the mechanical (structural) part better, I discovered that the "coil" overheated and caused the electrical system to stop.

Getting out of the vehicle and waiting 2 minutes gave it enough time to cool down and start again.

In short

One more example, this time of a third key factor that was neither imaginable nor noticeable. When we try a solution that defies logic but works as a stopgap, we end up "biasing" our reasoning toward an incorrect "starting point".

This is an example of something where we have the problem when we "turn it on", and when we "turn it off", that is, when we "stop" to look, it is hard to find the real problem. It is as if a "visible" problem depended on another problem in order to happen.

  • I only disagree with the solutions. After all, if the solutions were that easy, the bug wouldn’t be such a mysterious bug, would it? Suppose the "solution" to the bug is to print the state of a variable to standard output. That would introduce a certain amount of synchronization, which would end up making a naturally parallel process sequential. This could mean extra hours of processing on the server. It would be like the solution of dismissing the employee in case 2; after all, stealing from a lunch box is not just cause for dismissal. Not to mention that the theft was identified, but it does not necessarily follow that anyone saw who the thief was.

  • I understood what you meant about the printing. But since the question is about the "need to explain to a layman", only analogies will do as examples. Exactly as you said, stealing from a lunch box is not just cause, which is why there were 2 other options. The point of the analogy is that the problem (bug) changes when observed (heisenbug), not how it gets fixed. But I will think of a third analogy that illustrates the nearby factors affecting the result (in these cases, that would be the other people who saw or learned that the driver was careless, or who the steak thief was) ;]

  • 3

    Okay, your Beetle example won me over.

0

As this is an explanation for laypeople, we have to be as generic as possible and adapt the way we communicate to the person we are talking to.

This is a huge challenge in our field, as it is quite common for developers to have difficulty explaining what they are doing to their superiors or customers without going into too much technical detail. Now imagine doing that with a heisenbug.

The laypeople in question (boss and support) may be people with a more refined grasp of logic. If they are not, RBZ’s answer fits perfectly.

My idea here is to bring this answer closer to the technical side, without overdoing it.

Explanation

As a more technical answer depends on the problem to be solved, I would start by identifying what type of heisenbug we are dealing with and build my explanation for the laypeople from there.

I’ll take the kind of heisenbug I have run into most often: a race condition. To get the point across, you can explain it as follows:

This error occurs in a system flow that waits for a piece of information but, from time to time, this information does not arrive in time.

I’m having trouble reproducing the bug because, when I analyze the execution of the code step by step in an attempt to find the cause, I end up creating the time this information needs to arrive, and the bug does not occur. As a result, it becomes harder to reproduce the problem and to identify the right fix.

To give an idea of the timing I’m talking about: the whole flow runs in a few milliseconds in the production environment, but when I’m analyzing the code it takes several seconds.
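
For anyone who wants to see the effect on screen, here is a minimal Python sketch of the situation described above. Everything in it is hypothetical (the `run_flow` function, the 50 ms arrival delay, the half-second pause); it only simulates the timing, with `time.sleep` standing in for the breakpoint.

```python
import threading
import time


def run_flow(observer_pause: float) -> str:
    """Hypothetical flow: read a value that a background worker delivers late."""
    box = {"info": None}

    def worker() -> None:
        time.sleep(0.05)        # the information takes about 50 ms to arrive
        box["info"] = "payload"

    t = threading.Thread(target=worker)
    t.start()

    # Stand-in for a breakpoint or step-by-step debugging session.
    time.sleep(observer_pause)

    # The flow reads the information without waiting for the worker.
    result = "ok" if box["info"] is not None else "BUG: info missing"
    t.join()                    # clean up the worker before returning
    return result


print(run_flow(0.0))  # production speed: the read happens before the info arrives
print(run_flow(0.5))  # "debugging": the pause gives the info time to arrive
```

The first call reads too early and hits the bug; the second, slowed down by the "observer", does not. A real fix would make the flow wait for the information explicitly (for example with `threading.Event`) instead of relying on timing.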
