
Hypotheses About What Happened to Facebook

by Bozhidar Bozhanov
license CC BY

Facebook was down. I’d recommend reading Cloudflare’s summary first, and then Facebook’s own account of the incident. But let me expand on that. Facebook published announcements and withdrawals for certain BGP prefixes, which led to its DNS servers being removed from “the map of the internet” – it told everyone “the part of our network where our DNS servers are doesn’t exist”. That was the result of a self-inflicted backbone failure, caused by a bug in the auditing tool that checks whether the commands being executed would do anything harmful.

Facebook owns a lot of IPs. According to RIPEstat they are part of 399 prefixes (147 of them IPv4). The DNS servers are located in two of those 399. Facebook uses a.ns.facebook.com, b.ns.facebook.com, c.ns.facebook.com and d.ns.facebook.com, which get queried whenever someone wants to know the IPs of Facebook-owned domains. These four nameservers are served by the same Autonomous System from just two prefixes – 129.134.30.0/23 and 185.89.218.0/23. Of course, “4 nameservers” is a logical construct; there are probably many actual servers behind them (using anycast).

I wrote a simple “script” to fetch all the withdrawals and announcements for all Facebook-owned prefixes (using the great RIPEstat API). Facebook didn’t remove itself from the map entirely. As Cloudflare points out, only some prefixes were affected. It may have been just these two, or a few others as well, but it seems only a handful were involved. If we sort the resulting CSV from the script by withdrawals, we notice that 129.134.30.0/23 and 185.89.218.0/23 are pretty high up (alongside the 185.89.x and 129.134.x /24s, which are contained in those /23s). That perfectly matches Facebook’s account that their nameservers automatically withdraw themselves if they fail to connect to other parts of the infrastructure. Everything else may have been down as well, but the withdrawal logic exists only in the networks that contain nameservers.
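
The original script isn’t included here, but a minimal sketch of the approach could look like the one below: it asks the RIPEstat Data API for Facebook’s announced prefixes (AS32934) and then counts announcements and withdrawals per prefix around the outage window. The data call names, query parameters and the “A”/“W” update types are my assumptions about the API’s shape, not taken from the original script, so verify them against the RIPEstat documentation before relying on them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: list Facebook's announced prefixes (AS32934) via RIPEstat and count
// BGP announcements ("A") and withdrawals ("W") per prefix during the outage window.
// A real script should use a proper JSON parser instead of regexes.
public class FacebookBgpUpdates {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final Pattern PREFIX = Pattern.compile("\"prefix\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        String prefixesJson =
                get("https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS32934");

        System.out.println("prefix,announcements,withdrawals");
        Matcher m = PREFIX.matcher(prefixesJson);
        while (m.find()) {
            String prefix = m.group(1);
            String updatesJson = get("https://stat.ripe.net/data/bgp-updates/data.json?resource="
                    + prefix + "&starttime=2021-10-04T15:00&endtime=2021-10-04T22:00");
            System.out.println(prefix + "," + countType(updatesJson, "A")
                    + "," + countType(updatesJson, "W"));
        }
    }

    private static String get(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Counts occurrences of "type": "A" or "type": "W" in the raw JSON response.
    private static long countType(String json, String type) {
        return Pattern.compile("\"type\"\\s*:\\s*\"" + type + "\"").matcher(json).results().count();
    }
}
```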

So first, let me make three general observations. They are not as obvious or as universal as they may sound, but they are worth discussing:

  • Use longer DNS TTLs if possible – if Facebook had a 6-hour TTL on its domains, we might not even have noticed that its name servers were down. That’s a hard ask for such a complex service that uses DNS for load balancing and geographical distribution, but it’s worth considering. Also, if they killed their backbone and their entire infrastructure was down anyway, a longer DNS TTL would not have solved the issue.
  • We need improved caching logic for DNS. It can’t be just “present or not”; DNS caches could keep the “last known good state” in case of SERVFAIL and fall back to that (see the sketch after this list). All of those DNS resolvers that had to ask the authoritative nameservers “where can I find facebook.com” knew the answer just a minute earlier. Then they got a failure and were suddenly wondering “oh, where could Facebook be?”. It’s not that simple, of course, but such a cache improvement is worth considering. And again, if their entire infrastructure was down, this would not have helped.
  • Have 100% test coverage on critical tools, such as the auditing tool that had the bug. 100% test coverage is rarely achievable in a typical project, but for such critical tools it’s a must.
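
To illustrate the second point, here is a minimal, non-production sketch of a cache that keeps the last known good answer and falls back to it when the upstream lookup fails – the idea standardized as “serve stale” in RFC 8767. The class and method names are made up for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "last known good state" caching: when a refresh fails (SERVFAIL, timeout),
// serve the previously resolved answer instead of returning an error.
public class StaleFallbackDnsCache {

    @FunctionalInterface
    public interface DnsLookup {
        String query(String domain) throws Exception; // e.g. an upstream/authoritative query
    }

    private record CachedAnswer(String ip, long expiresAtMillis) {}

    private final Map<String, CachedAnswer> cache = new ConcurrentHashMap<>();

    public Optional<String> resolve(String domain, DnsLookup lookup, long ttlMillis) {
        CachedAnswer cached = cache.get(domain);
        long now = System.currentTimeMillis();

        if (cached != null && cached.expiresAtMillis() > now) {
            return Optional.of(cached.ip()); // still fresh – no upstream query needed
        }
        try {
            String ip = lookup.query(domain); // TTL expired – ask upstream again
            cache.put(domain, new CachedAnswer(ip, now + ttlMillis));
            return Optional.of(ip);
        } catch (Exception upstreamFailure) {
            // Upstream failed: fall back to the last known good answer, if we ever had one,
            // instead of immediately "forgetting" where the domain lives.
            return Optional.ofNullable(cached).map(CachedAnswer::ip);
        }
    }
}
```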

The main explanation is an accidental outage. This is what Facebook engineers describe in their blog post and other accounts, and it’s what seems to have happened. However, there are alternative hypotheses floating around, so let me briefly discuss all of the options.

  • Accidental outage due to misconfiguration – a very likely scenario. These things can happen to anyone, and Facebook is known for its “break things” mentality, so it’s not unlikely that they simply didn’t have the right safeguards in place and someone ran a buggy update. There are many scenarios for why and how that may have happened, and we can’t know from the outside (even after Facebook’s brief description). This remains the primary explanation, following my favorite Hanlon’s razor. A bug in the audit tool is absolutely realistic (by the way, I’d love Facebook to publish their internal tools).
  • Cyber attack – it can’t be determined from the data we have, but this would be a sophisticated attack that gained access to their BGP administration interface, which I would assume is properly protected. Not impossible, but a 6-hour outage of a social network is not something a sophisticated actor (e.g. a nation state) would invest resources in. We can’t rule it out, as this might be “just a drill” for something bigger to follow. If I were an attacker who wanted to take Facebook down, I’d try to kill their DNS servers, or indeed, “de-route” them. If we didn’t know that Facebook lets its DNS servers cut themselves off from the network in case of failures, the fact that so few prefixes were updated might be an indicator of a targeted attack, but this seems less and less likely.
  • Deliberate self-sabotage – 1.5 billion records were claimed to have been leaked yesterday. At the same time, a Facebook whistleblower is testifying before the US Congress. Both stories are potentially damaging to Facebook’s reputation and share price. If they wanted to drown that news and the corresponding share price plunge in a technical story that few people understand but everyone is talking about (and then have the share price rebound, because technical issues happen to everyone), then that’s the way to do it – just as a malicious actor would, but without the hassle of gaining access from the outside: de-route the prefixes for the DNS servers and you have a “perfect” outage. These coincidences have led people to assume such a plot, but given the observed outage and Facebook’s explanation of why the DNS prefixes were automatically withdrawn, this sounds unlikely.

Distinguishing between the three options is actually hard. You can mask a deliberate outage as an accident, and a malicious actor can make an attack look like deliberate self-sabotage. That’s why there is speculation. To me, however, based on all the data we have in RIPEstat and the various accounts by Cloudflare, Facebook and other experts, it seems that a chain of mistakes (operational and possibly design ones) led to this.


Digital Transformation and Technological Utopianism

by Bozhidar Bozhanov
license CC BY

Today I read a very interesting article about the prominence of Bulgarian hackers (in the black-hat sense) and virus authors in the 90s, linking it to the focus on technical education in the 80s, led by the Bulgarian communist party in an effort to revive communism through technology.

Near the end of the article I was pleasantly surprised to read my name, as a political candidate who advocates for e-government and digital transformation of the public sector. The article then ended with something that I deeply disagree with, but that has merit and is worth discussing (and you can replace “Bulgaria” with probably any country):

Of course, the belief that all the problems of a corrupt Bulgaria can be solved through the perfect tools is not that different to the Bulgarian Communist Party’s old dream that central planning through electronic brains would create communism. In both cases, the state is to be stripped back to a minimum

My first reaction was to deny ever claiming that the state should be stripped back to a minimum, because it will not be (at the risk of enraging my libertarian readers), or to argue that I’ve never claimed there are “perfect tools” that can solve all problems, nor that digital transformation is the only way to solve them. But what I’ve said or written has little to do with the overall perception of techno-utopianism that IT-people-turned-policymakers usually struggle with.

So I decided to clearly state what e-government and digital transformation of the public sector is about.

First, it’s just catching up to the efficiency of the private sector. Sadly, there’s nothing visionary about wanting to digitize paper processes and provide services online. It’s something that’s been around for two decades in the private sector and the public sector just has to catch up, relying on all the expertise accumulated in those decades. Nothing grandiose or mind-boggling, just not being horribly inefficient.

As the world grows more complex, legislation and regulation grow more complex, and the government takes on more and more functions and more and more details to care about. There are more topics to have a policy on (and many on which to take an informed decision NOT to have a policy). All of that, today, can’t rely on pen and paper and a few proverbial smart and well-intentioned people. The government needs technology to catch up and do its job. It has had the luxury of not having competition, and therefore it lagged behind. When there are no market forces to drive digital transformation, what’s left is technocratic politicians. This efficiency has nothing to do with ideology, left or right. You can have “small government” and still have it inefficient and incapable of making sense of the world.

Second, technology is an enabler. Yes, it can help solve the problems of corruption, nepotism and lack of accountability. But as a tool, not as the solution itself. Take open data, for example (something I was working on five years ago, when Bulgaria jumped to the top of the EU open data index). Just having the data out there is an important effort, but by itself it doesn’t solve any problem. You need journalists, NGOs, citizens and a general understanding in society of what transparency means. Same for accountability – it’s one thing to have every document digitized, every piece of data published, and every government official’s action leaving an audit trail; it’s a completely different story to have society act on those things – to have institutions that investigate and public pressure that turns it into political accountability.

Technology is also a threat – and that goes beyond the typical cybersecurity concerns. It poses the risk of dangerous institutions becoming too efficient, of excessive government surveillance, of entrenched interests carving their way into digital systems to perpetuate their corrupt agendas. I’m by no means ignoring those risks – they are already real. The Nazis, for example, were extremely efficient at finding the Jewish population in the Netherlands because the Dutch were very good at citizen registration. This doesn’t mean that you shouldn’t have an efficient citizen registration system. It means that such a system is neither good nor bad per se.

And that gets us to the question of technological utopianism, of which I’m sometimes accused (though not directly in the quoted article). When you are an IT person, you have a technical hammer and everything may look like a binary nail. That’s why it’s very important to have some grounding in the humanities as well. Technology alone will not solve anything. My blockchain skepticism is a hint in that direction – many blockchain enthusiasts claim that blockchain will solve many problems in many areas of life. It won’t. At least not just through clever cryptography and consensus algorithms. I once even wrote a sci-fi story about exactly the aforementioned communist dream of a centralized computer brain that solves all social issues while people are left to do what they want, and argued that no matter how perfect it is, it won’t work in a non-utopian human world. In other words, I’m rather critical of techno-utopianism as well.

The communist party, according to the author, saw technology as a tool by which the communist government would achieve its ideological goal.

My idea is quite different. First, technology is necessary for the public sector to “catch up”; second, I see technology as an enabler. What for – whether for accountability or surveillance, for fighting corruption or entrenching it even further – is something we as individuals, as a society, and (in my case) as politicians have to formulate and advocate for. We have to embed our values, after democratic debate, into the digital tools (e.g. by making them privacy-preserving). But if we want good governance and good policy-making in the 21st century, we need digital tools – while fully understanding their pitfalls and without putting them on a pedestal.


Every Serialization Framework Should Have Its Own Transient Annotation

by Bozhidar Bozhanov
license CC BY

We’ve all used dozens of serialization frameworks – for JSON, XML, binary, and ORMs (which are effectively serialization frameworks for relational databases). And there’s always the moment when you need to exclude some field from an object – make it “transient”.

So far so good, but then comes the point where one object is used by several serialization frameworks within the same project/runtime. That’s not always the case, but let me first discuss the two alternatives:

  • Use the same object for all serializations (JSON/XML for APIs, binary serialization for internal archiving, ORM/database) – preferred if there are only minor differences between the serialized/persisted fields. Using the same object saves a lot of tedious transferring between DTOs.
  • Use different DTOs for different serializations – this becomes a necessity when scenarios get more complex and using the same object turns into a patchwork of customizations and exceptions.

Note that both strategies can exist within the same project – there are simple objects and complex objects, and you may need a variety of DTOs only for the latter. But let’s discuss the first option.

If each serialization framework has its own “transient” annotation, it’s easy to tweak the serialization of one or two fields. More importantly, it will have predictable behavior. If not, then you may be forced to have separate DTOs even for classes where one field differs in behavior across the serialization targets.
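
For example, in the Java ecosystem Jackson has its own @JsonIgnore and JPA its own @Transient, so the same class can expose a field to one serialization target and hide it from another. The class below is a made-up illustration, not taken from the post:

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Transient;

// Each framework brings its own exclusion marker, so the behavior stays predictable.
@Entity
public class UserProfile {

    @Id
    private Long id;

    private String displayName;      // goes to both JSON and the database

    @JsonIgnore                      // excluded from Jackson's JSON output only
    private String internalScore;

    @Transient                       // excluded from JPA persistence only
    private String computedGreeting;
}
```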

For example, the other day I had the following surprise: we use Java binary serialization (ObjectOutputStream) for some internal buffering of large collections, and the objects are then indexed. In a completely separate part of the application, objects of the same class get indexed with additional properties that are irrelevant for the binary serialization and are therefore marked with the Java transient modifier. It turns out GSON respects the transient modifier too, so these fields were never indexed.
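
Here is a simplified reproduction of that surprise (the class and values are made up; only the Gson behavior is the point), together with one possible workaround via GsonBuilder:

```java
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import java.io.Serializable;
import java.lang.reflect.Modifier;

public class TransientSurprise {

    static class Document implements Serializable {
        String title = "Q3 report";
        // Irrelevant for ObjectOutputStream buffering, hence `transient` –
        // but still wanted in the JSON that gets indexed.
        transient String indexOnlyField = "finance";
    }

    public static void main(String[] args) {
        Document doc = new Document();

        // Default Gson also honors the Java `transient` modifier:
        System.out.println(new Gson().toJson(doc));
        // -> {"title":"Q3 report"}   (indexOnlyField is silently dropped)

        // One workaround: exclude only static fields, so transient ones are serialized again.
        Gson includeTransient = new GsonBuilder()
                .excludeFieldsWithModifiers(Modifier.STATIC)
                .create();
        System.out.println(includeTransient.toJson(doc));
        // -> {"title":"Q3 report","indexOnlyField":"finance"}
    }
}
```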

In conclusion, this post has two points. The first: expect any behavior from serialization frameworks and have tests that verify the different serialization scenarios. The second is for framework designers: don’t reuse transient modifiers/annotations from the language itself or from other frameworks; it’s counterintuitive.
