Whenever someone publishes an article like this, I want them to find out what it's like when you reach the threshold of 100,000 individual delete requests per second, which ends up being five million actual deletes once you factor in all of the associated references to the item being deleted and its metadata. Then I want them to find out what happens when you have to propagate those deletes across geographically distributed data centers, clear a geographically distributed cache, and do it in a way that minimizes user-facing errors and guarantees consistency. Finally, you also have to ensure that deletions don't affect any business-facing applications, since ad revenue and metrics are all generated from certain types of data.
These people imagine you just rm -rf a file and run a few SQL queries, but you literally can't just delete data. I've worked at the scale of Instagram before. Literally nothing works like that at that scale; the person who wrote that article should have called up someone from Instagram and had them explain, at a really high level, how complicated deleting data is at large scale.
Hi. I also work at this scale. It doesn't take a year to delete the data. Be upfront with your users:
'Your delete request is being processed. No one will be able to see this item though it may take up to three days for the data to be completely removed from our servers'
This is not a technical challenge, it's an issue of priorities and Facebook doesn't prioritize privacy. They were able to release a TikTok clone in a few months, but can't solve deleting data?
It does if you have cold backups that take a year to cycle out. They're often offsite, compressed, and incrementally hashed so finding individual items and removing them is really hard -- you're much better off just waiting for them to expire, and a year isn't an unreasonable amount of time.
The article specifically states this data was downloaded using the "Download Your Information tool on Instagram". The undeleted data wasn't from a cold backup or any backup, it was still on production systems.
Doesn't that link show that how backups relate to GDPR is still up for debate? The accepted answer that says they are included has 3 upvotes, and there is an answer that says they aren't included that has 2 upvotes. Either way, the legal requirement is a separate issue from what Instagram is actually doing. If that download tool is automated, I am rather confident in saying that it isn't combing through year-old backups to get data.
Ever wonder why "download my data" usually takes a few hours/days on most services before you receive the download link in your email? I promise you compiling the production data doesn't take that long.
Backups are covered under GDPR, although when a user requests erasure you can say "your data will be rotated out in X months/years". Not sure how this applies to access, but I assume it's similar: https://ico.org.uk/for-organisations/guide-to-data-protectio...
Whenever I see someone write a comment like this, I want them to understand: don't say "deleted" if it's not deleted.
Also, if it doesn't work at scale, don't tell your users that it works and pretend it works as you communicated.
I find it really bothersome when engineers try to hide behind "it's a very complicated process in the background" and just say something has been done when the truth is that it's in progress. It's not a user error: when users see the word "deleted" they assume their picture is gone. It's your error if you state it's gone while it is not.
I agree with you completely; no one should be hiding behind "it's a really complicated process". They need to actually sit down, come up with a concrete plan for how to tackle the problem, and then communicate: "we have a plan in place, it's been added to the roadmap, and we are going to tackle it in Q3 20XX."
Developers are also falling into this trap regularly with distributed systems. MongoDB famously claimed to be "eventually consistent" but in reality failed to deliver on this promise. The MongoDB authors then tried to weasel their way out of it the same way you are describing and many developers rightfully called them out on it.
Bingo! It's the same as saying "we take your username and password and let you log in." Do you or do you not? It's a boolean feature that either does or doesn't do what you say it does. Deletion can be complicated, and if it is, provide that indication to your users. Software is about setting the correct expectations; otherwise we leave users to interpret written language, which we all know leaves room for misinterpretation due to all sorts of complexities in the way humans communicate with one another.
This rationalization isn't acceptable to me. They have gladly set up the infrastructure to handle a high volume of new information and it should go both ways.
should
See RFC 2119: "should" denotes something recommended but not strictly required (it's "may" that marks something as truly optional)
But really, almost zero companies build software around data in such a way that the data could and will disappear at any moment.
Their comment wasn't an RFC. Right to be forgotten and GDPR are not suggestions.
Regardless, in practice, billion dollar companies have the resources to ensure 'delete' isn't actually "partially hide for years/forever". With proper surrogate IDs they could even preserve a minimal amount of meta such as tombstone IDs to ensure all first and third party systems are in compliance.
Woe is me. My poor little (looks at notes) trillion dollar company can't cope! Instagram seems to be able to cope with 100,000 new photos per second -- that they can do. Across geographies and data centres. Ever seen a user-facing error due to a newly uploaded photo?
No one expects you to just rm -rf. They just expect you to put the same effort into removing the file that they put into creating it. It's not like the file got propagated to 10 different data centres by accident. There was a decision that an upload needed to be everywhere in 5 seconds and a deletion was an advisory hint that could be ignored for a year.
Google has an internal standard for how long it takes to wipe out user data after a deletion request, and compliance with it is taken very seriously. It's a lot less than 1 year.
I think this points to (one of) the significant flaws in the cult of the "minimum viable product". By eliding up-front design tasks, you may implement something in which complying with the law is impossible. Data retention compliance has to be designed in from the beginning, or it might be technically infeasible to add it later.
I think this is why you need a very senior engineer overseeing your MVP.
Before MVPs were championed, companies would create a full-fledged project management suite with 50 database models and release it to crickets.
An MBA and a junior engineer define an MVP as “You can log in, and add todo lists and check them off.”
A senior engineer defines that same MVP as “You can log in, add todo lists, check them off, delete your account, and if the server crashes or we’re hacked, we have backups for 30 days.”
Or in other words with an MVP you still need to complete the 80% of the iceberg the MBA and junior engineer can’t see, but you don’t build the whole iceberg at once.
...and I do not believe for one second that they have never discovered a copy of data lying around somewhere after they thought it was deleted. The choice is to admit it, or lie about it.
It has actually happened and they did come clean about it. Some data was discovered in log files that shouldn't have been there. They owned up and fixed it pronto, even though the chances that the data had leaked outside of Google were nil. There is plenty wrong with Google, but this isn't one of those things.
If you can insert... for sure you can delete? It doesn't take days for your picture to show up.
Sure, eventual consistency, replication, and distributed storage are hard.
Alternative approach: encrypt at rest, delete encryption key. It would require extra resources for sure, but if privacy was a concern... ;-) (and then have a reasonable expiry on your CDN if you use one).
(of course then we can argue that deleting the encryption key may be difficult too, for the same reasons)
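The encrypt-and-drop-the-key idea ("crypto-shredding") can be sketched in a few lines. This is a toy illustration, not anyone's actual implementation: the keystream cipher below is a stand-in for a real AEAD cipher like AES-GCM, and the in-memory dicts stand in for a real key store and blob store.

```python
import hashlib
import secrets

def _keystream_xor(key, data):
    # Toy SHA-256-counter keystream standing in for real AES-GCM.
    # Do NOT use this construction in production.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

class CryptoShredStore:
    """Each user's data is encrypted under a per-user key. 'Deleting'
    the user means destroying the key: every ciphertext copy (replicas,
    CDN, cold backups) becomes unreadable at once."""
    def __init__(self):
        self.keys = {}    # user_id -> key (small, mutable key store)
        self.blobs = {}   # user_id -> ciphertext (may be copied to backups)

    def put(self, user_id, plaintext):
        key = self.keys.setdefault(user_id, secrets.token_bytes(32))
        self.blobs[user_id] = _keystream_xor(key, plaintext)

    def get(self, user_id):
        key = self.keys.get(user_id)
        if key is None:
            return None   # key shredded: data is gone even if blobs remain
        return _keystream_xor(key, self.blobs[user_id])

    def shred(self, user_id):
        self.keys.pop(user_id, None)  # the only datum that must be erased

store = CryptoShredStore()
store.put("alice", b"embarrassing photo bytes")
backup_copy = dict(store.blobs)    # simulate an offsite backup
store.shred("alice")
assert store.get("alice") is None  # unreadable, even though...
assert "alice" in backup_copy      # ...ciphertext copies still exist
```

The appeal is exactly what the parent notes: the deletion problem shrinks from "find every copy" to "erase one small key" -- with the caveat from the follow-up that the key store itself now needs careful deletion semantics.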
The cost of insert and delete is massively asymmetrical. This is an intentional and fundamental architectural decision in all of our data infrastructure and built into the data structures and algorithms that are used. Not only does “delete optimized” data infrastructure not exist, in many cases we don’t have good computer science for how you would even design such a thing while having good performance for all the other operations.
People that suggest using encryption and throwing away the key have not thought through the implications of that approach. It scales very poorly and therefore is not suitable for most practical systems. There are good reasons “obvious” solutions like this are not used.
It depends on the definition of “delete”. Removing a record from a data model scales just fine, and is how it has always been defined in database systems. Making that record physically unrecoverable from any hardware in the broader system is extraordinarily expensive for fundamental technical reasons. Traditionally “delete” has always meant the former in all systems, and physical deletion is deferred to a point in the future when either an opportunity arises to do it inexpensively or the cost becomes acceptable. There is value in recovering the resources consumed by the deleted record, but it comes at a very high cost.
If you change the definition of “delete” to mean unrecoverable physical deletion, then sure, it scales poorly. But it is a bit like redefining “fast car” to “can travel faster than Mach 5” -- technically valid as a redefinition while completely ignoring the engineering realities of what a car can do.
You’re right. Maybe the technical definition means altering a record to say DELETE=TRUE, but the widely used definition means unrecoverable, and that’s what matters.
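The DELETE=TRUE distinction above can be made concrete with a small sketch. This uses a hypothetical `photos` table in SQLite purely for illustration; the point is that a soft delete only hides the row, while a hard delete (plus a file rewrite like `VACUUM`) is needed before the bytes actually leave the database file.

```python
import sqlite3

# Autocommit mode so VACUUM can run outside a transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, owner TEXT, "
             "blob_ref TEXT, deleted INTEGER DEFAULT 0)")
conn.execute("INSERT INTO photos (owner, blob_ref) "
             "VALUES ('alice', 's3://bucket/1')")

# Soft delete: the record is only hidden. Every query must remember the
# WHERE clause, and the data still sits in the table (and in every backup).
conn.execute("UPDATE photos SET deleted = 1 WHERE owner = 'alice'")
visible = conn.execute(
    "SELECT COUNT(*) FROM photos WHERE deleted = 0").fetchone()[0]
print(visible)      # 0 -- hidden from users

still_there = conn.execute("SELECT COUNT(*) FROM photos").fetchone()[0]
print(still_there)  # 1 -- but not actually gone

# Hard delete: the row is removed from the table, though the bytes may
# still linger in unvacuumed pages until something rewrites the file.
conn.execute("DELETE FROM photos WHERE owner = 'alice'")
conn.execute("VACUUM")
print(conn.execute("SELECT COUNT(*) FROM photos").fetchone()[0])  # 0
```

Even the hard-delete path only addresses one database file; replicas, caches, and backups are exactly the copies the rest of this thread is arguing about.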
Actual deletion is what Facebook told us they did; that's what everyone assumed when they said they were complying with GDPR. This is like advertising a car as 'faster than Mach 5', and then when people call you out for lying, saying, "well, 'faster' is a relative term."
This kind of crap is exactly why people don't trust Facebook. It's not because people are paranoid, it's because Facebook systematically creates expectations in their advertising and public releases that they're acting responsibly, and then acts like they're the victim of unfortunate circumstances and misunderstanding whenever they get called out.
If three weeks ago a tech journalist had written an article saying, "Facebook will fully delete your data when you ask", no one at Facebook would have been reaching out to that journalist saying, "Oh, in the interest of preventing misunderstanding, we don't actually delete the data, we just mark it to be ignored." But now that they've been caught, now it's just a big misunderstanding by people who don't understand database architecture.
When companies tell the public that they're doing something, it is reasonable for the general public to assume that they're referring to the commonly understood definitions of the words they use.
I find it very interesting that, given a chance to hoover up a billion records, we see no technical challenge at all, but when asked to get rid of them suddenly you're demanding unobtainium.
If you can't reliably get rid of the data you administer, don't collect it in the first place.
Just because it's hard doesn't make it optional. If you operate at the scale of Instagram, yes, it becomes very hard, but the companies typically have both the revenues and expert staffing to do it.
And if they don't, they're mismanaged and need to allocate more resources to it, because it's illegal not to, for good reason.
You can't remove data that you don't control, e.g. removing embarrassing pictures from the internet. However, if you control the server, you control the data. I can appreciate that it might not be as simple as a unix command, but it is a single letter in CRUD. It's fundamental.
I don't think people understand the vastness of resources that FB and Instagram have at their disposal. This is not a particularly hard problem. Yes it's not a simple SQL query. But they do solve it -- in fact this was a bug that they have already fixed.
They've got 30 days to delete data, there's no excuse not to manage it within 30 days.
Also, you're saying they intentionally built a software stack without even thinking about how to obey the law (the 30-day deletion requirement existed even before the GDPR, since 1996 in fact, just with lower fines).
I hope in a few years this case will be taught in school just like the Therac-25 case is being taught right now.
It's not that simple. Some of these companies have tech stacks that are runaway trains, written by people who had good intentions but built their code and infrastructure for 1/50th of the traffic they currently have. I don't know anything about Instagram's infrastructure, but what I can tell you for certain is that every one of these companies has legacy infrastructure and code that keeps them from being able to meet all legal requirements. You can't just make a blanket statement that they don't have an excuse.
> I don't know anything about instagram's infrastructure but what I can tell you for certain is that every one of these companies has legacy infrastructure
So you don't know anything about Instagram but at the same time know for certain that they have legacy infrastructure that keeps them from meeting legal requirements? How about we don't give them a pass just because it requires a bit of work to comply with laws and regulations.
Maybe it's just me, but I have yet to work for a company that didn't have legacy infrastructure holding it back in some way. Even the startups I've worked for had to make odd business decisions in order to take on new customers. Those odd one-offs and business decisions eventually led to sprawling legacy infrastructure. If there is an infrastructure Utopia, I have yet to work there or hear of it.
These companies are shipping new features every day. If they have the bandwidth to work on those projects then the failure to meet legal requirements is an issue of prioritization, leaving no excuse.
And this is the attitude that causes businesses to tell lies about everything they are doing. If you want transparency with regard to how your data is being handled, you need to ease up a bit and realize there are biological limitations to how fast people can make changes. Most of these companies have people who want to fix these problems. They just have hurdles you don't know about; some of them are internal politics and others are technical.
This. The amount of time quoted in the article is excessive, but the article is also sparse on details (maybe his data request also sent him backups related to him? A year is reasonable in that case).
Deletion is pretty much the hardest thing to do in data management, way harder than inserting or retrieving data. I know GDPR says you need to be able to delete a user's data, but how reasonable is it to trawl through your offsite backups to find their items and remove them?
> "The researcher reported an issue where someone’s deleted Instagram images and messages would be included in a copy of their information (...) We’ve fixed the issue"
This makes it sound like they consider "you could see it" the issue, not "we were still keeping it". In other words, the fix was to hide it, not to delete it.
If I were the Irish DPA (and actually wanted to do my job and had the resources to, instead of being intentionally lazy/crippled to attract tech firm headquarters), I'd definitely be asking for retention plans and evidence that the data is now being removed in a timely manner, and start issuing fines (small ones for past transgressions, big ones if they keep doing it or don't have a decent plan how to make sure to get rid of data they shouldn't be having).
For comparison: Deutsche Wohnen (large real estate) got slapped [1] with a 14.5 million EUR fine for over-retaining sensitive tenant data and not having an automated system to delete it.
Why small fines initially? I’d like to see privacy fines being used to make lots of money like traffic fines are used today. It’s a way to tax tech companies in your jurisdiction with the added benefit of improving privacy.
Because the goal is compliance, not to put companies out of business. When the laws were first enacted everybody was screaming that it was just to put companies out of business. Now they are wondering why the small initial fines.
It's simple: change your ways and use the initial fines as a wake up call. If you then do not wake up and persist the fines will get heavier and heavier until you will pay attention.
A Dutch hospital managed to get to the third round of fines and they weren't all that happy afterwards. 460K Euro fine for a single instance of ignoring the regulators on a single individual.
Believe me when I tell you they have understood now.
The initial fine was zero, just a warning to improve.
The case revolved around a very minor dutch celebrity whose data was reviewed by hospital employees that should not have had that access.
> Because the goal is compliance, not to put companies out of business.
There's middle ground between "we take 100% of your revenue" and "we take 0.001% of your revenue". Given that we're not this lenient with private citizens and small companies, why should we be with international corporations?
That's why the fines can be ramped up. The largest fines were a substantial fraction of the revenues for the companies they were addressed to and there is no practical limit once you take per violation figures into account. You ignore this at your peril.
It is still somewhat unclear whether multiple fines can be issued for different transgressions; there hasn't been such a case yet and nobody has gotten close to the limit, so for now this is still a grey area. But I think that once fined at that level, no sane CEO is going to risk a second such fine in the same year, or ever.
Imagine if we lived in a society where you were given a small fine for the first time you commit murder and then life imprisonment for the second offence. This may lead people to believe that murder is a serious but forgivable offence when it is not. That’s my first argument: small fines play down the seriousness of the “crime”.
My second argument is the efficiency of using capitalism to fight capitalism. Take money from companies who make mistakes with personal information. Be that out of ignorance, malice or bad luck. Why does it need to be fair and just - make money from it. Make it a risk to capture personal information in the first place. If those companies go out of business then so be it. Others will take their place. It’s not impossible to do business without storing personal information, there is just not enough incentive to bother.
(1) this isn't murder, so that's a false equivalence. Murder is in a different book of law than privacy laws.
(2) small fines do not play down the seriousness of the transgression (which is the word I think you should be using for instances like this). They merely indicate that you should clean up your act assuming no real harm has been done. In some cases the regulators have immediately resorted to fines, and quite large ones as well if they felt that the case warranted it. They do have that option.
But putting companies out of business was never the goal, contrary to what a lot of alarmist people were screaming when the law went into force. Also, over time as more and more companies have been fined I would expect that the initial fines will go up because claiming ignorance really isn't an option any more. Some comments in this thread are particularly worrisome in that light, it appears that some people still don't get it and they are in positions where they really should know better.
Turning data into a liability rather than an asset is the long term outcome. This will take time, and when it happens I'll be that much happier. Every company will have to seriously weigh the price of holding on to some datum vs the price of losing it.
You’ve made some good points there. I guess it’s a balancing act and if the regulators go too far too quickly then we may never get to that long term goal.
Exactly. I keep a close watch on the fines via the enforcement tracker. I think some of them are too strict, some too lenient but overall the picture is actually quite ok.
> Imagine if we lived in a society where you were given a small fine for the first time you commit (an offense) and then (a harsher punishment) for the second offence.
We do live in that society. For example, the first few time you get caught speeding, you get a warning or a ticket with a fine. Keep getting caught speeding, and you lose your license and/or go to jail.
What makes you think even deleting one's account will truly wipe data?
These companies profit from personal info whether one is a customer or not. I would imagine everything exists with a "user de-activated since $DATE" entry or similar in their database.
If they can't show the pictures that they have claimed to have deleted, then they can't use them to bring in eyes for advertisers, which means the data is a net loss for them, causing them to have to buy more storage sooner.
They should want to delete data as soon as possible
Hmm yes, they probably free up the bulk of the space and just keep extensive metadata instead: "We don't have the picture, but we know you took one at date, time, location, make & model, shutter speed & aperture, and we recognized the faces of these 2 users X & Y, and another non-user for whom we have shadow profile Z was also tagged."
This sort of practice is not limited to just Instagram. Plenty of places do soft deletes when they should be doing hard deletes. Data life-cycles are about the most poorly understood subject in startup land. Ingestion is usually top notch, friction-free, and heavily automated. Deletion -- assuming it even exists -- is semi-automatic or even manual, full of friction, and usually incomplete or broken.
You see a similar pattern with respect to signups vs account cancellation.
The weird thing to me is that it is usually the marketing department and not the legal or the compliance department that has the upper hand in these data retention discussions. Fortunately thanks to the GDPR this is now changing and slowly companies are coming around on this.
I think just about everything should be a soft delete, however you need a time limit where you sweep those. Ideally you would even give the user an option to accelerate that (as much as technically possible) if they really want something gone.
That would be one way to implement a data life cycle that would probably meet with regulators' approval, but it depends on lots of little details and the length of that 'time limit'.
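A minimal sketch of that time-limited soft delete, assuming a hypothetical `posts` table and a 30-day grace period (the actual retention window is exactly the policy detail being debated above):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)   # assumed grace period; set per policy

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT, "
             "deleted_at TEXT)")  # NULL = live, timestamp = soft-deleted

def soft_delete(post_id, now):
    # User-facing delete: hide the row and start the retention clock.
    conn.execute("UPDATE posts SET deleted_at = ? WHERE id = ?",
                 (now.isoformat(), post_id))

def sweep(now):
    # Periodic sweeper: hard-delete anything soft-deleted longer ago
    # than RETENTION. ISO-8601 strings compare correctly as text here.
    cutoff = (now - RETENTION).isoformat()
    conn.execute("DELETE FROM posts WHERE deleted_at IS NOT NULL "
                 "AND deleted_at < ?", (cutoff,))

now = datetime(2020, 1, 1, tzinfo=timezone.utc)
conn.execute("INSERT INTO posts (id, body) VALUES (1, 'old'), (2, 'new')")
soft_delete(1, now)
soft_delete(2, now + timedelta(days=20))

# 40 days later: post 1 is past the window, post 2 is still inside it.
sweep(now + timedelta(days=40))
remaining = [r[0] for r in conn.execute("SELECT id FROM posts")]
print(remaining)  # [2]
```

A "delete now" button for users would just call `sweep`-style hard deletion for that one record immediately, as far as the storage layer allows.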
I have seen people in startup groups giving advice to retain trial, inactive, and deleted accounts for a little while so they can analyze them and train their ML models.
Yes, I've seen this too. Too often to be happy about. On the plus side, more and more companies we look at really get it and do their utmost best to do it right. The GDPR has definitely woken people up.
Yes, especially backups and log files are hard. The databases are relatively easy if they allow for in place overwrites or compaction of tables. Even then you have to be careful.
Data is funny that way. It is easy to acquire, easy to lose if you want to keep it and devilishly hard to get rid of for real if that is what you want to do.
You have several options, none of them pretty. The first is to load, update and rewrite the backup. Time consuming, error prone and you may fuck up the backup that you need.
The second starts way further back: encrypt sensitive fields at rest, drop the keys when the user requests a deletion.
That way you never even have to touch the backups in order to have the right end result.
Not currently running a user visible service. Fortunately. The one I did run was started in 1998 and had turned into a very large pile of super insecure PHP and was shut down. If I were to start a user visible service today I would think really long and hard about whether or not the kind of superstructure required would be offset by the potential gains. The web is no longer a place for amateur projects with a focus on the happy path, you need to do it all 'just so' or you can expect to be compromised. And doing it right isn't simple nor is it cheap.
Key management is indeed just another - hopefully simpler - version of the same problem. The reason why it simplifies things is because a single key can invalidate a lot of data stored in places that are out of reach such as cold backups.
The vast majority of companies I've seen consider issuing an "SQL DELETE" sufficient for GDPR compliance.
One day, they will be burned when some investigation finds out that those rows are still sitting in uncompacted tables, or on now-unallocated disk blocks, or on now-remapped SSD sectors.
The only way to be sure the data is deleted is to copy all the data you want to keep to a new drive and burn the old one. Anything less and there is a reasonable chance an expert could recover at least some of the deleted info.
All it takes is one dataleak[+] to blow that wide open. Secure erase is hard. But what is much less hard is to encrypt data and then to limit the problem to getting rid of the decryption key. This reduces the problem in scope to one single datum rather than a whole chain of possible plaintext copies.
Secure erase is a contractual requirement for many relationships, it is interesting that none of the major db vendors as far as I know have a secure delete option.
There are lots of tricks and attempts to work around it but no official support for such functionality afaik. And that's before we get into VM snapshots, database snapshots, backup copies, copies that were made by developers or data scientists for test purposes (of course, nobody ever does that) and so on. Nasty little problem.
[+] or some employee shooting their mouth off in an online forum.
> But what is much less hard is to encrypt data and then to limit the problem to getting rid of the decryption key. This reduces the problem in scope to one single datum rather than a whole chain of possible plaintext copies.
Key issue: you can't do any operations on encrypted data, essentially you're killing off your database.
Homomorphic encryption is academic research, not something that is widely available and supported in common open and closed source databases. The best you can get is a database that encrypts the on disk data (Oracle TDE), but that only protects against a server being stolen or hacked on the OS level.
I fully expect that we will see application level and system level tools that are compliant with the law pop up any day now, the need is certainly there.
Also, for purposes of operations on data it all depends on what the column holds. For instance, you don't actually need access to fields such as names and dates of birth if you have a client ID and that is yours, not the customers and so you could leave that field in plain text. Any operation would then need that client ID but that's workable.
You could even say that if you would need that decryption key for anything other than user or controller directed computation that you are probably doing something you shouldn't be doing. In all other cases the context is clear, consent has been obtained and the data can be decrypted if required.
Secure delete is not implemented in databases because it is extraordinarily expensive, and destroys the performance of all other non-delete operations. When you say “encrypt data”, you are ignoring the fact that it can’t be implemented as “same data structures, just encrypted”. Encryption puts constraints on data structures and data representation that are fundamentally incompatible with and adverse to the design and functioning of most databases in existence, even ignoring the other ugly operational issues with that approach which no one thinks about. It would require radically redesigning the internals of most database engines, which is not really a practical option.
I’ve thought a lot about what it would take to design a “delete optimized” database kernel ever since GDPR became a thing. In principle it is possible, but I seriously doubt anyone would use a database that is literally orders of magnitude slower and less scalable for everything except delete operations. Would it be acceptable to increase the resource intensity of databases by 10x (and the environmental footprint that implies) to get “real” deletes? That is the tradeoff here.
This has precedent in SQL databases designed for high-assurance applications with ultra-fine access and visibility controls. When the average software engineer understands how they work, it sounds like a great idea for having more secure data and they wonder why it doesn’t seem to exist. The reality is that they do exist but they are so abysmally slow for even elementary things that no one would ever dream of using it unless there is a narrow government requirement.
> When you say “encrypt data”, you are ignoring the fact that it can’t be implemented as “same data structures, just encrypted”. Encryption puts constraints on data structures and data representation that are fundamentally incompatible with and adverse to the design and functioning of most databases in existence, even ignoring the other ugly operational issues with that approach which no one thinks about. It would require radically redesigning the internals of most database engines, which is not really a practical option.
You can't do effective per-user encryption on columns the database software needs to read (things you'll query or join on), but the database rarely needs message/post content and image content (often not even stored in the database). So encrypting those could be privacy helpful, if you can make the per-user encrypted store better at deletes than in general.
For person to person messages, if you have a separate record for the sender and the receiver, you can do some per-user transformation on the other correspondent, but that might be indexed, so data without keys would still show messaging patterns.
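That per-user transformation on the correspondent can be sketched with a keyed hash. This is an illustrative sketch only: the per-user secrets, `pseudonym` helper, and 16-hex-digit truncation are all assumptions, not a description of any real messaging system.

```python
import hashlib
import hmac
import secrets

# Per-user secrets held in a key store (assumed, for illustration).
# Deleting a user's secret makes the pseudonyms in their rows
# unlinkable back to real accounts.
user_secret = {"alice": secrets.token_bytes(32),
               "bob": secrets.token_bytes(32)}

def pseudonym(owner, correspondent):
    """Pseudonym for `correspondent` as stored in `owner`'s copy of a
    message. Keyed per owner, so two users' records can't be joined,
    but stable per owner, so the owner's copy can still be indexed."""
    return hmac.new(user_secret[owner], correspondent.encode(),
                    hashlib.sha256).hexdigest()[:16]

# Each side of the same conversation references the other only through
# an owner-specific pseudonym:
alice_row = {"owner": "alice", "peer": pseudonym("alice", "bob")}
bob_row = {"owner": "bob", "peer": pseudonym("bob", "alice")}
```

As the parent notes, the caveat remains: even without the keys, the shape of the data (row counts, timestamps) still reveals messaging patterns; pseudonymization hides identities, not traffic.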
The law doesn't really care about your notion of what can be done efficiently; if it can be done, then you should probably do it or risk being found in violation of the law. There is some very specific language to that effect in the GDPR to make it clear that 'whatever you could reasonably do' needs to be done. You can then go and argue that you felt it wasn't reasonable, but I doubt that will fly.
But you're totally right that this is a tricky problem and hard to do properly. On the plus side, field level encryption is something that we've been doing for ages and that already works quite well. If you design carefully you can even get some processing done on those fields though it will take some major breakthroughs in DB engine design before you can have your cake and eat it too, in the sense that you can be both legally compliant and do all the kinds of processing you can do today. I even doubt if that is desirable, lots of those examples of processing should probably not be done in the first place.
Encryption at record granularity introduces two technical problems that no one has come up with a tractable solution for, and a solution likely doesn't exist. It is actually a discussion of fundamental tractability, not "efficiency". The law can declare that we should be able to break AES encryption too but that doesn't manufacture plausibility.
First, encryption has a block size. Data field storage in databases is typically measured in bits, as close to the information-theoretic limit as practical. Storing an 11-bit datum for some column as a 128-bit AES block represents more than a 10x expansion in storage cost. We could go back to the old row storage model, which would allow the record to be encrypted as contiguous memory, but that would both bloat storage (for different reasons) and we get to relive the golden age of very poor query performance. Modern databases are built on succinct representations because throughput is memory-bandwidth bound. Even in conventional databases, ignoring this design detail will cost you 100x in throughput.
Second, keys and key schedules thoroughly thrash the CPU cache for scan operators. In a typical data model, a single row will be much smaller than the key schedule for decrypting that row. Every single row, several thousand per page, will require an unpredictable cache line fill to access the required key. Then you have two choices: compute a new key schedule for each row, which will be computationally expensive, or precompute the key schedule and take the even larger RAM hit. In large scale-out databases, many gigabytes of key infrastructure will need to be locally cached in RAM on each server -- you can't afford a network hop or page fault -- to decrypt each row in a page scan. Key management state consumes most of your runtime resources and crowds out the data model. I've worked on the design of such schemes in real systems, you end up devoting almost all of your cache/RAM to key state to the exclusion of the actual data model. Also burns up quite a bit of precious memory bandwidth without doing any real work.
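To make the sizes concrete, here is the arithmetic for AES-128 (the 176-byte figure is the standard expanded-key size; the row-size comparison is illustrative):

```python
# An AES-128 key schedule is 11 round keys (the initial key plus one per
# round for 10 rounds) of 16 bytes each: 176 bytes of derived state per key.
rounds = 10
round_key_bytes = 16
schedule_bytes = (rounds + 1) * round_key_bytes
print(schedule_bytes)  # 176

# A narrow row -- a handful of small integer columns -- is often well under
# 100 bytes, so with per-row keys the decryption state outweighs the data
# it protects, before counting key-management metadata.
typical_narrow_row_bytes = 64
print(schedule_bytes > typical_narrow_row_bytes)  # True
```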
Any database built on encryption for record-level physical deletion will be unusable for almost any modern application for well-studied technical reasons. It will work, in theory, if you can run your business on a database that performs and scales like it is 1995.
The best technical solution for physical deletion today is to rewrite cold storage, which is still extremely expensive and has extremely low delete throughput if you do it synchronously, but at least it doesn't break database computer science. The only high-throughput and economical way to implement rewriting is asynchronously, with very long deferrals, e.g. over 30 days. Which is how databases have always worked, but with the deferral being indefinite.
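A minimal sketch of that deferred-rewrite pattern (names and the 30-day deferral are illustrative, not any particular engine's implementation):

```python
import time

def compact(segment, tombstones, deferral=30 * 24 * 3600, now=None):
    """Rewrite a storage segment, physically dropping records whose
    tombstone is older than the deferral window.

    segment:    dict of record_id -> bytes (the on-disk data)
    tombstones: dict of record_id -> unix timestamp of the delete request
    """
    now = time.time() if now is None else now
    survivors = {}
    for rid, data in segment.items():
        ts = tombstones.get(rid)
        if ts is not None and now - ts >= deferral:
            continue  # deferral expired: drop the record for real
        survivors[rid] = data
    # In a real engine the survivors are written to a new file which
    # atomically replaces the old segment; the old bytes then go away.
    return survivors
```

Until the compaction runs, the "deleted" bytes are still physically present, which is exactly the gap between a soft delete and a hard one.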
The tricky bit to me is that fundamentally deletion should not be harder than insertion, but because we have historically only focused on the happy path, nobody cares at all about the deletion mechanism. Whereas a secure erase-in-place option for individual fields alone would already take you 80% or more of the way to a workable solution.
Perfection, the way you describe it, is for now unattainable. But the bigger problem, as far as my practice tells me, is that people simply don't care: they set the 'deleted' bit and leave the plaintext records + backups + log files all untouched.
The low hanging fruit is pretty much dragging the ground.
Sure, you can trivially design a data structure where insert and delete have the same cost. This works if you don’t care about query performance; most people care about query performance a great deal. This was litigated in the marketplace decades ago. Even every open source database rejects designs that produce rubbish query performance. It also does not reflect the real-world distribution of operations between insert and delete — we do a lot more inserts.
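A toy illustration of the trade-off (a sketch, not a real storage engine): in a log-structured store, insert and delete really do have identical cost, and the bill comes due at query time.

```python
class TombstoneLog:
    """Insert and delete are both O(1) appends; reads pay for it."""

    def __init__(self):
        self.log = []  # (key, value_or_None) pairs, newest last

    def insert(self, key, value):
        self.log.append((key, value))

    def delete(self, key):
        self.log.append((key, None))  # a tombstone: same cost as insert

    def get(self, key):
        # Every point lookup scans the log newest-first: O(n) per query.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None
```

Real engines bolt indexes and compaction onto this to claw back query performance, and that is precisely where deletes stop being cheap.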
The “erase-in-place” operations you mention were common in databases a few decades ago and abandoned because the typical performance was terrible compared to the alternative. I am old enough I even implemented a few. It isn’t like these designs didn’t exist, they were deeply flawed for reasons that apply today.
This has nothing to do with perfection. Database engineers would love it if deletes were inexpensive even in the absence of hard-delete requirements, as that would make update operations, which people want, dramatically cheaper. But that isn’t the reality. You are essentially re-litigating settled database kernel engineering without understanding why these systems are designed the way they are.
If you think there is “low hanging fruit dragging on the ground” then I encourage you to prove it by designing a useful database kernel that can scale deletes while preserving insert and query performance, the two operations that drive all the economics in databases. You’ll be instantly famous as a computer scientist, because there are some nasty theoretical computer science problems in there.
They probably know the odds of an "investigation" actually occurring is very low. The odds of one occurring with that level, or probably any level, of technical sophistication is near zero.
From regulators: yes, they typically work reactively; until there is some kind of breach you'll never hear from them. From investors, customers, and potential acquirers, audits are pretty common and becoming more so every day.
We know that HN is visited by a fair share of Facebook employees.
Can some of you weigh in (anonymously?) on this topic? Do you guys do hard deletes of user data instead of just soft deletes? If so, are logs or backups kept? For how long?
In other words: if I'm a user of $POPULAR_SERVICE and I delete my account at time t0, is there a t1 > t0 after which every trace of my data is gone from the platform?
Google takes deletes seriously. Extreme efforts go into deleting stuff within 30 days of the user requesting deletion.
Imagine how hard that is when a datacenter is switched off for 14 days for maintenance, and then a fire breaks out and takes it offline for a further 20 days... When something is powered off, it's very hard to do those deletions... Yet misses of the deadline are exceedingly rare, even in cases like the above.
Sometimes disks are crushed in a crusher to meet the deadline if software approaches to deletion can't be done in time.
Excellent. Thank you for posting this. If anybody, I would expect Google to take this seriously; they have the eye of Sauron on them continuously and cannot afford to trip up.
As does Microsoft. I'm always baffled at these "soft deletes are normal" threads. Maybe at a bootstrapping startup, but most real tech companies implement hard deletes within the first couple years.
> But clicking delete or unsend on a photo is not that.
In Google, a user clicking delete is treated exactly the same as a written deletion request.
In fact, the law requires that they be the same - "Therefore, an individual can make a request for erasure verbally or in writing. It can also be made to any part of your organisation and does not have to be to a specific person or contact point.". (https://ico.org.uk/)
Sorry, but no. That is a deletion request. The GDPR tells you exactly what to do once such a request is made. There is no such thing as a 'specific GDPR deletion request'.
That's true, but regulators operate outside of that and will take the intent rather than the letter of the law to heart, and the GDPR is quite specific in its language.
A person who may decide to bring suit however should always cite chapter and verse to lay down the line and to indicate that they are very serious about it. Just the fact that you would be citing that article will likely give you a better chance of seeing your request honored. But if your request is refused and you decide to tip off a regulator it won't make all that much of a difference, they will do their own investigation outside of the particular case and may broaden/narrow the scope of that investigation as they see fit.
This has already surprised more than one company by the way, they decided to play fast and loose with a single individual and as a result found their whole infra and processes under review with plenty of things found out of order. Fines were handed out that were higher than what it would have cost to arrange things properly in the first place.
Soft deletes are generally a good idea. Being able to recover data for some time is very useful. Ideally this is also exposed to users. However, respectable companies will keep this time-limited. For example, Google has very strict deadlines on wiping all of its deleted data. Its privacy policy is quite vague but gives some examples: https://policies.google.com/technologies/retention
I work at Facebook, and I even work in storage, but even that might not make my experience as complete or relevant as you might think. You see, I work on one storage system. There are others that get primary data before us and still others that get it after us (for longer-term backup). There are systems above us that do their own replication on top of the service that we provide. There's a system off to the side to do all sorts of analytics on that data, which often involves copying some pieces of it. In fact, that system is our biggest internal customer, even bigger than the one that sits in the "normal" I/O path. There are systems whose whole purpose is to move data around between these others, which naturally requires some buffering.
You're probably starting to see the problem here. It's that the data actually exists in many systems, big and small, all of them with different processes and staffed by different teams. So deletion is really not a single operation but a coordination of many actions, relying heavily on a complex system of attribution and provenance to find all the places that each piece of data (among literally trillions) went.
All of this infrastructure is huge and it's active. I've been pinged many times while oncall to provide information or take actions in support of it. Every log stream, every database table, has to be carefully scrutinized to see if it could possibly contain user data, no matter how remote that possibility might be. It really is something we work hard at, and I know we're not perfect but anyone who says it's because we don't care is talking out of their ass. We're merely human.
That said, I'm hard pressed to explain the particular scenario in the OP. It seems to me that, no matter what other mechanisms are in place, there should be an egress filter to provide that One Last Check on data leaving our custody, and that should have kicked in here. But I have almost no interaction with Instagram from where I sit, so I can't speak for them any more than anyone else here can. Nor should I try. Probably said too much already.
I have actually worked at Facebook, but not on deletions. So I’m probably better informed than the other random speculation in the comments replying to you, but still, take it with a grain of salt.
My understanding is that deletion is a hard problem with entire teams working on it (imagine how many different random systems data flows to...) but that yes, the intended behavior is for deleted data to really be gone after 30 days. This is necessary to comply with GDPR and various other laws.
Of course, if two people each own a copy of a piece of data (for example, messages person A sent to person B), then person A deleting their copy won’t affect person B’s copy (just like how emails work).
Contrary to popular belief, Facebook doesn’t actually have anything to gain from nefariously storing data you delete. Ad targeting has plenty of non-deleted data to train on; Facebook has no incentive to break the law to keep tiny amounts of dubiously useful extra data on the margins. I’m almost certain the issue described here was genuinely a bug.
If you weren't working on deletions it would seem to me that you are not better informed at all.
The rest of what you write is an open book to anybody in tech. And whether Facebook has anything to gain or not from nefariously storing data you delete is a much lighter shade of gray than building up shadow profiles, profiles on people without an account.
For non-cynical reasons, no. It's basically 90 days for FB data, which is mostly because the majority of logs get deleted after 3 months.
However, there's usually some large slice of user data under legal hold, which legally can't be dropped as it's pertinent to some random long running court case, so not every trace of your data is gone.
This is how most services work at scale. It's much cheaper to set a flag than actually delete an entry in a database. The data can then be scrubbed by some periodic maintenance process.
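A minimal sketch of that pattern (table and column names are made up for illustration): the delete handler flips a flag, and a maintenance job does the physical delete later.

```python
import sqlite3
import time

# Soft delete at its simplest: flip a flag now, scrub later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, data BLOB, deleted_at REAL)")
conn.execute("INSERT INTO photos (id, data) VALUES (1, x'ff')")

def soft_delete(photo_id):
    # Cheap: a single indexed UPDATE, no cascading work.
    conn.execute("UPDATE photos SET deleted_at = ? WHERE id = ?",
                 (time.time(), photo_id))

def scrub(retention_seconds):
    # Periodic maintenance: hard-delete rows whose grace period expired.
    cutoff = time.time() - retention_seconds
    conn.execute("DELETE FROM photos WHERE deleted_at IS NOT NULL AND deleted_at < ?",
                 (cutoff,))

soft_delete(1)
# The app filters on deleted_at, so the photo looks gone to the user,
# but the bytes are still on disk until the scrub job runs.
scrub(retention_seconds=-1)  # negative retention forces the scrub in this demo
```

The whole thread's complaint is about what happens when the scrub job is missing, broken, or simply never scheduled.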
Only that doesn't typically happen. It just sits there, for years or until the company goes bust.
The typical reasoning is that marketing wants to hold on to the data; they will never ever say 'ok, enough, you may delete it' because there is this infinitely small chance that they can re-activate an account, market to it for some other product (no matter that that is against the GDPR), or sell the data to some third party if there ever is a cash crunch or panic. They see data as having positive value no matter what, whereas data that you shouldn't be holding on to is actually a liability.
Did I claim that Facebook never really deletes data?
I just answered the GP, maybe you have your threads mixed up?
But if I were to speculate, I would say that if Instagram does what the article title says, you could already make that claim about Facebook, since Instagram is part of it.
We’re in a thread that specifically asks whether Facebook really does hard deletes, so I took your comment as claiming they don’t.
> We know that HN is visited by a fair share of Facebook employees.
Can some of you weigh in (anonymously?) on this topic? Do you guys do hard deletes of user data instead of just soft deletes?
The problem with just a flag is it slows down queries. You can scrub it later, but that’s just kicking the can down the road and now you have to deal with the consequences of an actual hard delete.
I wonder when popular databases will have some first class support for soft deletion built in.
Based on the conversation here, it would seem like deleting data is an unsolved problem in CS, or some of you just make it overly complex. Every system I have built had deletes, because it is the right thing to do... No team I've been on got away with not doing the right thing because it was hard. We put our heads down and got it done. Smh.
Some of the engineers at these companies forgot their morals as soon as something got hard I guess.
I genuinely laughed at this - I thought the same thing!
If OP and other concerned posters ever downloaded their own account data, I think they would understand how much data companies retain, and this wouldn't come as a surprise.
Snapchat data for example, has chat logs, snap history, what accounts you've added/requested as a friend, and friends that have added you all retained. I bring this up as an example because the idea behind this application was to send a message that would disappear after a variable amount of time :)
I still don't get why we're all writing bugs in production software in the first place. I know it's a time-honored tradition in this field, but I'm starting to think the practice is a net negative -- at least we should cut back a bit to be able to respond to requests like this in a timely fashion.
If a car model has a defect, it has to be recalled and fixed in a timely fashion, otherwise, there will be a fine. Why should software be any different?
As per my layman understanding, the organization needs to comply with a data erasure request within 30-60 days. You cannot retain data for longer without consent or purpose (GDPR).
Is there? I don't know what's true, but I'll just quote this comment by londons_explore from another subthread:
> > But clicking delete or unsend on a photo is not that.
> In Google, a user clicking delete is treated exactly the same as a written deletion request.
> In fact, the law requires that they be the same - "Therefore, an individual can make a request for erasure verbally or in writing. It can also be made to any part of your organisation and does not have to be to a specific person or contact point.". (https://ico.org.uk/)
This is not how it works. A deletion request is all that it takes. The difference here is between what you think the requirements are and what they really are. As an engineering manager for a major payment service provider you should know better and it is disturbing that you do not.
I assume all companies except maybe Apple and Google never delete the data even if they provide the delete option. For example, I still keep getting emails from Mint saying that my credit score has changed even though I deleted all my info from my account.
Some real (understandable) ignorance about how the tech industry works in this article. There should be no expectation that Instagram ever deletes anything permanently from their servers.
The bug was showing users photos that were internally marked as deleted, not that the photos were in fact removed from Instagram's servers.
> There should be no expectation that Instagram ever deletes anything permanently from their servers.
No, there should be this expectation. If I have a photo of myself that I'm not comfortable being saved somewhere, there should be an expectation that when I delete this from a service, that service will actually delete it.
The expectation should be that the service will do what it said, not that it hid everything very well.
Untrue, even on flash-based storage full deletion is possible.
Here's an answer I pulled from reddit:
If you are appropriately using TRIM the data will be obliterated when 'garbage collection' occurs -- forensics will not be able to recover deleted data. This is going to be largely dependent on your OS and Hardware, but if you confirm TRIM is working correctly in your setup, shortly after you permanently delete something (not recycle bin or trash) it will be gone for good.
The main takeaway is that people vastly exaggerate how recoverable disks are, making any effort at all will stop 99.9% of hackers trying to recover data.
Maybe we'll get to that place someday, but that is a far cry from reality, both in industry norms and regulations in place (and frankly, even regulations being discussed at this point).
You are several years out of touch with the regulatory environment. The GDPR is pretty clear about this sort of thing and if you do not actually delete data upon request but pretend you do that might not be construed to be an accident but willful. Your comments in this thread provide further evidence that this is in fact the case.
It's also clear about how to request your data be deleted! Clicking delete in the instagram app does not constitute a request to be forgotten.
The GDPR does not allow for your data to selectively be forgotten either. You can request that they forget all of you, but not individual pieces of data.
What is not revealed is that the actual bug was that an end user found the deleted data, not that the data wasn't deleted. Should be no surprise that Instagram would have similar data retention policies as its parent company.
Maybe you shouldn't share that info in the first place? Even if the provider (Instagram in this case) acts in good faith and deletes the data, third parties would still own it (e.g. US security agencies, various crawlers, etc.)
And here I am dreaming of a world where companies do real deletions of data when a user requests (and also deleting older transactional data that has outlived its utility, including regulatory requirements) and storage prices being lower with a slightly smaller (steady) market for it from the major companies.
On the other hand, storage prices seem to be low enough for all these companies with bulk, long term contracts that developers wouldn’t bother doing real deletes of data.
> storage prices being lower with a slightly smaller (steady) market for it from the major companies.
I suspect that big companies in the market help justify manufacturer R&D into larger and faster and helps justify production capacity. I think we'd have smaller, slower, and slightly more expensive drives without their demand. But, speculation only.
I'm surprised Social Media networks don't convey when your data is actually 'deleted'. This approach seems a little more evident in the "archive" status on Instagram.
A flag to mark data for deletion makes sense at scale, given the number of other automated processes that run more often than a few times a year. But the user should be in control of their information and intent.
A few months ago I requested my data from Discord. Interestingly enough, it didn't include my messages from a server that was deleted some time before that.
Internal sources at Discord confirmed to me (a few months ago, so may be outdated) that deleted messages are fully gone from the servers within hours of deletion.
Deleted files hang out longer but are gone within a month.
I remember seeing messages in channels that were deleted on the server so I am skeptical but it's been a long time since I requested a package. Maybe it's different for server deletion?
I don't know how many times this is going to have to happen before people understand this: when you put something online, assume it is essentially public. Forever. If you don't want it to be public forever, don't put it online.
This is victim blaming and only reinforces the status quo. The way things are isn't the way things have to be. We can change the rules if we work together. Or we can give up and blame the victims.
You can have Instagram delete that photo when you change your mind, but how about the friend who saw it and saved it, or the already-illegal bot that crawled the site and archived it?
"Changing the rules" can reduce data availability, probably well enough for most purposes, and that's good. But it's simply strictly true that once you publish something, you cannot assure it's unpublished. And everyone should know this and act that way.
Depending on other people to change their actions when you've been victimized ensures that you'll always remain a victim. You can't control what other people do. You can control what you do.
I think it's more pragmatic advice than anything, even if all these websites deleted your data properly there'd be snapshots and pictures and backups around forever.
This seems like a reasonable enough perspective, especially for something like a photo. An average person still doesn't have a lot of power though in a world where the de facto standard is that goods and services are provided at a cost of all your data and usually some cash on top.
Opting out of the camera mesh being built in SF is impossible without moving. ISPs monetize our browsing habits, and a typical individual has no recourse -- common suggestions like a VPN just kick the can of trust down the road, and supposedly logless VPNs have repeatedly been shown to actually keep logs. Even if you didn't use any Google services, the AdWords network is on the bulk of websites, and even with an ad blocker is a typical user expected to know that being logged in to Facebook means opting in to being similarly spied on nearly anywhere you visit on the web?
IMO it's reasonable to expect uploaded photos to live forever somewhere (even if GDPR obligates companies to do otherwise), but even learning how to opt out of the rest of what's collected where it's even possible would be a daunting task when undertaken from scratch that inevitably missed some key component and still resulted in being spied on and tracked. With that in mind, and given the scale of the problem, regulation preventing the most egregious of data-related offenses seems prudent.
I want to know if the photos are securely deleted. It's not enough that the mere reference to a file is gone. I want everything overwritten with zeroes, and the photo made properly irrecoverable.
They are definitely not. I’m honestly surprised at how many folks here on HN (a presumably tech savvy crowd) think that Instagram (or any tech co) is “writing zeros” when a user hits a delete button.
I'm responding here because we've hit the limit on the other comment stream, but I truly want you to be better informed on this subject, so here goes:
> Whether data is an image or some other record is immaterial.
This is wildly inaccurate. The content of the data is extremely important in determining how to properly store and (potentially) dispose of it. If the data contains PII, or is covered under PCI or HIPAA, the processes are entirely different. Even under GDPR erasure requests there are specific guidelines for determining what types of data should be deleted and what data should be "vaulted" (aka soft-deleted) rather than hard-deleted. Yes, even under GDPR the regulations say to soft-delete.
> If the user tells you to delete their data you delete it. Full stop. That's the right thing to do and in many places now a reason to get fined if you don't.
This is, maybe, a good goal for us to set as an industry, but let's be perfectly clear: this is not a reflection of reality today, and at its core, this is my only point. I'm not aware of any fines for not hard-deleting a single piece of data but am always ready to be more informed here; please shoot me that source.
That said, besides the technical challenges of actually deleting the data (which I agree are not a reason or an excuse to not delete it, and the GDPR does a great job of outlining this) there are myriad reasons to keep it around. There's a reason MacOS has Trash and Windows has a Recycle Bin, and a few dozen tools in existence for recovering data even once they've been deleted from those places. Many of these use cases are actually beneficial to the user. "Undelete" is a real thing that users expect to be available (often for good reason)...and this is impossible after writing zeros.
Data deletion is a very nuanced subject and treating it as black and white does no one any good.
> Whether your company is based in the USA or not is immaterial.
Again, incorrect. The regulatory body in effect is extremely material. I mentioned USA because those are the rules I am most familiar with.
As engineering manager of a large household name company you really should know better. PII isn't even a term under the GDPR.
But feel free to play fast and loose with this and see where it ends up, as far as I'm concerned it can't happen fast enough that regulators crack down on companies that wilfully ignore the law.
The GDPR does not say to 'soft delete', it says to delete, and that you should make every reasonable effort to do so.
The term 'soft delete' does not occur even once in the reference text for the GDPR. That there are valid cases for soft delete may be true but these are not the norm, they are the exceptions. The owner of the data (end user) has agency. If you wilfully ignore their instructions then you are prime game for the regulators. Whether this data is privacy sensitive or not doesn't matter, the only reason you might be allowed to hold on to it is if there is a legal requirement to do so.
Whether the industry as a whole has not caught up with the law - and Square apparently in particular - is a red herring, it should have caught up by now. That companies chose to spend their resources on other things than compliance isn't my problem, but it will be their problem.
If you are familiar with the rules in the United States but not the ones in Europe then why do you tell me how the GDPR works when clearly you have no clue.
Our compliance officer would have a field day auditing you and you are dangerously incompetent to make all these claims in public representing a company that has a lot of business in Europe and plans to do a whole lot more.
You're right that GDPR doesn't use the TLA "PII", but literally the first definition is "‘personal data’ means any information relating to an identified or identifiable natural person".
I guess? I'm trying to inform folks about how data deletion works. And I'm surprised more people here are unaware of how data is stored at large (and small) tech cos.
You must be aware that Square is operating a service in a regulated field (fintech) and that openly advertising their possibly illegal business practices is not to their advantage. Let's assume that everything you've said is true that could come in handy one day when a regulator is looking to determine the difference between whether this is an oversight or willful. You are making their job a lot easier and Square's situation a lot more difficult.
For your sake I hope that Square plays it as much as they can by the book, and that if they don't that this will cause them to wake up to the fact that all it takes is one employee blabbing online to get them into seriously hot water.
People here are on average quite aware of how data is stored, but they are not always empowered to make the call on how it should be stored. As long as that happens behind the curtain it is a problem, but not a huge problem for any particular company.
You have just raised Square's profile in many ways at once, not in the least by alerting the hacker community to the fact that Square has a lot more data than they should and that their security measures are not exactly top notch. The consequences could be substantial especially when that deleted data turns up in a dataleak. When you represent your employer online be careful about what you say and write.
I assure you nothing Square is doing is illegal or even "possibly illegal". Square takes GDPR deletion and right to be forgotten incredibly seriously. But you must understand that soft-deleting an image that the user uploaded is not in violation of any regulation in place in the USA.
"I'm trying to inform folks about how data deletion works."
Is hard to reconcile with this comment.
Whether data is an image or some other record is immaterial. If the user (or the controller, for that matter) tells you to delete their data you delete it. Full stop. That's the right thing to do and in many places now a reason to get fined if you don't. Whether your company is based in the USA or not is immaterial.
You also responded to an inquiry of Facebook employees as though you were one, when in fact you work somewhere else entirely. I think you mean well but possibly do not understand the implications of your statements here.
Encrypting the data at rest and deleting by destroying the key is likely how this would be done. This way you can also "delete" e.g. tape backups without actually loading the tape and rewriting the whole thing with certain portions removed, which is not really practical.
Yes, and you could also queue files for deletion at a later stage by throwing away the encryption key for a large batch of files which have been queued for deletion.
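A sketch of that crypto-shredding idea: each record gets its own key, and "deleting" the record means destroying the key, after which any lingering copies of the ciphertext (backups, tapes) are unreadable. The XOR keystream here is a toy stand-in for a real AEAD cipher like AES-GCM; do not use it for actual encryption.

```python
import hashlib
import secrets

def keystream(key, n):
    """Toy keystream via hashing a counter -- illustration only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

class KeyedStore:
    def __init__(self):
        self.keys = {}         # record_id -> key (hot, rewritable storage)
        self.ciphertexts = {}  # record_id -> ciphertext (may live in backups)

    def put(self, record_id, plaintext):
        key = secrets.token_bytes(32)
        self.keys[record_id] = key
        ks = keystream(key, len(plaintext))
        self.ciphertexts[record_id] = bytes(a ^ b for a, b in zip(plaintext, ks))

    def get(self, record_id):
        key = self.keys[record_id]  # raises KeyError once shredded
        ct = self.ciphertexts[record_id]
        ks = keystream(key, len(ct))
        return bytes(a ^ b for a, b in zip(ct, ks))

    def shred(self, record_id):
        # The ciphertext may still sit on a tape somewhere; without the
        # key it is indistinguishable from random noise.
        del self.keys[record_id]
```

The key store is small and mutable, so it can live on media that supports fast rewrite, which is what makes this workable for append-only or offsite backups.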
So what if someone sleuths around in the e-waste dept. of Instagram and steals a few hard-drives with literally Terabytes of potentially very sensitive data? That's why I would hope Instagram are encrypting data at rest with something like LUKS.
Data that I wanted to delete isn't necessary for the functioning of the service, so doesn't that mean the GDPR requires it to be hard-deleted within a reasonable time?