Discussion
Successful Software
hermitcrab: If you are putting out data without doing even the most basic validation, then you should be ashamed.
ramon156: What about most of Show HN's projects nowadays? Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed? What about people who don't know how their own code works? Despite it working flawlessly? I'm asking because I don't really know.
Phlogistique: It's better to publish the garbage data than to not publish it, though. I would worry about complaining too much lest they just decide to stop publishing it because it creates bad PR.
GMoromisato: Clean data is expensive--as in, it takes real human labor to obtain clean data. One problem is that you can't just focus on outliers. Whatever pattern-matching you use to spot outliers will end up introducing a bias in the data. You need to check all the data, not just the data that "looks wrong". And that's expensive. In clinical drug trials, we have the concept of SDV--Source Data Verification. Someone checks every data point against the official source record, usually a medical chart. We track the % of data points that have been verified. For important data (e.g., Adverse Events), the goal is to get SDV to 100%. As you can imagine, this is expensive. Will LLMs help to make this cheaper? I don't know, but if we can give this tedious, detail-oriented work to a machine, I would love it.
mlaretallack: I saw the RAC one this morning and thought I was misreading the graph, as why would the RAC publish such an obvious mistake? I have written my own Home Assistant custom component for the UK fuel finder data, and yes, the data really is that bad.
hermitcrab: > Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed? Yes. Lying is bad, even if some people are trying hard to normalise it. > What about people who don't know how their own code works? Despite it working flawlessly? I think that is fine, as long as you aren't making untrue claims.
Calazon: > Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed? Yes.
add-sub-mul-div: This has become a spam site for AI shovelware projects that are nearly always posted by accounts with no activity here outside of self promotion.
albert_e: Concluding passage: > Authors should have their work proof read. Agreed. Opening passage: > A quick plot of the latitude and longitude shows some clear outliners. "outliners" Ouch!
Tempest1981: Agree. Maybe just add a Disclaimer.md file.
chaps: I have mixed feelings about this. On one hand, yeah stop publishing garbage data, but as a FOIA nerd... I'll take the data in any state it is. I'm not personally going to be able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data knowing that it has garbage data within? Hell no. Instead, we should learn and cultivate techniques to work with shit data. Should I attempt to clean it? Sure. But it becomes a liability problem very, very quickly.
hermitcrab: So you expect the 1000s of people trying to use the fuel price data to each individually clean and validate it, rather than the supplier doing it?
akudha: How is it fair to compare a Show HN project with official government datasets? People depend on government datasets, multi-billion dollar businesses are built on top of them. A Show HN project is typically someone building it in a weekend. They’re not even remotely in the same league. Sure, it is expensive to check every number, but at least some of it can be automated and flagged for human review, no? Swapped lat/long numbers, for example.
stared: I dislike the premise. I mean, good data is wonderful. But if institutions are expected to release clean data or nothing, almost always it is the latter. What is important is to offer as much methodology and as many caveats as possible. Because there is a difference between "data covers 72% of companies registered in..." vs expecting that data is full and authoritative when parts of it are missing. (Source: 10 years ago I worked a lot with official data. Like all data, it requires cleaning.)
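A minimal sketch of the kind of automated check suggested above, for anyone who wants to try it. The bounding box values and the names `UK_LAT`, `UK_LON`, `in_box`, and `classify` are assumptions made up for illustration, not anything taken from the fuel price dataset. It only labels a coordinate pair for human review and never rewrites it.

```python
# Rough bounding box for the UK. These numbers are an assumption for
# illustration; a real check would use an agreed reference boundary.
UK_LAT = (49.0, 61.0)
UK_LON = (-9.0, 2.5)


def in_box(lat, lon):
    """True if the point falls inside the assumed UK bounding box."""
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]


def classify(lat, lon):
    """Label a coordinate pair without changing it.

    Returns 'ok', 'possibly_swapped' (the pair only fits the box when
    latitude and longitude are exchanged), or 'out_of_range'.
    """
    if in_box(lat, lon):
        return "ok"
    if in_box(lon, lat):
        return "possibly_swapped"
    return "out_of_range"


print(classify(51.5, -0.12))   # a point in London
print(classify(-0.12, 51.5))   # the same point with the fields exchanged
print(classify(0.0, 0.0))      # nowhere near the UK either way round
```

Because the assumed latitude and longitude ranges do not overlap, a swapped pair cannot also pass as valid, so anything labelled `possibly_swapped` is a reasonable candidate to send to a person rather than to fix silently.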
hermitcrab: Hard disagree on that. They just need a basic smell test before they put it out.
gdulli: Why would you give this sort of work to a machine that can't be responsibly used without checking its output anyway?
hermitcrab: OP here. Ouch indeed. I did actually get it proofread. But that was missed. I can't fire my proofreader, as we are married. ;0) Now fixed.
rdiddly: Not fixed at this hour
hermitcrab: > Clean data is expensive--as in, it takes real human labor to obtain clean data. Yes, data can contain subtle errors that are expensive and difficult to find. But the 2nd error in the article was so obvious that a bright 10-year-old would probably have spotted it.
GMoromisato: Agreed--and maybe they should have fixed it. But sometimes the "provenance" of the data is important. I want to know whether I'm getting data straight from some source (even with errors) rather than having some intermediary make fixes that I don't know about. For example, in the case where maybe they flipped the latitude and longitude, I don't want them to just automatically "fix" the data (especially not without disclosing that). What they need to do is verify the outliers with the original gas station and fix the data from the source. But that's much more expensive.
Frank-Landry: Did a bot write this title?
torginus: What does it mean to clean the data? Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line? If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate? Working with data like this has unknown error bars, and I've had weird shit happen where I fixed the tracing pipeline, and the metrics people complained that they corrected for the errors downstream, and now due to those corrections, the whole thing looked out of shape.
hermitcrab: Or just omit the rows that are obviously wrong (and document the fact).
chaps: Exactly. This is a big problem with "open data". A lot goes into cleaning it up to make it publishable, which often includes removing data so that the public "doesn't get confused". Now I have to spend months and months fighting FOIA fights to get the original raw, messy data because someone, somewhere had opinions on what "clean data" is. I'll pass -- give me the raw, messy data.
chaps: "obviously wrong" is a never-ending rabbit hole and you'll never, ever be satisfied because there will always be something "obviously wrong" with the data. Messy data is a signal. You're wrong to omit signal.
GMoromisato: 100%. There is even signal in the pattern of errors. If you remove some errors but not others, you lose signal.
sd9: Agreed, pretty much all data is flawed. I still want my hands on it.
chaps: "What does it mean to clean the data?" This isn't possible to answer generally, but I'm sure you know that. Look -- I've been in nonstop litigation for data through FOIA for the past ten years. During litigation I can definitely push back on messy data and I have, but if I were to do that on every little "obviously wrong" point, then my litigation will get thrown out for me being a twat of a litigant. Again, I'd rather have the data and publish it with known gotchas. Here's an example: https://mchap.io/using-foia-data-and-unix-to-halve-major-sou... Should I have told the Department of Finance to fuck off with their messy data? No -- even if I want to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here -- once me and others started publishing stuff about tickets data and more journalists got involved, the data became cleaner over time.
GMoromisato: Deleting the row loses some information, such as the existence of that gas station. A better solution is to add a field to indicate that "the row looks funny to the person who published the data". Which, I guess, is useful to someone? But deleting data or changing data is effectively corrupting source data, and now I can't trust it.
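One way to picture the "add a field" idea, as a rough sketch only. The function name `annotate`, the `review_flags` field, the check names, and the sample rows are all hypothetical and not taken from any real dataset. Every row is kept and no original value is changed; the publisher's doubts go into a separate field.

```python
def annotate(rows, checks):
    """Return copies of the rows with a 'review_flags' list added.

    Each flag is the name of a check the row failed. No row is dropped
    and no original value is modified.
    """
    annotated = []
    for row in rows:
        flags = [name for name, check in checks.items() if not check(row)]
        annotated.append({**row, "review_flags": flags})
    return annotated


# Hypothetical checks: only the physical limits of latitude and longitude.
checks = {
    "lat_in_range": lambda r: -90 <= r["lat"] <= 90,
    "lon_in_range": lambda r: -180 <= r["lon"] <= 180,
}

# Made-up sample rows for illustration.
rows = [
    {"id": "A", "lat": 51.5, "lon": -0.12},
    {"id": "B", "lat": 512.0, "lon": -0.12},
]

annotated = annotate(rows, checks)
for row in annotated:
    print(row["id"], row["review_flags"])
```

Anyone who wants the raw values still has them, and anyone who wants a filtered view can drop rows with a non-empty `review_flags` list on their own side, which keeps the filtering decision visible instead of baked into the published source.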
bobro: This article assumes that there is a person with dedicated time to validate the data. Imagine you want this data and ask for it, but the government says, “sorry, we have this data, but we read an article that said we can only publish it if we spend a lot of time validating it. This data changes frequently and we don’t have a chunk of a full-time data analyst’s salary to spend on it, so we just aren’t going to publish anything. We’d rather put out nothing than embarrass ourselves, so you can’t even try to validate it yourself.”
chaps: In fact, the government agencies will argue that they have zero legal obligation to clean the data, let alone figure out anything about the data, and that they're just giving you the data as-is. This happened to me on a FOIA call where I was trying to get data from the county state's attorney. Clean vs not clean data is the wrong fight.