Shadow traffic: site visits that are not captured by typical analytics providers (parse.ly)
58 points by ahstilde on Aug 18, 2020 | 68 comments


Okay, the cynic in me wants to write "New Age web designers are stumped by lack of analytics while still refusing to look at their HTTP server log data."

I remember when the ONLY analytics were those you could derive by analyzing your HTTP logs, which have plenty of useful information in them: the source IP address (which can be geo-tagged), a bunch of HTTP headers (which are full of information too), the referrer, and a timestamp that tells you when the request came in. Not to mention session cookies, which take zero JavaScript to implement.
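To make this concrete, here's a minimal sketch of pulling those fields straight out of a server log, assuming the common Apache/Nginx "combined" log format (adjust the regex if your server logs differently):

```python
import re
from collections import Counter

# Regex for the Apache/Nginx "combined" log format -- an assumption
# about the server's configuration; adapt it to your own log format.
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse(lines):
    """Yield one dict of fields per parseable request line."""
    for line in lines:
        m = LINE.match(line)
        if m:
            yield m.groupdict()

sample = [
    '203.0.113.7 - - [18/Aug/2020:10:01:02 +0000] "GET /post.html HTTP/1.1" '
    '200 5120 "https://news.ycombinator.com/" "Mozilla/5.0"'
]
hits = list(parse(sample))
# Aggregate however you like, e.g. a referrer breakdown:
referrers = Counter(h["referrer"] for h in hits)
```

From there, grouping by IP, referrer, or user agent gets you most of the basic reports an analytics dashboard shows.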

I've been retooling my site slowly to only use these analytics (less the cookies) because I value people's privacy while browsing as much as my own. During the transition I've been comparing what I can pull out of the logs vs what Google's analytics gives me. Sure, Google can do wonders, especially if the person is coming from a browser where they are logged into Google. But, as the article points out, they miss everyone running noscript and/or other privacy enhancers like Privacy Badger from EFF.

I don't feel like I'm going to miss Google's added insights.


I recently decided not to use GA on my site in favour of server logs for all the reasons you mention plus one that I see as the most important: it would require me to annoy users with a cookie consent banner.


The last time I processed my own web logs I used Urchin. What are the good choices these days for log processing?


I don't know, I just use Perl to extract the data and feed it into Influx. Then I pull data sets from Influx into numpy and process it however I want.


This is pretty funny, buried in the middle of the article:

> Option 2 – Server-Side Tracking


Followed by advice that this is so technically complex you shouldn't even look there.

These guys can write JavaScript that animates a web page so that it looks like a turning page, and it's "too technical" to pull data out of a server log?


The article talks about sending first-party analytics events to an analytics provider from your own servers. So the server-side tracking the article refers to is similar to the server-side tagging Google recently announced[1], not the analysis of server logs.

[1] https://developers.google.com/tag-manager/serverside


Ok, thanks for the correction.


I recognized this a couple of years ago at a startup I work with. Comparing Google Analytics numbers to validated event logs, the numbers were off by ~20-30%. Surely there must be a quick workaround, I thought; there's no way there's an entire multi-billion-dollar industry of 3rd-party analytics software giving bogus numbers to websites?! But that's indeed the case. I immediately made it a top priority to build out an in-house analytics platform where event logs were sent via the API and thus didn't get blocked.

And for those saying relative direction is all that matters, I guarantee you the behavior of users with adblocker installed is very different from those who can't be bothered or don't know how.


Let me tell you what was fun - trying to explain to a former boss why Facebook ads showed one number for the amount of traffic sent, Google Analytics showed a different number for traffic from those ads, and lastly our server logs showed an entirely different number!


For clarity, so you’d recommend web server logs over client side analytics?

If so, which open source web server analytics tool do you believe is best? E.g. https://goaccess.io


I had an article on the front page of Hacker News last year that had about 17,000 real visits, as determined by analysing my server log files. I was also using Google Analytics at the time, which told me I had 10,000 visitors (of which only 7 were using Firefox!).

Obviously there's a gap between what trackers say and reality, bigger for some demographics than for others.


> of which only 7 were using Firefox

Were reporting using Firefox.

Also, not surprising, Firefox security leaves a lot to be desired.


>Also, not surprising, Firefox security leaves a lot to be desired.

Huh? Blocking google analytics tracking is a positive, not a negative regarding security.


He's talking about user agent spoofing.


I'm not so sure. It shouldn't matter whatsoever what user agent I'm using. In fact, not sending the user agent field would be massively better if only that in and of itself wasn't a unique datapoint.


Still not sure how that ties into Firefox having worse security. User agent spoofing is a privacy item and a positive privacy item.


If Firefox users are disproportionately lying about the user agent for privacy then counts are off. This would impact 1p telemetry as well. If Firefox users are disproportionately running Adblock then they'll be undercounted (it's also entirely possible that Chrome & GA have some kind of thing where even if you're using Adblock they're able to correct the GA data they show you for Chrome users).


Absolute numbers tend to be overrated in analytics. Often relative numbers, like the number of conversions per tracked user matter more. Also, if your product is targeted at privacy-savvy individuals like developers that often use blockers you might be better off using server-side tracking. That seems to have become a lost art though, especially since many sites use CDNs that hide a lot of visits for cacheable content.


If you use CloudFront (AWS) they have the option for server-side tracking built in. You just tell them which S3 bucket to dump the logs into and you get the raw HTTP requests with timestamps. I personally use a service called s3stat which takes those dumps and turns them into pretty graphs.
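If you'd rather process those S3 dumps yourself, a minimal sketch might look like the following. It assumes CloudFront's documented standard-log layout (gzipped, tab-separated, `#`-prefixed header lines, with `cs-uri-stem` in column 8); check the `#Fields` header of your own files before trusting the column index:

```python
import gzip
from collections import Counter

def count_paths(log_path):
    """Count requests per URI in one CloudFront standard log file.

    Assumes the documented standard-log field order, where the
    eighth column (index 7) is cs-uri-stem; verify against the
    #Fields header line of your own log files.
    """
    counts = Counter()
    with gzip.open(log_path, "rt") as f:
        for line in f:
            if line.startswith("#"):  # '#Version' / '#Fields' headers
                continue
            fields = line.rstrip("\n").split("\t")
            counts[fields[7]] += 1    # cs-uri-stem
    return counts
```

Looping this over every object the CDN dropped in the bucket for a given day gives you per-page request counts to compare against your client-side numbers.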


The other side of the coin is the inflated traffic statistics that include all kinds of bots & crawlers with spoofed user agents, and for low-traffic sites like niche personal blogs with a custom domain it can be 90+% of the server-side logged visits.

How proud was the 15-year-old me with my first .com domain, having over 100 visitors per day. Little did I know that the actual number of visitors was much, much less than that.


You should use Server-Side tracking.

There is no reason to design your website in a way that makes your legitimate analysis use cases depend on Client-Side computations.

If Server-Side tracking looks too complex for you, you might want to reevaluate the balance of technical knowledge in your enterprise.


Server-side tracking is what everyone started off with. There is a reason client-side analytics won in the marketplace. They just have a better balance of advantages to disadvantages.


Client side analytics won because setting them up involves copying and pasting some code into your HTML code.


What are those advantages and disadvantages?


The advantage is that you can measure anything that happens in the browser. As a product owner, what you really care about is the experience you give your visitor/customer, and that happens in the browser, not at the server. This advantage has become stronger over time as sites have used more javascript.

One disadvantage is that if your visitor/customer has javascript turned off, you get no data. This was a concern in the early days of client-side analytics, but not really any more.

A more modern disadvantage is that ad blockers might prevent your analytics script from running. However, this is only a problem for client-side analytics packages that are hosted by ad companies, like Google Analytics. It's not a problem with the concept of client-side analytics in general.

EDIT to add:

Another advantage is that only measuring things in the browser makes it a lot easier to exclude non-browser traffic like bots and spiders from your reports.

That's also a disadvantage because you can miss server-only events like "hot-linked" images or PDF downloads straight from Google. On balance, though, we care a lot less today about hot-linked files than we care about excluding automated traffic.

And in my experience, culturally, client-side packages were a huge help in getting management off of pointless vanity metrics like "hit counts" and caring more about human metrics like visits and time.


You missed the biggest disadvantage of client-side tracking: it’s slow. Especially compared with server-side tracking which has typically zero marginal cost because it’s already happening.

Counting the size of the executed JavaScript, because that’s what matters more than the compressed transfer size:

ga.js is 45KB, matomo.js is something like 50KB. The “new breed” of trackers are currently commonly 2–5KB (though generally if they were written more carefully they’d be well under 1KB), but they’re sure to continue growing because such is the nature of code.

By using client-side tracking on a different host name, you’re making the browser establish a new HTTPS connection—which can happen in the background so long as you do it properly, so it’s not of itself a serious performance issue—and parse and execute probably 50KB of JavaScript, which blocks the main thread while executing. On a substantial fraction of the devices your site will be running on, that’ll block for more than a hundred milliseconds (to say nothing of the couple of hundred more of CPU time that parsing took, which wasn’t blocking, but was taking away from other things it could be spent on), and on older and slower devices it’ll be adding several hundred milliseconds.

Seriously, parsing and executing JavaScript is slower than you realise. If your site uses JavaScript of your own, using just one type of client-side analytics is probably slowing useful page load down by 0.1–0.5s.


Try analyzing a TB of logs per day when all you really want is aggregated statistics. Or really any non-trivial amount of data, even 1GB is a problem.

If you want to know click paths you'll need additional data in your logs.

If you want to know how much time the average user spent reading an article on your blog, you are probably out of luck using logs.


> Try analyzing a TB of logs per day when all you really want is aggregated statistics.

You’re making a completely unfair comparison here. Client-side analytics is performing aggregation as it goes; server-side analytics can do just the same, and serious packages in that space do do that. As an example that readily springs to mind, you can feed server logs into Matomo and use it just as effectively as when you feed client logs into it via matomo.js.

On click paths, they're largely just an artefact of aggregation, so long as you can track individual users (which you admittedly won't get out of the box in server logs, and perhaps that's what you were referring to, but you can definitely make it happen). Multiple tabs thwart doing the "navigated from page A to page B" form of path tracking correctly purely server side, but that's not a realistic form anyway and isn't what you're likely to use; rather, you use "page A was loaded, then page B was loaded" and just guess paths to be adjacent loads, which is perfectly compatible with server-side analysis.
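That "adjacent loads" heuristic is only a few lines of code. A sketch, assuming each log record has been reduced to a (visitor_id, timestamp, path) tuple, where the visitor id could come from a session cookie logged server-side:

```python
from itertools import groupby
from collections import Counter

def adjacent_paths(records):
    """Approximate click paths as adjacent page loads per visitor.

    `records` are (visitor_id, timestamp, path) tuples; deriving a
    visitor id from a server-side session cookie is an assumption
    about what your log format provides.
    """
    transitions = Counter()
    records = sorted(records)                     # by visitor, then time
    for _, group in groupby(records, key=lambda r: r[0]):
        pages = [path for _, _, path in group]
        for a, b in zip(pages, pages[1:]):        # consecutive loads
            transitions[(a, b)] += 1
    return transitions

hits = [
    ("v1", 1, "/"), ("v1", 2, "/pricing"), ("v1", 3, "/signup"),
    ("v2", 1, "/"), ("v2", 2, "/pricing"),
]
paths = adjacent_paths(hits)
```

The result is a count of page-to-page transitions, which is essentially what a client-side click-path report aggregates too.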

How much time, yeah, that’s one that you can’t do any sort of good judgement of server-side. Not that client-side tracking is particularly excellent at it. All up though I suspect that so long as your bounce rate isn’t too high you’ll get decent enough figures from reckoning the time until the next page load and eliminating statistical outliers. If bounces are too high the figures could be misleading because bounces are likely to behave differently. But people often put far too much effort into trying to track everything, where it’s commonly quite sufficient to take a smaller sample and extrapolate to the whole.


Most websites aren't Google or Facebook, so they're not going to generate terabytes of server logs. A million DAU will generate at most gigabytes of log activity in a day. It's pretty easy to tune logs to be more compact, and they're highly compressible. Most sites don't have anywhere close to a million DAU. If you can't handle processing a gigabyte of data, it seems like there are fundamental problems with your development team.

I don't know how to help you if you can't look at log time stamps and figure out click paths for users. It was a solved problem twenty years ago.

It really doesn't matter how long it took someone to read an article on a blog. They either saw your ads or clicked affiliate links or they didn't. It doesn't really matter if they finished an article or bounced right off the page. It doesn't matter if they took ten minutes or twenty minutes to read an article. You don't know if their kid interrupted them halfway through or if they're just a slow reader.

That kind of shit is just meaningless made up metrics. It's the Gish gallop of web advertising. Throw a wall of bullshit metrics at site owners or advertisers to justify them paying for the privilege of listening to that bullshit.


> If Server-Side tracking looks too complex for you, you might want to reevaluate the balance of technical knowledge in your enterprise.

Shameless self-plug: if server-side analytics is too complicated for you, consider using the tool that we just launched to help with that (other functionality gets included as well): https://www.nettoolkit.com/gatekeeper/about


What if your site is a SPA? You would not know, for example, the time spent on site, what pages are visited, where exactly users leave, if there are client-side errors, right?


If you're using a SPA you need to build that instrumentation in to match the native behaviour, along with robust error handling using something like https://github.com/getsentry/sentry/ so you can tell when your code is broken client-side where you would otherwise not have visibility.

This is much less likely to be blocked if you self-host it — breaking requests to your server will break the app, whereas blocking common cross-site tracking services is popular because there are few drawbacks for the user.


You can self-host sentry?


Yes – it's pretty easy to run the open source app in your favorite container runtime:

https://hub.docker.com/_/sentry/


Wow, I didn't know that. I remember using it at my last company and we always kept receiving quota warnings, and their higher plans were really expensive.


Then now you have two problems ;)

I guess your option here is to collect metrics in JS and hope that whatever reason those 20% of visitors don't show up in Google Analytics isn't also preventing them from using your site.


Use a hybrid approach:

1. Server side analytics

2. Client side analytics for information akin to what you're asking that Server side misses

3. Crash analytics for client side errors

As mentioned though, you're only getting partial info from some of those options. It also gives you a chance to decide which of these you _really_ need and hopefully eliminate anything you don't


Some things can't be measured on the server side.

% of a video watched for example would be broken on the server side due to player buffering.


You cannot reliably measure that on the client side, either. But if you assume a "normal" use case anyway, you could easily track the percentage of the video stream requested by a client. That should approximate the percentage watched. Of course, you need a certain amount of control over your infrastructure for this. If you outsourced everything to a CDN and thus have no clue what happens to your videos, well, you probably weren't that interested in the data in the first place, were you?
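One way to sketch the "percentage requested" idea: merge the HTTP Range requests one client made for a file and divide the covered bytes by the file size. This is a rough proxy at best, since players buffer ahead of playback:

```python
def fraction_requested(ranges, total_bytes):
    """Fraction of a file covered by the byte ranges one client
    requested, merging overlaps -- a rough server-side proxy for
    "percentage watched" (it ignores buffering ahead of playback).
    """
    covered = 0
    end_seen = -1
    for start, end in sorted(ranges):      # inclusive byte ranges
        start = max(start, end_seen + 1)   # skip already-counted bytes
        if end > end_seen:
            covered += end - start + 1
            end_seen = end
    return covered / total_bytes

# e.g. a 1,000,000-byte file where the client fetched two
# overlapping chunks covering the first 600,000 bytes:
pct = fraction_requested([(0, 499_999), (400_000, 599_999)], 1_000_000)
```

For HLS, the same idea is even simpler: count which segment files were served, as the sibling comment notes.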


> % of a video watched

I would consider that to be private information regardless of the reason. Why should eg Youtube know where I stopped in the video?

> would be broken on the server side due to player buffering.

Would it, really?

You don't think that you can tie together "this much of video was buffered and _possibly_ displayed" is useful information?

You don't think that "60 seconds of a 10 minute video was buffered and _possibly displayed_, and another 30 seconds of buffer was requested every 30 seconds for 5 minutes" is useful information?

You don't think that you can determine that the user stopped watching the video after between four-and-a-half to five-and-a-half minutes of the video had played?


HLS is just chunks of video in different files, track exactly what files are actually served.


This is pretty much a parsely ad. I think the HN crowd is pretty aware that ad blockers, vpns, etc can break analytics.


I think you're too quick to dismiss it. It's one thing to know that it exists and another to recognize that it's somewhere between 20% and 40% of your total traffic, especially unprompted. I've had many conversations where people assumed their traffic numbers were real until this point was raised, at which point everyone metaphorically slapped their foreheads and realized that they had forgotten to take this into account.


another quite fun thing is that this depends on your site vertical and demographic: on some of the esports-related properties i've worked on in the past, adblock rates (that also block analytics) may exceed 80%. i have seen 90% before (along with per-device differences, like you suddenly think almost all of your traffic is mobile - because all desktop users are blocking client side analytics)


When does this actually matter though? Isn't growth (or shrinkage) what you'd normally really care about (e.g. this month we had 10% more DAUs than last month)? I suppose if you changed something to attract tons of new users who disproportionately use AdBlock (for example) this becomes an issue as it wouldn't show up in your metrics, but is that sort of thing common?

I suppose if nothing else it's good to know so you can immediately beef up your numbers +20% in your slide deck for VCs.


What if you're doing anything which doesn't involve logins — public information, advertising, etc. – where users don't otherwise trigger something like account creation / logins?

What if you're trying to get stats about people who don't convert or otherwise give you a signal that they're using the site?

I've run into sites where things like signup or checkout are blocked behind an analytics tracker (Adobe used to recommend running theirs in a synchronous navigation-blocking mode) which meant that any problem with that service was completely invisible unless they contacted you to complain.

I also remember people wondering why Firefox users stopped using their site when they shipped the release which enabled tracking protection by default.


Good points, thanks!


I agree about this being a parsely ad. However, I also think there's a large minority of HN users who aren't aware about just how client-side analytics are broken or by how many of their users.


I get wanting to create valuable thought leadership content, but this is the worst example of Bad product marketing:

1- present a new concept to readers (shadow visitors)

2- show how this concept is scary and bad for your business (your analytics is off by 20%!!!)

3- present 2 options, which by the way are free, but immediately shit all over them (Server logs! But that’s hard and complicated! Edge logs! But getting these is hard!)

4- present your company’s product as option 3, which, surprisingly, has no downsides and isn’t shit upon

5- profit

What disingenuous garbage. You should be ashamed, Parse.ly.

The right way to do this is to do steps 1 and 2. Then show in detail step 3: how to solve the problem with easy options, ideally with free and open software. It’s ok to show edge cases, corner cases, or just sheer scale issues that make these options challenging.

The difference is that good product marketing pieces show people how to solve the problem and offer a solution to do that at scale or in an automated/hosted way so the customer doesn’t have to deal with it.

If your product marketing content’s message is “you are screwed unless you buy our product”, you are doing it wrong.


Yeah, this article just read super weird to anyone with even a basic understanding of the field. It's very clearly targeted at managerial folk, and even then it is as shallow in its arguments as it is transparent in its motives.


Server side tracking is useful for logging http requests. Client side tracking is useful for logging user interactions. Used to be there would be a small difference between server and client due to caches and user settings... but... Modern apps (i.e. React, Vue, Angular) often only load one page, and then all interaction is managed by client side code, so often client side tracking is the only thing that works.


Obvious question: how do you filter out bot traffic with server side logs? What percent of visitors are bots anyway?


Most legitimate bots identify themselves with specific user agent strings.

Script kiddie attack bots are generally fairly obvious as they hammer away at things like /wp-login.php for days on end regardless of what error codes the server returns.

Most other bots are pretty evident just by looking at access patterns. Just identify their IPs and drop them from your analytics.
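As a cheap first pass, the two signals above (self-identifying user agents and script-kiddie probe paths) fit in a few lines. The token and path lists here are illustrative, not exhaustive; tune them against your own logs:

```python
BOT_UA_TOKENS = ("bot", "crawler", "spider", "slurp", "curl", "wget")
PROBE_PATHS = ("/wp-login.php", "/xmlrpc.php", "/.env", "/phpmyadmin")

def is_probable_bot(user_agent, path):
    """Cheap first-pass filter: self-identifying crawlers plus the
    probe paths attack bots hammer regardless of response codes.
    Both lists are illustrative starting points, not exhaustive.
    """
    ua = (user_agent or "").lower()
    if any(token in ua for token in BOT_UA_TOKENS):
        return True
    return path.lower() in PROBE_PATHS

human = is_probable_bot("Mozilla/5.0 (X11; Linux x86_64)", "/blog/post")
bot = is_probable_bot("Googlebot/2.1 (+http://www.google.com/bot.html)", "/")
```

Anything that slips past this (fake Chrome from residential IPs, as the comment below notes) needs the access-pattern and ASN analysis instead.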


User-Agent for nice bots, and client ASN for naughty bots will get you pretty far. Fake chrome from residential IPs is hard to detect though.


There are off-the-shelf open-source libraries for this that are pretty decent and are kept up-to-date by the community. For example, you can just do browser.is_bot? after you install https://github.com/fnando/browser#bots


Watching the bot traffic is an interesting exercise in itself. The trick is not to filter it out; it's to identify it (to the greatest degree possible.)


Does hosting client-side tracking on your own domain circumvent all the problems? How come that hasn't become the standard and killed 3rd-party trackers? If it's a question of having to manage an analytics platform, can't that still be deferred to a 3rd-party but through your own subdomain?


My interpretation of how Google Analytics does this is that pulling in a 3rd party dep lets them keep a lot of control over versioning and similar things. I assume they do this so out-of-date or vulnerable code (or extra tracking, sure) can be cleanly and quickly applied through the network and not rely on the clients to properly update "ga.js" a few times a day.


Funnily enough, I got a warning telling me uBlock prevented this page from loading...


What does parse.ly do differently to account for this discrepancy in the analytics?


Good question. I explain some of the technical approaches we've taken with customers in another comment here:

https://news.ycombinator.com/item?id=24205803


I'm one of Parse.ly's co-founders. This post was written by one of our product managers about a project and investigation we've been doing for the past few months. It first got on my team's radar when I posted this set of tweets back in 2019:

https://twitter.com/amontalenti/status/1165262620959617025

Specifically: I noticed a huge difference between the metrics we were reporting on my blog post in Parse.ly, and the metrics being reported by my personal blog's Cloudflare CDN (caching the content).

Ironically enough, this traffic was all coming from HN and the post was itself about modern JavaScript[1].

Since then, we've also been hearing from a lot of customers about various scenarios where traffic is either under-counted or mis-counted. For example, something that has been tripping us up lately is that our Twitter integration relies (partially) upon the official t.co link shortener[2], and yet, due to modern browser rules related to W3C Referrer Policy[3], the t.co link's path segment is often not transmitted to the analytics provider, and thus the source tweet for traffic cannot be easily ascertained.

I firmly believe in privacy and analytics without compromise[4], so the team is trying to come up with ways to at least quantify shadow traffic at an aggregate level, and to ensure legitimate user privacy interests are honored, while making sure they don't break legitimate privacy-safe first-party analytics use cases.

As a developer, something that concerned me recently was realizing that Sentry, the open source error tracking tool with a SaaS reporting frontend and a JavaScript SDK[5], gets blocked in many conservative browser privacy setups. Though the interest to user privacy is legitimate, I think we can all agree it'd be better for site/app operators to know when certain browsers are hitting JavaScript stack traces.

[1]: https://news.ycombinator.com/item?id=20785616

[2]: https://help.twitter.com/en/using-twitter/url-shortener

[3]: https://www.w3.org/TR/referrer-policy/

[4]: https://blog.parse.ly/post/3394/analytics-privacy-without-co...

[5]: https://sentry.io/for/javascript/


So what do you actually (want to) do in regards to measuring shadow traffic? The blog post tries to convince the reader they should care about shadow traffic and then handwaves away existing solutions, but "we're developing a solution" is as concrete as the post gets as to why I should turn to parse.ly. Now you say "the team is trying to come up with ways to at least quantify shadow traffic at an aggregate level", so it appears that you don't even have a solution yet.

In light of that, presenting "existing analytics services like parse.ly" as one of three solutions on "how to measure shadow traffic" seems borderline disingenuous. If you can do it, why not say so plain and clear? If you can't do it, why do you mention yourself as a solution? Or is it only other services like parse.ly that can do it?

It also rubs me the wrong way how both the blog post and your comment has an undertone of "it would be better if users didn't have as much tracking protection". Just take the framing of your last sentence as an example...


Actually we have a few accidental "solutions" to this problem already in production and we are just trying to figure out which one meets the right overlap of respecting privacy preferences and providing site/app owners with visibility into shadow traffic.

Here they are:

- Server-side proxy: The blog post I linked about The Intercept uses this setup. Basically, a web server run by our customer captures all the traffic inside their cloud or hosting environment. That traffic is then logged and proxied to our data capture server, with some data scrubbed before we receive it (e.g. IP address removed), with data sent via our server-side protocol.

- First-party custom domain: We spin up a server and HTTPS certificate and the customer points their own subdomain (via a DNS A or CNAME record) to that server, which serves as a proxy. We originally built this facility to clarify data ownership in a GDPR context -- where the customer is a data controller and we are a data processor, so the controller owns the domain where data ingest happens.

This and the prior solution have the side-benefit that Parse.ly couldn't do any cross-site linking even if third-party cookies were enabled in the browser. We never do this anyway, but both of these setups make it technically impossible due to the browser rules around cookies and domains, which is a nice security improvement. But it also raises other issues, like the fact that the customer setup is more complex, with more moving parts.

- API connection to CDN. This one is not actually productionized but was merely prototyped. We'd pull basic per-day and per-page CDN server request logs, and compare that to our pageview counts to understand the delta, which is likely mostly shadow traffic. The upside of this solution is that it might be pretty easy to set up for customers; the downside is that we'd have to build connectors for a lot of CDNs, and through market research we have learned that larger customers might use multiple CDNs at once (believe it or not).

- "Fallback" logging of blocked page loads. This one was also just prototyped, but the idea is that some JavaScript code would detect whether Parse.ly JavaScript SDK was blocked from loading, and if so, a basic privacy-safe "this page's analytics were blocked" event would be sent to a domain owned by the customer, perhaps one that ensured scrubbing of all details other than the "fact" that a block event happened at a particular timestamp. We actually prototyped this particular idea on our own marketing site because we ran into issues with Marketo & Parse.ly data vs our server logs and even our lead capture forms. (That is, situations where a lead was captured for someone with "zero pageviews", because their session was shadow traffic but their form fill nonetheless happened.)

Re: your comment that you sense an undertone of, "it would be better if users didn't have as much tracking protection", I have no such personal or professional belief, and I can assure it isn't a view held by our company/team. I understand the motivation for tracking protection and we even suggest use of Mozilla Firefox's tracking prevention option in our privacy policy.

But there's no doubt that it is leading to confusing data discrepancies for site/app owners, and I think site owners have a right to a basic understanding of how much of the traffic they are paying the hosting bills to serve is actually perceptible to their observability/reporting, even if the only detail they get about that visit is "the visit happened", similar to the level of detail they get from server logs or CDN logs as a matter of course.


Thank you for the detailed reply! I understand that the technical details might not be of great interest for who you were primarily targeting with the blog post, but I think the discussion on HN would be much enriched (possibly including actionable-for-you ideas and suggestions) if these points were mentioned in the submission itself.

> I have no such personal or professional belief, and I can assure it isn't a view held by our company/team

Fair enough, it might just be my interpretation that is colored by the context of the content being written by an analytics business. Although I'm still at a bit of a loss what the implication is in the sentence I was referencing. Could you spell that out for me?


Yea, I think the point of the post was to "name a thing", not necessarily to wade into the tech. (Like I mentioned, was written by a PM colleague who has been discussing this issue with customers/prospects.)

That is, aim was to introduce this idea of "shadow traffic", since most of our customers/prospects don't even know it exists! (Whereas, for example, "bot traffic", which can inflate analytics numbers, is a well-known problem.)

The sentence you are referencing is about Sentry error tracking, right? All I was intending to say is that sometimes, tracking protection throws the baby out with the bathwater. The end user wants to avoid creepy ads and privacy leaks. But, instead, they are blocking error tracking tools, whose primary purpose is to catch and fix frontend coding bugs. And then when those same blocking rules become browser defaults, it can end up in a situation where whole classes of users don't have errors tracked/logged, merely because the site owner (quite reasonably) chose to use a SaaS for error tracking/logging, rather than, say, rolling one's own self-hosted system for that commodity use case.


> Though the interest to user privacy is legitimate, I think we can all agree it'd be better for site/app operators to know when certain browsers are hitting JavaScript stack traces.

I don't want site owners to see that something has failed on my machine -- especially if it's something unique about my setup. To suggest otherwise is to miss the point about privacy. So no, I don't agree.



