
And without an experimental setup (RCTs), how do you know you're not just paying for organic conversions?

I gave this presentation a few months ago: https://www.slideshare.net/mobile/gregak/if-youre-not-measur... I would be interested to hear what you think.


Very interesting presentation. The question your research is attempting to answer is certainly a valid one for major sites, where people might be on the site anyway without having clicked on a given ad. In my specific case, most of the sites/offers we market through Facebook ads wouldn't have attracted many organic visitors, let alone conversions, on their own, so it's not a question I need to answer. These sites rely almost entirely on paid traffic, and if they don't get it they are out of business.

The importance of being able to figure out what actually led to conversions is not lost on me, though. One piece of technology I created allows us to do something I have never seen anyone else in the online marketing world do: track conversions back to the initial click and ad campaign, even if someone just texts, emails, or uses an instant messenger to send the URL to a friend.

So let's say someone visits the site, sees that the offer isn't for them, but texts it to a friend. A month later, that friend finally gets around to looking at it and decides it isn't for them either, but knows someone else who might be interested and emails it on, and that person ultimately converts. We can track that conversion, and all the steps in between, back to the initial click and attribute it to the initial ad campaign, which gives us a much better sense of what each campaign is actually producing.

The technology also lets us create custom Facebook audiences of anyone who has shared a link from the site, regardless of how they shared it - text, email, Facebook, it doesn't matter. We can then customize campaigns to encourage those people to share again.


If close to 0 of your traffic is organic, then you don’t have to care too much about the whole correlation vs. causality problem, yes.

What you describe is certainly interesting. I guess you are building a graph of unique IDs, with each shared URL containing the ID of the parent as a query param or something like that?


Something like that, yes. When you visit any URL on the site, we use JavaScript to rewrite the URL in the location bar with a shortened, unique, trackable URL. So we know both which URL you came in through and the new URL we then assigned to you. With this we can track every click all the way back up the tree to the initial click, even if you just copy/paste the URL or hit the button on your phone to text the page to a contact. Where possible, we also track any link preview engines that visit the URL, so we can usually tell not just that you shared, but how you shared (Skype, Telegram, iMessage, Gmail, Facebook, Twitter, etc.).
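
A minimal sketch of what that rewriting step could look like - the /api/track endpoint, the "s" query parameter, and the /s/ short-URL path are all hypothetical, for illustration only:

  // On page load, ask the server for a fresh short URL tied to this visit.
  // The server records (newId -> parentId), so any later click can be walked
  // back up the chain to the original ad click.
  var parentId = new URLSearchParams(window.location.search).get('s');

  fetch('/api/track', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ parentId: parentId, landingUrl: location.href })
  })
    .then(function (res) { return res.json(); })
    .then(function (data) {
      // Swap the visible URL for this visitor's own unique short URL, so that
      // copy/paste and "text this page" both carry their ID, not the parent's.
      history.replaceState(null, '', '/s/' + data.newId);
    });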

I initially wrote this system so that I could retarget through Facebook ads people that had previously shared viral news articles, but now we have found great applications for it in ecommerce and lead gen as well.


You wouldn't know who shared until someone actually visits the shared link, right? Unless I'm misunderstanding something. Also, I don't see how you would build a custom audience on FB for the people who shared, e.g. by copy-pasting the URL from the location bar. I see how you would do it on some JavaScript event (e.g. page load, click on a share button, etc.), but that's not the same.


Correct, we don't know who shared until someone visits the link. But we can build a custom audience after the fact, because the Facebook retargeting pixel lets you pass an arbitrary ID of your choosing with each pixel load (the variable name is "extern_id" [1]). So when a click comes in on a URL that we know had to have been shared, we look up the extern_id we passed to Facebook on the pixel fire when the original sharer first visited the site. We can then build a custom audience, after the fact, from a list of those extern_ids covering only the people who have shared.

[1] https://developers.facebook.com/docs/marketing-api/audiences... - see "External Identifiers"
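
As a sketch, passing that ID on the pixel fire could look like the snippet below. The pixel ID and the getVisitorId helper are placeholders, and the parameter name simply follows the doc linked above:

  // Assumes the standard Facebook pixel base snippet has already loaded fbq.
  // visitorId is the same unique ID we tied to this visitor's short URL.
  var visitorId = getVisitorId(); // hypothetical helper
  fbq('init', '1234567890', { extern_id: visitorId });
  fbq('track', 'PageView');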


Thanks for letting me know about WikiWand.


Am I the only one who misread this as “Apache Spark”?


+1


GoAccess seems like a great tool. Thanks for pointing it out!


We have ~80TB of (compressed) data in Snowflake at Celtra, and I work with Snowflake on a daily basis. We've been using it in production for about a year. Overall maintenance is minimal and the product is very stable.

Pros:

  - Support for semi-structured nested data (think JSON, Avro, Parquet) and querying it in-database with
  custom operators (see the sketch after this list)
  - Separation of compute from storage. Since S3 is used for storage, you can just spawn as many compute
  clusters as needed - no congestion for resources.
  - CLONE capability. Snowflake allows you to do a zero-copy CLONE, which copies just the metadata,
  but not the actual data (you can clone a whole database, a particular schema, or a particular table). This is
  particularly useful for QA scenarios, because you don't need to retain/back up/copy over a large table - you
  just CLONE it and run some ALTERs on the clone (see the sketch after this list). Truth be told, there are
  some privilege bugs there, but I've already reported those and Snowflake is working on them.
  - Support for UDFs and Javascript UDFs. We've had to do a full ~80TB table rewrite and being able to do this
  without copying data outside of Snowflake was a massive gain.
  - Pricing model. We did not much like BigQuery's per-query model, because it's harder to control costs.
  Storage on Snowflake costs the same as S3 ($27/TB, compressed), while BigQuery charges for scans of
  uncompressed data.
  - Database-level atomicity and transactions (instead of table-level on BigQuery)
  - Seamless S3 integration. With BigQuery, we'd have to copy all data over to GCS first.
  - JDBC/ODBC connectivity. At the time we were evaluating Snowflake vs. BigQuery (1.5 years ago), BigQuery
  didn't support JDBC.
  - You can define separate ACLs for storage and compute
  - Snowflake was faster when the data size scanned was smaller (GBs)
  - Concurrent DML (insert into the same table from multiple processes - locking happens on a partition level)
  - Vendor support
  - ADD COLUMN, DROP COLUMN, RENAME all work as you would expect from a columnar database
  - Some cool in-database analytics functions, like HyperLogLog objects that are aggregatable (also shown in
  the example after this list)
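
To make the VARIANT, CLONE, and HyperLogLog points above concrete, here is a minimal sketch using Snowflake's Node.js driver (snowflake-sdk). The credentials, table names, and column names are all made up:

  // npm install snowflake-sdk
  var snowflake = require('snowflake-sdk');

  var conn = snowflake.createConnection({
    account: 'myaccount',   // placeholder credentials
    username: 'myuser',
    password: process.env.SNOWFLAKE_PASSWORD,
    warehouse: 'QA_WH',
    database: 'ANALYTICS',
    schema: 'PUBLIC'
  });

  // Small promise wrapper around the driver's callback-style execute().
  function run(sqlText) {
    return new Promise(function (resolve, reject) {
      conn.execute({
        sqlText: sqlText,
        complete: function (err, stmt, rows) {
          if (err) { reject(err); } else { resolve(rows); }
        }
      });
    });
  }

  conn.connect(async function (err) {
    if (err) throw err;

    // Querying nested VARIANT data in-database with the path syntax:
    var sample = await run("SELECT payload:user.id::string AS uid " +
                           "FROM events LIMIT 10");

    // Zero-copy CLONE: copies metadata only, so it's cheap even at ~80TB,
    // and ALTERs on the clone never touch the original table.
    await run("CREATE TABLE events_qa CLONE events");
    await run("ALTER TABLE events_qa ADD COLUMN new_col VARCHAR");

    // Aggregatable HyperLogLog: store a per-day sketch, then combine the
    // sketches later for an approximate distinct count over any date range.
    await run("CREATE TABLE daily_uniques AS " +
              "SELECT event_date, HLL_ACCUMULATE(user_id) AS hll_state " +
              "FROM events GROUP BY event_date");
    var rows = await run("SELECT HLL_ESTIMATE(HLL_COMBINE(hll_state)) " +
                         "AS approx_uniques FROM daily_uniques");
    console.log(rows);
  });
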
Cons:

  - Nested data is not first-class. It's supported via the semi-structured VARIANT data type, but there is
  no schema if you use it. So you can't have nested data and a defined schema at the same time; you have to
  pick one.
  - Snowflake uses a proprietary storage format and you can't access the data directly (even though it sits
  on S3). For example, when using the Snowflake-Spark connector, a lot of copying goes on: S3 ->
  Snowflake -> S3 -> Spark cluster, instead of just S3 -> Spark cluster.
  - BigQuery was faster for full table scans (TBs)
  - Does not release locks if a connection drops. It's a pain to handle that yourself, especially if you
  can't control the clients that get killed.
  - No indexes. Also no materialized views. Snowflake allows you to define clustering keys, which will retain
  sort order (not global!), but the feature has certain bugs and we've not been using it seriously yet. In
  particular, it doesn't seem suited to small tables, or tables with frequent small inserts, as it doesn't do
  file compaction (the number of files just grows, which hurts performance).
  - Which brings me to the next point: if your use case is more streaming in nature (more frequent but
  smaller inserts), I don't think Snowflake would handle it well. For one use case we're inserting every
  minute, and we're having problems with the number of files. For another we're ingesting once per hour, and
  that works okay.
Some (non-obvious) limitations:

  - 50 concurrent queries/user
  - 150 concurrent queries/account
  - streaming use cases (see above)
  - 1s connection times with the ODBC driver (JDBC seems to be better)
If you decide on Snowflake or have more questions, I'm happy to help with more specific questions/use cases.


Here are some useful updates and additional information on some of the items mentioned above regarding Snowflake:

- The concurrency limits mentioned above are soft limits that can be raised on customer request (the defaults are there so that runaway applications can be detected easily). Snowflake can handle very high concurrency; we have customers running hundreds of concurrent queries.

- We’ve recently released a new Spark connector with a bunch of optimizations, including additional push-down capabilities that speed up performance significantly.

- The clustering capability is currently in "preview"; we're definitely taking input and working on incorporating the feedback we've received so far.

- One important thing to note when it comes to full table scans is that Snowflake allows you to choose how much horsepower you apply to the job, so you can easily scale a warehouse up to get faster scans.
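
For example, resizing is a single statement. A sketch using the same Node.js driver as further up the thread; the warehouse name is a placeholder:

  var snowflake = require('snowflake-sdk');
  var conn = snowflake.createConnection({ /* same placeholder credentials as in the earlier sketch */ });
  conn.connect(function (err) {
    if (err) throw err;
    // Bump the warehouse up before a heavy scan; size it back down afterwards.
    conn.execute({ sqlText: "ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'XXLARGE'" });
  });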


Forgot to state the disclaimer that I work for Snowflake.


Many thanks! That's by far the most detailed analysis of Snowflake that I've seen.


For me it resolves to 192.30.253.113


That's GitHub.


Can you survive in Sweden/Stockholm without speaking Swedish, just with English? Also, can you post any Sweden/Stockholm job posting sites where startups seek new hires?


Getting around on the streets is no problem; everybody in Sweden speaks quite good English, and even paperwork from banks/government can sometimes be found in English. Finding a job is possible, but you only have 5% of the job market that natives do (95% of all statistics are made up on the spot), and that's assuming you are looking for a position that requires higher education. Finding a basic job, such as waiting tables, without the native language is hard.


I don't have any specific startup hiring sites, but a casual Google search should yield the most common job sites; a resourceful person might even join a few LinkedIn groups and fish for leads there.

You can get by with just English, although that is not too common. Be prepared to study Swedish: the more you learn the language and the mentality, the easier it will be for you.

