Friday, 15 January 2010
Did you know that only 28% of websites make use of Google Analytics? To be completely honest, when I first saw this figure, I was barely surprised. It seems that very few businesses see the value in web analytics.
Let me briefly explain. Without Google Analytics and conversion tracking, you are essentially bringing traffic through to your website, crossing your fingers, and hoping you receive emails and phone calls. It is a bit like operating a checkout register and serving customers blindfolded. Google Analytics allows all your advertising to be measured and optimised based on the results achieved.
Here are a few questions which you, or your web advertising manager, should be able to answer easily:
How did customers find my website? What percentage of traffic came through from Google? How about the Yellow Pages?
How much time did customers who came through to my website spend there?
Which Google keywords keep people on my website for the greatest amount of time?
What pages within my website are most popular?
What page on my website do the greatest number of people leave from?
At what times of the day do the greatest number of customers come and look at my website?
What countries and states does traffic come from?
All this analytical data allows more informed advertising decisions to be made, and helps ensure that the highest ROI is being achieved.
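To make the first question concrete, here is a minimal sketch of the arithmetic behind a "percentage of traffic by source" figure. Google Analytics reports this for you; the visit log and the `traffic_share` function below are hypothetical and only illustrate the calculation.

```python
from collections import Counter

# Hypothetical visit log: one referrer label per visit.
visits = ["google", "google", "yellow pages", "direct", "google", "direct"]

def traffic_share(referrers):
    """Return each source's percentage of total visits, largest first."""
    counts = Counter(referrers)
    total = sum(counts.values())
    return {source: 100 * n / total for source, n in counts.most_common()}

shares = traffic_share(visits)
for source, pct in shares.items():
    print(f"{source}: {pct:.1f}%")
```

In this made-up log, half of the visits came from Google, which is the kind of figure that tells you where your advertising budget is actually working.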
Here's the article from factual.com (http://blog.factual.com/very-large-websites-table-now-on-factual):
Like most respectable geeks, we at Factual get pretty excited about data. And sometimes we get so excited about something that we want to make sure our data geek brethren are aware of it. Today we have something that falls into that category. CommonCrawl.org, a non-profit web crawler, provided a data set of about 4 million websites (primarily hosted at Top Level Domains as well as some popular subdomains) with 30 various attributes. That's about 350MB -- not a shabby corpus of data to be made available to the public. The attributes on these 4 million websites include information on what's on the page (e.g., "contains a Twitter link"), what technology was used (e.g., "server"), and what crawling rules are set up (e.g., "excludes GoogleBot"). The websites come from the CommonCrawl repository, which consists of over 3 billion URLs, and is a reasonable representation of the Internet, not to mention an interesting slice of what's happening on the Web.
A couple of interesting things we noticed were...
28% of websites have Google Analytics -- pretty impressive -- while 12% of the sites have AdSense. (Side note: we're using the count of GetContentResults=Http-200 for the denominator, since it's not fair to count the sites that CommonCrawl was unable to get content from.)
5% of websites have a Twitter link and 5% have a Facebook URL, yet only 2% have both a Twitter and a Facebook URL. It'll be interesting to see how this changes over time.
The top five versions of Apache discovered are 2.2.11 (210,984 instances), 2.2.3 (200,065 instances), 1.3.41 (168,660 instances), 2.2.14 (166,644), and 2.0.52 (97,004 instances).
A bunch of very long names had enough linkage to get included in the crawl. Here's a fun one: http://iwillusegooglebeforeaskingdumbquestions.com/. Apparently they don't use Google Analytics.
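The percentages above depend on the choice of denominator, and the Twitter and Facebook figures overlap. A minimal sketch of both ideas follows. It is not Factual's or CommonCrawl's code: the rows are made-up sample data, and the field names `has_twitter` and `has_facebook` are hypothetical. Only `GetContentResults` and the `Http-200` value come from the article.

```python
# Made-up sample rows; the real table covers about 4 million sites.
sites = [
    {"GetContentResults": "Http-200", "has_twitter": True,  "has_facebook": True},
    {"GetContentResults": "Http-200", "has_twitter": True,  "has_facebook": False},
    {"GetContentResults": "Http-200", "has_twitter": False, "has_facebook": True},
    {"GetContentResults": "Http-200", "has_twitter": False, "has_facebook": False},
    {"GetContentResults": "Http-404", "has_twitter": False, "has_facebook": False},
]

# Only sites whose content was actually fetched count toward the denominator.
fetched = [s for s in sites if s["GetContentResults"] == "Http-200"]

def share(rows, predicate):
    """Fraction of rows for which predicate(row) is true (0.0 if rows is empty)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if predicate(r)) / len(rows)

twitter = share(fetched, lambda s: s["has_twitter"])
facebook = share(fetched, lambda s: s["has_facebook"])
both = share(fetched, lambda s: s["has_twitter"] and s["has_facebook"])

# Inclusion-exclusion: sites with at least one of the two links.
either = twitter + facebook - both

print(f"Twitter {twitter:.0%}, Facebook {facebook:.0%}, both {both:.0%}, either {either:.0%}")
```

Applying the same inclusion-exclusion step to the article's rounded figures, 5% + 5% - 2% gives roughly 8% of fetched sites linking to Twitter, Facebook, or both.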
You can check out the specific regular expression list on the table; just click on the "Discuss" tab. If you have any suggestions or see something wrong, chime in on the thread and let us know!
Since this data set is now on Factual, it is open for the world to share, collaborate, and mash. That's how our data sets roll! However, unlike most of the tables on Factual, this table is read-only, meaning cells can't be edited by inputting new data. Since this is CommonCrawl's analysis, it made more sense to place restrictions on who could change/add to the data.
Of course, there are still plenty of things you can do with the data. And if you want to join or merge data from another table to this one, this won't impact the original table; you simply have to do a "save as" and effectively fork it. However, if you do this, we suggest that you mention it as a thread in the "Discuss" tab. That way the community can track the various related data sets.
All of the data is available through the Factual API and anyone with some programming skills can build innovative apps on top of this table and/or any related ones. The sky's the limit. Indeed, as large as this data set already is, this is just the beginning of life for this table. Hopefully we'll see it grow and improve, and potentially power exciting products.
We believe that open and collaborative data is not only important unto itself, it also drives innovation. Removing the hassles associated with data licensing and data curation (verification, de-duping, updating) can free folks to concentrate on building the apps themselves. Perhaps over time, developers will see data as another layer in the open solution stack. Like the other open software elements in the stack, if others give back to the community, the resources can multiply and quickly surpass closed enterprise versions.
When we shared it with Creative Commons, their CTO Mike Linksvayer emailed us this response: "I am wildly enthusiastic about collaborative, web-scale analysis of the web itself, which is likely the best path to a more complete understanding and appreciation of the impact of Creative Commons. CommonCrawl and Factual are each extremely interesting in this regard, for providing web-scale data and a unique take on collaborative data curation." We couldn't have said it better ourselves.
If you have any suggested attributes or just general questions on how you can use the table, feel free to start a thread on the Factual Developer Google Group.