Friday, June 26, 2009

Lies, Damned Lies, and Eclipse Upload Statistics

Update: Added 4th assumption, Saturday, June 27

As promised, 48 hours ago I downloaded several Eclipse products across all five published platforms using Bittorrent. Now that two days have passed, and my downloads have been made available for others to upload, I thought I'd consolidate the bittorrent upload data for the sake of gauging popularity.

Some notes on my bittorrent process:
  • I downloaded the torrent files at about 10:30pm, two nights ago. I collected the data at about 11:30pm tonight. I'm calling it 48 hours. Anyone who cares enough, call the data police.
  • I did nothing to cap bandwidth for any of these Eclipse distributions.
I collected data on both the megabytes uploaded from my bittorrent client and the ratio of upload to download. Since all copies of my products were fully downloaded, an upload ratio of 2.5 means that virtually 2.5 copies of that distribution were uploaded from my machine. Of course, bittorrent doesn't ship data in full files; it ships pieces of the distributions, here and there.
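To make the ratio concrete, here is a minimal sketch of the calculation. The function name and the sample numbers are illustrative assumptions, not values from the data I collected.

```python
def upload_ratio(uploaded_mb: float, size_mb: float) -> float:
    """Return how many full copies' worth of data were uploaded."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return uploaded_mb / size_mb


# Hypothetical example: a 190 MB distribution with 475 MB uploaded.
ratio = upload_ratio(uploaded_mb=475.0, size_mb=190.0)
print(f"{ratio:.1f}")  # prints 2.5
```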

These are my three primary, and therefore potentially disputable, assumptions:
  • More popular products will be uploaded by more people.
  • Upload ratios are a better measurement of popularity than megabytes uploaded. If product X is 50% larger than product Y, equal bandwidth dedicated to X and Y does not denote equal popularity. I just think people using bittorrent aren't really worried about the size of their Eclipse.
    • Similarly, I consider negligible any difference in compression ratios between the win32 zip file format and other gzipped tar files.
  • 48 hours is more than enough time to collect data, and taking more than the first 48 hours of data will not yield significantly different results.
    • I presume that files that are not as well seeded as others will take more time to initially download, and as such, will not contribute much to the other uploads during the first part of this process, and so may exaggerate the results slightly. Given that, I suspect the correctness of the 48-hour window will be the most disputed assumption.
  • Update: fourth assumption: People don't care about how long it takes, if they're using bittorrent. I assume it's a "set and forget" type of tool.
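The size argument in the second assumption can be sketched with made-up numbers. The sizes and upload totals below are hypothetical, chosen only so that product X is 50% larger than product Y.

```python
def upload_ratio(uploaded_mb: float, size_mb: float) -> float:
    """Uploaded megabytes expressed as a number of full copies."""
    return uploaded_mb / size_mb


size_y = 100.0          # hypothetical size of product Y, in MB
size_x = size_y * 1.5   # product X is 50% larger
uploaded = 300.0        # the same megabytes uploaded for each product

ratio_x = upload_ratio(uploaded, size_x)
ratio_y = upload_ratio(uploaded, size_y)
print(ratio_x, ratio_y)  # prints 2.0 3.0
```

Equal megabytes here amount to two copies of X but three copies of Y, which is why the charts are built on ratios.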
In retrospect, this turned up some really cool data. I'm very sorry now that I did not download all distributions. This analysis suddenly suffers from their absence.

Let's start with downloads by product:


Here, the clear winner is the JEE distribution. Modeling, which is heavily discussed on the modeling blog and had a crazy number of talks at EclipseCon, has just under 25% of the popularity of JEE. I haven't used WTP in a while but I hope, if it's this popular, that the developer docs reflect the popularity. (Please?)

Next we move to downloads by operating system:


Wow, Windows, huh? That's a surprise, and also, not really a surprise. I'd love to see how these numbers compare next year. Will we see a complete inversion of Linux 32 and Linux 64 in a year? Two years? I predict four years.

Let's look at all the download ratio data without grouping by products or operating system.


This chart really highlights both the JEE and Win32 popularity. I'm pleased to see the CDT platform is well used by the Linux32 community. I wonder why, on the Classic platform, OSX Carbon is uploaded slightly more than OSX Cocoa? Are variances on that scale negligible?

I hear the Linux community say "Rob, come on. We just shut down the spin machine from that ludicrous browser comparison. How about something that reflects reality?" Reality, whatever. You punks are lucky drawing charts is fun, so I'll do one more for ya. Here's the same data, with OSX and Linux products as single pieces of data.


Seriously, this does have something more interesting to say:
  • RCP is more popular among the OSX and Linux communities than Win32.
  • Without JEE, Win32 is not nearly as popular a platform.
  • CDT and modeling are not particularly popular among the OSX community.
  • CDT is loved by the Linux users.
I mentioned earlier that downloads by megabyte are not interesting to me. That doesn't stop me from graphing it. Here's a chart:


and heck, here's another:


Do these last two charts tell you anything different from the download ratio charts? I'll leave that up to someone else to discover. Add it in the comments.

Here are the last three pieces of data I want to share tonight:
Total Uploads: 145
Total MB uploaded: 18,781.1 MB
Upload average (assuming 48 hours): 111.295 KB/s
You could say, then, that 145 people got their instance of Eclipse from my bittorrent client.
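The average above works out if a megabyte is taken as 1024 kilobytes and the result is read as kilobytes per second. A quick sketch of the arithmetic follows; the function name and the 1024 factor are my assumptions about how the number was derived.

```python
def average_rate_kb_per_s(total_mb: float, hours: float, kb_per_mb: int = 1024) -> float:
    """Average transfer rate in kilobytes per second over the given window."""
    seconds = hours * 3600
    return total_mb * kb_per_mb / seconds


rate = average_rate_kb_per_s(total_mb=18781.1, hours=48)
print(f"{rate:.3f} KB/s")          # prints 111.295 KB/s
print(f"{rate * 8:.0f} kbit/s")    # the same rate in kilobits per second, about 890
```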

Thank you, Verizon FIOS. Thank you, Eclipse Foundation!

8 comments:

AlBlue said...

The reason for the lack of CDT adoption on Mac OS X is that it doesn't support Objective C. XCode isn't great, but most OSX apps are written in ObjC (both for mac and iPhone).

Of course, I'm trying to change this state of affairs with ObjectivEClipse, so if any other Googlers or readers are interested, I'd love the help :-)

http://code.google.com/p/objectiveclipse

Donald Smith said...

Wow, very cool analysis of the torrent users -- this is greatly appreciated! Thanks.

Denis Roy said...

Thanks for helping out with the bandwidth!

David Plass said...

Cool analysis.

However (and you knew this was coming). People didn't *upload from* your machine. People *downloaded from* your machine.

See you later.

konberg said...

@alblue that makes sense.
@donald/@denis my pleasure. it was fun.
@david yeah, yeah. i need you on my shoulder when i type.

Egon Willighagen said...

This is indeed an interesting analysis... one comment... Linux torrents were not popular, and this would significantly bias the results, particularly in the first 48 hours. I have 4 Linux torrents seeded on my machine: jee and rcp both in 32 and 64 bit. Here JEE is about 6x as popular as RCP.

However, when downloading the RCP version on Wednesday, there were some 5 leechers and 1 seed. From your report, you are very likely that one seed. The funny thing was that this one seed was horribly slow, and you nicely saw me and the four other leechers all have the same share ratio: every bit that came from the seed was quickly spread in the swarm (if you can speak of that, with the given numbers ;)

Now, the number of Linux seeds is 60 (RCP) and 500 (JEE). And that makes a difference. The number of Windows seeds was much higher in those first 48 hours. This means that basically your Linux downloads are seriously biased upwards, and considerably overestimated, just because Windows had many more seeds.

While you can assume that the numbers would be correct when the ratio seed/leecher is equal between platforms, this no longer holds for low seed numbers. Simple reason: people will more quickly default to a regular FTP server. Actually, when I tried to get people to use the torrent, most said: why?!?

Anyway, I like the technology, but I sincerely hope we can set up a larger bittorrent network, *before* the download goes public. Really, this is done for the mirrors, so why not set up a reasonable number of seeds before the official launch too.

This issue is not limited to Eclipse, of course. Very few large projects actually think about setting up a default torrent seed network, like setting up FTP mirrors. Surely, this should not be too difficult. Would be nice to see the Eclipse project show the rest of the open source community how a large project can set up such a torrent network. Maybe Ubuntu (apt-p2p) or even SourceForge will follow...

konberg said...

Egon, thanks very much for your thoughtful reply! I don't entirely follow you, but here are my thoughts:

First, I don't think I was your seed. I may very well have been one of the five leechers on Wednesday evening. After all, I had to get the content from someone, and I don't know much about bittorrent, but I know enough that there had to be another seed besides me for you to get content from me.

Second, here's a fourth assumption, and I will update the post accordingly.

> While you can assume that the numbers would be correct when the
> ratio seed/leecher is equal between platforms, this no longer
> holds for low seed numbers. Simple reason: people will more
> quickly default to a regular FTP server. Actually, when I tried
> to get people to use the torrent, most said: why?!?

So, my fourth assumption is that people don't care about how long it takes, if they're using bittorrent. I assume it's a "set and forget" type of tool.

Have you ever abandoned bittorrent for FTP? Do you know anyone who has? I am happy to adjust my future assumptions, but I just don't know that to be the case.

I also have heard people ask "why", and to them I refer my previous post. Even Denis Roy isn't quite sure people want that. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=281248 where he thinks people would not want to get a hold of Eclipse a day early just to have to use bittorrent.

For days I've considered posting updated stats (five- or seven-day graphs -- it's very interesting. But also, I may have been capped at 150kbps.) OTOH, there's almost ZERO activity in anything other than Classic, JEE and CDT. NOTHING with modeling and RCP.

Denis Roy said...

The torrents were all available on the Galileo Early access page one day before launch (same time as mirrors etc). I blogged about it too.