Introduction
This article looks at importing an archive of Twitter data into SQL Server 2014. The same process applies to older versions of SQL Server.
Obtaining an NSA™ Approved Twitter Archive
Log on to Twitter using the Web interface (i.e. navigate to http://www.twitter.com) and sign in with your account credentials. Click the “cog” icon on the top right hand side and select “Settings”, then click “Account” in the left hand side menu and scroll to the very bottom:
Request an archive of your tweets
You’ll be sent an e-mail to your registered e-mail address when the archive is ready to be downloaded. My main Twitter account has nearly 10,000 tweets and was established in January 2007, yet the archive was ready almost instantly; by the time I checked my e-mail account, the link was there.
Reviewing the Archive
The archive is a zip file and contains some interesting content.
The simplest way to access the archive data is to load the supplied index.html into any standard web browser.
The offline archive is completely searchable and also surfaces some interesting account information: you can view your User ID (mine’s in the 600,000 range) as well as the date you joined (for me, January 11th, 2007, in a beautiful DB-compatible format).
But wait… there’s more
If you examine the archive folder contents more closely, you’ll also discover something extremely handy: a file called tweets.csv. In my case, it is 1.7 MB in size. Opening the .csv file, I can discern the following column headings:
| Column Name | Potential Usage | Data Type |
|---|---|---|
| tweet_id | Unique identifier of a tweet | Numeric |
| in_reply_to_status_id | In the context of someone else’s tweet | Numeric |
| in_reply_to_user_id | Reply to a specific user | Integer |
| timestamp | Date/time of the tweet | Date |
| source | Origin of the tweet (e.g. web, mobile device) | Unicode string |
| text | Text of the tweet | Unicode string |
| retweeted_status_id | Original tweet (if retweeted) | Numeric |
| retweeted_status_user_id | User of the original tweet | Integer |
| retweeted_status_timestamp | Original date/time of the retweeted tweet | Date |
| expanded_urls | Original URI (before URL shortening) | Unicode string |
Now, for test data purposes this is a little light (the data comes in at just over 9,800 rows), but it should be useful enough for the time being. Remember that Unicode characters are stored as two bytes each; to be on the safe side, I gave the “text” column a length of 300 (140 × 2 = 280, plus a little headroom).
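If the double-byte point is unfamiliar, this little T-SQL snippet (the variable name is just for illustration) shows it directly:

```sql
-- NVARCHAR stores two bytes per character: LEN() counts characters,
-- DATALENGTH() counts bytes.
DECLARE @tweet NVARCHAR(300) = N'Hello, Twitter!';

SELECT LEN(@tweet)        AS CharacterCount, -- 15
       DATALENGTH(@tweet) AS ByteCount;      -- 30
```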
Importing the Data into SQL Server
The first thing I’m going to do is copy the .csv file onto my test VM. The rest of this article is essentially a walkthrough of how to use the SQL Server Import and Export Wizard properly.
In the wizard, change the Data Source to “Flat File Source” and browse for the .csv file. Once you’ve done that, you can verify the data by clicking the “Columns” option in the left hand side nav. Make sure you set the “Text qualifier” field in the “General” options to a quotation mark (“); tweet text routinely contains commas, so without the qualifier you’ll get data conversion issues.
Next, click the “Advanced” option to rename the columns and set their data types and sizes. I applied the data types listed in the table below, along with some column lengths/precisions. This worked for me, but your experience may differ; if you’re at all unsure, use the “Suggest Types…” wizard to help with the selections.
| Column Name | Data Type | SQL Type | Length/Precision | Nullable? |
|---|---|---|---|---|
| tweet_id | numeric | DT_NUMERIC | 20 | No |
| in_reply_to_status_id | numeric | DT_NUMERIC | 20 | Yes |
| in_reply_to_user_id | four-byte signed integer | DT_I4 | | Yes |
| **tweet_timestamp** | date | DT_DATE | | No |
| source | Unicode string | DT_WSTR | 200 | No |
| **tweet_text** | Unicode string | DT_WSTR | 300 | No |
| retweeted_status_id | numeric | DT_NUMERIC | 20 | Yes |
| retweeted_status_user_id | four-byte signed integer | DT_I4 | | Yes |
| retweeted_status_timestamp | date | DT_DATE | | Yes |
| expanded_urls | Unicode string | DT_WSTR | 350 | Yes |
NOTE: I have renamed the columns shown in bold above from their original .csv heading names to avoid using SQL types as column names.
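If you’d prefer to create the destination table up front rather than letting the wizard generate it, here’s a rough T-SQL equivalent of the mappings above. This is a sketch only: the dbo.tweets name is my choice, and the exact SQL type the wizard picks for DT_DATE can vary.

```sql
-- Approximate destination schema for the mappings above.
-- DT_NUMERIC(20) maps to NUMERIC(20, 0), DT_I4 to INT,
-- DT_DATE to DATETIME, and DT_WSTR(n) to NVARCHAR(n).
CREATE TABLE dbo.tweets
(
    tweet_id                   NUMERIC(20, 0) NOT NULL,
    in_reply_to_status_id      NUMERIC(20, 0) NULL,
    in_reply_to_user_id        INT            NULL,
    tweet_timestamp            DATETIME       NOT NULL,
    source                     NVARCHAR(200)  NOT NULL,
    tweet_text                 NVARCHAR(300)  NOT NULL,
    retweeted_status_id        NUMERIC(20, 0) NULL,
    retweeted_status_user_id   INT            NULL,
    retweeted_status_timestamp DATETIME       NULL,
    expanded_urls              NVARCHAR(350)  NULL
);
```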
Select “Preview” to do a quick sanity check. Once satisfied, click Next to set the destination options. For these articles, I’ve created a new database called ‘PreviewTest’. When you get to the “Source Tables” screen, open the “Edit Mappings” dialog and verify everything is set correctly.
When ready, execute the import as a new SSIS package. Getting all the data types right, both for the input definition and the target destination, took me about half an hour.
If all goes to plan, you should have rows inserted into a brand new table.
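A quick row count (assuming the table landed as dbo.tweets, per the sketch above) confirms it:

```sql
-- Should come back at just over 9,800 rows.
SELECT COUNT(*) AS ImportedRows
FROM dbo.tweets;
```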
Result!
After a successful import, I can now query the [tweets] table and see all my glorious Twitter contributions:
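For example, something along these lines (using the renamed columns) brings back the latest entries:

```sql
-- The ten most recent tweets from the archive.
SELECT TOP (10)
       tweet_id,
       tweet_timestamp,
       tweet_text
FROM dbo.tweets
ORDER BY tweet_timestamp DESC;
```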
Post-Import Tidy Up
Let’s do some clean-up. Apply the following changes to the table schema (sketched in T-SQL after the list):
- Create a non-clustered PK on [tweet_id]
- Create a clustered index on [tweet_timestamp]
- Create a non-clustered index on [tweet_text]
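Here’s a minimal T-SQL sketch of those three changes; the index and constraint names are my own invention:

```sql
-- Build the clustered index first; adding it later would force a
-- rebuild of any existing non-clustered indexes.
CREATE CLUSTERED INDEX CIX_tweets_tweet_timestamp
    ON dbo.tweets (tweet_timestamp);

-- Non-clustered primary key on the unique tweet identifier.
ALTER TABLE dbo.tweets
    ADD CONSTRAINT PK_tweets_tweet_id PRIMARY KEY NONCLUSTERED (tweet_id);

-- Non-clustered index on the tweet text; NVARCHAR(300) is 600 bytes,
-- within SQL Server's 900-byte index key limit.
CREATE NONCLUSTERED INDEX IX_tweets_tweet_text
    ON dbo.tweets (tweet_text);
```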
Conclusion
This was a fairly painful exercise for me, as it turns out I’m rather rusty with the data import wizard. For those of you who are also rusty, here are some common (and perhaps, in hindsight, obvious) tips:
- If your data is text qualified, make sure you specify the qualifying value
- If you use Unicode types, make sure you specify a double-byte length
- Remember to check column names to avoid using SQL or .NET types as names
- If something doesn’t look right in the Preview pane, your import isn’t likely to run successfully
Check back for my next instalment where I’ll establish a new schema and start to look at what we can do with SQL Server 2014 and the latest version of the Entity Framework.