Importing a Twitter archive into SQL Server 2014

Introduction

This article looks at importing an archive of Twitter data into SQL Server 2014.  The same process can also be applied to older versions of SQL Server.

Obtaining an NSA™ Approved Twitter Archive

Log in to Twitter using the Web interface (i.e. navigate to http://www.twitter.com) with your account credentials, then click the “cog” icon on the top right-hand side and select “Settings”.


Click on “Account” in the left-hand menu and then scroll to the very bottom:

[Screenshot: Request an archive of your tweets]

You’ll be sent an e-mail at your registered e-mail address when the archive is ready to be downloaded.  My main Twitter account has nearly 10,000 tweets and was established in January 2007, yet the archive was ready almost instantly.  By the time I checked my e-mail account, the link was there.

[Screenshot: An e-mail arrives]

Reviewing the Archive

The archive is a zip file and contains some interesting content.


The simplest way to access the archive data is to load the supplied index.html into any standard web browser.


The offline archive is completely searchable and also contains some interesting account information, such as your User ID (mine’s in the 600,000 range) and the date you joined (for me, January 11th, 2007, in a beautifully DB-compatible format).

But wait… there’s more

If you examine the archive folder contents more closely, you’ll also discover something extremely handy: a file called tweets.csv.  In my case, it is 1.7 MB in size.  Opening the .csv file, I can discern the following column headings:

Column Name                   Potential Usage                                  Data Type
tweet_id                      Unique identifier of a tweet?                    Numeric
in_reply_to_status_id         In the context of someone else’s tweet           Numeric
in_reply_to_user_id           Reply to a specific user                         Integer
timestamp                     Date/time of the tweet                           Date
source                        Origin of the tweet (e.g. web, mobile device)    Unicode string
text                          Text of the tweet                                Unicode string
retweeted_status_id           Original tweet (if re-tweeted)                   Numeric
retweeted_status_user_id      User of the original tweet                       Integer
retweeted_status_timestamp    Original date/time of the re-tweeted tweet       Date
expanded_urls                 Original URL (before URL shortening)             Unicode string

Now, for test data purposes this is a little light (the data comes in at just over 9,800 rows), but it should be useful enough for the time being.  Remember that Unicode characters are double-byte; to be on the safe side, I gave the “text” column a length of 300, comfortably above the 280 (140 * 2) that a maximum-length tweet needs.
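To see the double-byte point concretely, compare a Unicode string’s character count with its byte count (a trivial T-SQL check; the literal is just an example):

    -- LEN counts characters; DATALENGTH counts bytes.
    -- For an N'...' (Unicode) literal, bytes = characters * 2.
    SELECT LEN(N'hello, twitter')        AS character_count,  -- 14
           DATALENGTH(N'hello, twitter') AS byte_count;       -- 28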

Importing the Data into SQL Server

The first thing I’m going to do is copy the .csv file to my test VM.  The rest of this article walks through how to use the SQL Server Import/Export tool properly.


In the tool, change the Data Source to “Flat File Source” and then browse for the .csv file.  Once you’ve done that, you can verify the data by clicking the “Columns” option in the left-hand navigation.  Make sure you set the “Text qualifier” field in the “General” options to a quotation mark (“) or you’ll run into data conversion issues.
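To see why the qualifier matters, here’s a made-up example row: the text field contains commas, and without the quotation-mark qualifier each one would be interpreted as a column separator:

    "376907490750132225","","","2013-09-08 22:01:00 +0000","web","Testing, testing, one, two, three","","","",""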


Next, click the “Advanced” option to rename the columns and set the data types/sizes.  I applied the data types listed in the table below, and also set some column lengths/precisions.  This worked for me, but your experience may differ.  If you’re at all unsure, use the “Suggest Types…” wizard to help with the selections.

Column Name                     Data Type                   SQL Type     Length/Precision   Nullable?
tweet_id                        numeric                     DT_NUMERIC   20                 No
in_reply_to_status_id           numeric                     DT_NUMERIC   20                 Yes
in_reply_to_user_id             four-byte signed integer    DT_I4        -                  Yes
tweet_timestamp *               date                        DT_DATE      -                  No
source                          Unicode string              DT_WSTR      200                No
tweet_text *                    Unicode string              DT_WSTR      300                No
retweeted_status_id             numeric                     DT_NUMERIC   20                 Yes
retweeted_status_user_id        four-byte signed integer    DT_I4        -                  Yes
retweeted_status_timestamp      date                        DT_DATE      -                  Yes
expanded_urls                   Unicode string              DT_WSTR      350                Yes


NOTE: The columns marked with an asterisk (*) above have been renamed from the original .csv heading names to avoid using SQL types as column names.
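For reference, the destination table definition ends up looking roughly like the following (a sketch based on the mappings above, with DT_DATE rendered as DATETIME; the DDL the tool actually generates may differ in detail):

    CREATE TABLE dbo.tweets (
        tweet_id                   NUMERIC(20, 0) NOT NULL,  -- DT_NUMERIC, precision 20
        in_reply_to_status_id      NUMERIC(20, 0) NULL,
        in_reply_to_user_id        INT            NULL,      -- DT_I4
        tweet_timestamp            DATETIME       NOT NULL,  -- DT_DATE
        source                     NVARCHAR(200)  NOT NULL,  -- DT_WSTR
        tweet_text                 NVARCHAR(300)  NOT NULL,
        retweeted_status_id        NUMERIC(20, 0) NULL,
        retweeted_status_user_id   INT            NULL,
        retweeted_status_timestamp DATETIME       NULL,
        expanded_urls              NVARCHAR(350)  NULL
    );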


Select “Preview” to do a quick sanity check.  Once satisfied, click Next to set the destination options.  For these articles, I’ve created a new database called ‘PreviewTest’.  When you get to the “Select Source Tables” screen, open the “Edit Mappings” dialog and check that everything is set correctly.
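If you prefer to create the destination database ahead of time, it’s a one-liner (file locations and sizes are left at the server defaults here):

    CREATE DATABASE PreviewTest;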


When you’re ready, execute the import as a new SSIS package.  It took me about half an hour to get all the data types right, both for the input definition and the target destination.


If all goes to plan, you should have rows inserted into a brand new table.
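A quick row-count sanity check at this point (assuming the destination table was created as dbo.tweets; for this archive the count should come back just over 9,800):

    SELECT COUNT(*) AS imported_rows
    FROM dbo.tweets;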

Result!

After a successful import, I can now query the [tweets] table and see all my glorious Twitter contributions:

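For example, here are a couple of simple queries that now work (table and column names as mapped above; your results will obviously differ):

    -- The ten most recent tweets.
    SELECT TOP (10) tweet_timestamp, tweet_text
    FROM dbo.tweets
    ORDER BY tweet_timestamp DESC;

    -- Tweet counts by source (web, mobile client, etc.).
    SELECT source, COUNT(*) AS tweet_count
    FROM dbo.tweets
    GROUP BY source
    ORDER BY tweet_count DESC;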

Post-Import Tidy Up

Let’s do some clean-up.  Apply the following changes to the table schema (sketched in T-SQL after the list):

  • Create a non-clustered PK on [tweet_id]
  • Create a clustered index on [tweet_timestamp]
  • Create a non-clustered index on [tweet_text]
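A minimal T-SQL sketch of those changes, assuming the dbo.tweets table and the renamed columns from earlier:

    -- Create the clustered index first so the non-clustered indexes
    -- don't need to be rebuilt afterwards.
    CREATE CLUSTERED INDEX CIX_tweets_tweet_timestamp
        ON dbo.tweets (tweet_timestamp);

    -- Non-clustered primary key on the tweet identifier.
    ALTER TABLE dbo.tweets
        ADD CONSTRAINT PK_tweets_tweet_id PRIMARY KEY NONCLUSTERED (tweet_id);

    -- Non-clustered index on the tweet text; NVARCHAR(300) is 600 bytes,
    -- within the 900-byte index key limit.
    CREATE NONCLUSTERED INDEX IX_tweets_tweet_text
        ON dbo.tweets (tweet_text);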


[Screenshot: Query with Execution Plan]

Conclusion

This was a fairly painful exercise for me, as it turns out I’m rather rusty with the data import wizard.  For those of you who are also rusty, here are some common (and, in hindsight, perhaps obvious) tips:

  • If your data is text qualified, make sure you specify the qualifying value
  • If you use Unicode types, make sure you specify a double-byte length
  • Remember to check column names to avoid using SQL or .NET types as names
  • If something doesn’t look right in the Preview pane, your import isn’t likely to run successfully

Check back for my next instalment, where I’ll establish a new schema and start looking at what we can do with SQL Server 2014 and the latest version of the Entity Framework.
