Le Nguyen The Dat

Data Science and Engineering at 90 Seconds

Below are the main reasons why I chose Amazon Web Services (AWS):

  • No upfront cost (although, if you plan to use AWS for long enough, it is generally cheaper to pay upfront for some reserved instances).
  • Security and customer support are largely taken care of for you.
  • Amazon provides a full stack of solutions that covers basically everything I have needed.


To start, here is what you will need:
  • Amazon S3: S3 acts as intermediate storage, as well as a backup location for all your data.
    • Cheap, scalable, fast enough.
    • Easy to use, with many open-source tools available.
  • Amazon Redshift: This will be your Data Warehouse.
    • Massively parallel processing.
    • Easily scalable, up to Petabytes of data.
    • Relatively cheap compared to Oracle or MS SQL Server.
    • Based on PostgreSQL 8.0.2, so it supports PostgreSQL drivers, libraries, and clients.
  • Amazon RDS: This will host your Data Marts (optional, depending on business needs).
  • Amazon EC2: Your workers.
    • To orchestrate and link everything together.
    • To perform ETL tasks.
  • Programming Ability: at least one or two scripting languages.
    • E.g. Bash / Python / Ruby / Perl.

Locate your data sources:
  • Talk to your users, ask around, figure out how people get their data day-to-day, and take notes.
  • Do your research: read up on the third-party companies' APIs and figure out how to get their data programmatically. Talk to those companies and request data dumps or raw data access at any cost.
  • At the end of this step, you need to have a plan on how to programmatically get everything into S3.
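
As an illustration of what such a plan might look like in practice, here is a minimal Python sketch. It assumes a date-partitioned key layout (`raw/<source>/YYYY/MM/DD/<file>`) and uses boto3 for the upload; the layout, the function names, and the bucket name are hypothetical choices for this example, not something prescribed above.

```python
from datetime import date
from pathlib import PurePosixPath


def build_s3_key(source: str, run_date: date, filename: str) -> str:
    """Return a date-partitioned S3 key for one extracted file.

    The 'raw/<source>/YYYY/MM/DD/<file>' layout is an assumption for
    illustration; any consistent, predictable layout works.
    """
    return str(PurePosixPath("raw", source, f"{run_date:%Y/%m/%d}", filename))


def upload_extract(local_path: str, bucket: str, key: str) -> None:
    """Upload one local file to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_s3_key works without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)


key = build_s3_key("crm", date(2015, 3, 7), "users.csv.gz")
# upload_extract("/tmp/users.csv.gz", "my-data-bucket", key)  # hypothetical bucket
```

Keeping the key layout in one small function means every extractor writes to a predictable location, which makes the later load into the Data Warehouse easier to script.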


Fill your Data Warehouse:
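
One common way to move data from S3 into Redshift is the `COPY` command. The sketch below only builds the statement and, optionally, runs it through psycopg2 (Redshift accepts PostgreSQL drivers). It assumes gzipped CSV files with a header row and an IAM role attached to the cluster; the table name, S3 prefix, and role ARN in the usage lines are hypothetical.

```python
def build_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for gzipped CSV files with a header row."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "GZIP\n"
        "IGNOREHEADER 1;"
    )


def run_copy(dsn: str, copy_sql: str) -> None:
    """Execute the COPY on Redshift (needs psycopg2 and network access)."""
    import psycopg2  # lazy import: only needed when actually loading

    # The connection context manager commits on success, rolls back on error.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(copy_sql)


copy_sql = build_copy_sql(
    "staging.users",                                  # hypothetical table
    "s3://my-data-bucket/raw/crm/2015/03/07/",        # hypothetical prefix
    "arn:aws:iam::123456789012:role/redshift-load",   # hypothetical role
)
```

Pointing `COPY` at a key prefix rather than a single file lets Redshift load all matching files in parallel.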

Build an Abstraction Layer:
  • Talk to your users and ask them what results they need. Dig into documentation, definitions, and Excel spreadsheets if you must. Take control of the definitions and put them on your wiki as soon as you can.
  • Create aggregated tables / views on your Data Warehouse. These act as a first layer of caching that minimizes heavy lifting on the raw data.
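
To make the idea of an aggregated table concrete, here is a small sketch. It uses Python's built-in `sqlite3` purely as a local stand-in so the example runs anywhere; on Redshift a `CREATE TABLE ... AS SELECT` of the same shape would be sent through a PostgreSQL driver instead. The `orders` and `daily_revenue` tables are hypothetical.

```python
import sqlite3

# Pre-aggregate raw orders into one row per day, so reports read the
# small table instead of scanning the raw data every time.
AGGREGATE_SQL = """
CREATE TABLE daily_revenue AS
SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
FROM orders
GROUP BY order_date
"""


def build_daily_revenue(conn: sqlite3.Connection) -> list:
    """Rebuild the aggregate table and return its rows ordered by date."""
    conn.execute("DROP TABLE IF EXISTS daily_revenue")
    conn.execute(AGGREGATE_SQL)
    return conn.execute(
        "SELECT order_date, orders, revenue FROM daily_revenue ORDER BY order_date"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2015-03-01", 10.0), ("2015-03-01", 5.0), ("2015-03-02", 7.5)],
)
rows = build_daily_revenue(conn)
```

Rebuilding the table on a schedule (drop and recreate, as here) is the simplest refresh strategy; incremental updates are possible but need more bookkeeping.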

Give data to your users:
  • From Redshift to RDS: use the PostgreSQL dblink extension.
  • RDS will act not only as a second layer of caching, but also as a Data Mart for your users: provide exactly what they need, and don't flood them with everything you have.
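
As a rough sketch of the dblink step, the helper below builds a statement that could be run on an RDS PostgreSQL instance (after a one-time `CREATE EXTENSION dblink;`) to pull a query result from Redshift into a materialized view. This is one possible arrangement, not necessarily the exact setup used here; the view name, host, and column list are hypothetical, and credentials are left out of the connection string.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_mart_sql(view: str, connstr: str, remote_query: str, columns: dict) -> str:
    """Build a materialized view on RDS that reads from Redshift via dblink.

    dblink returns untyped records, so the column names and types of the
    remote result must be spelled out in the AS t(...) clause.
    """
    column_defs = ", ".join(f"{name} {sqltype}" for name, sqltype in columns.items())
    return (
        f"CREATE MATERIALIZED VIEW {view} AS\n"
        f"SELECT * FROM dblink({sql_literal(connstr)}, {sql_literal(remote_query)})\n"
        f"AS t({column_defs});"
    )


mart_sql = build_mart_sql(
    "mart_daily_revenue",  # hypothetical view name
    "host=example-cluster.redshift.amazonaws.com port=5439 dbname=dw user=mart_reader",
    "SELECT order_date, revenue FROM daily_revenue WHERE region = 'APAC'",
    {"order_date": "date", "revenue": "numeric"},
)
```

Refreshing the view on a schedule (`REFRESH MATERIALIZED VIEW mart_daily_revenue;`) keeps the Data Mart current without users ever querying Redshift directly.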