
I wanted to track topics across a wide range of social media platforms in real time. Most options I looked at were expensive and limited in scope to a few social media sources. Then I found this great AWS blog to get started. Thanks to the open-source (and powerful!) Elasticsearch and Kibana, the overall project took me about 10 hours end to end, and voilà! I had a streaming social media platform up and running with over 30 million records per month.
The overall architecture looks something like this:
Start by following all of the steps in the AWS blog as described, then tweak as needed. For example, I wanted to make changes to the Elasticsearch mappings in the Twitter streamer. Since ES mappings can be tricky, it's worth checking out some other examples; my code is here.
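To give a flavor of what an ES mapping looks like, here is a minimal sketch of a tweet mapping. The field names and date format below are my illustrative assumptions, not the exact ones from my repo:

```python
import json

# Illustrative Elasticsearch mapping for tweets (ES 5.x-era syntax, matching
# the AWS blog's setup). Field names here are assumptions for illustration.
tweet_mapping = {
    "mappings": {
        "tweet": {
            "properties": {
                "text": {"type": "text"},
                # Twitter's created_at format; adjust to your stream's data
                "created_at": {"type": "date",
                               "format": "EEE MMM dd HH:mm:ss Z yyyy"},
                # keyword = exact-match, useful for Kibana aggregations
                "user_name": {"type": "keyword"},
                # geo_point enables Kibana map visualizations
                "coordinates": {"type": "geo_point"},
            }
        }
    }
}

# This JSON body would be sent when creating the index, e.g.:
#   curl -XPUT https://<es-endpoint>/tweets -d @mapping.json
print(json.dumps(tweet_mapping, indent=2))
```

The main things to get right are `keyword` vs `text` (aggregations vs full-text search) and declaring `geo_point` up front, since you can't change a field's type after documents are indexed.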
At this stage I’m assuming you have completed the above set up successfully and have:
If you have set up Twitter as per the AWS blog, you may want to make some tweaks and add some more social media sources.
Webhose is a great all-in-one, near-real-time data source for news and blog feeds. The steps to set it up are:
$ sudo apt-get install tmux
git clone https://github.com/silaseverett/aws-elk-data-stream.git
Go to https://webhose.io/dashboard and scroll down to the API key. Copy it, then update confighose.py with your webhose API key. Check out the webhose API playground for making the query string that fits your needs.
$ cd webhose
$ vi confighose.py
Then paste your webhose API token.
$ virtualenv ~/environments/my_env
$ tmux
$ source ~/environments/my_env/bin/activate
(my_env) $ python webhoseio_producer.py
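Under the hood, the producer simply polls webhose.io and forwards each post to Kinesis Firehose as newline-delimited JSON. A minimal sketch of that record formatting (the stream name and loop in the comments are my assumptions; the real code is in webhoseio_producer.py):

```python
import json

def to_firehose_record(post):
    """Serialize one webhose post as newline-delimited JSON, the framing
    Firehose needs so downstream Lambda/ES can split individual records."""
    return {"Data": json.dumps(post) + "\n"}

# The producer loop looks roughly like this (sketch only; the webhoseio and
# boto3 calls are commented out so this snippet stays self-contained):
#
#   output = webhoseio.query("filterWebContent", query_params)
#   for post in output["posts"]:
#       firehose.put_record(DeliveryStreamName="my-stream",  # assumed name
#                           Record=to_firehose_record(post))
#   webhoseio.get_next()  # page through results

sample = {"title": "example", "language": "english"}
record = to_firehose_record(sample)
print(record["Data"])
```

The trailing newline per record is the important detail: without it, Firehose concatenates posts and the downstream parser sees one giant malformed document.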
The official webhose Python client code used by the producer can be found here.
Here is some basic guidance for when you want to modify the analytics platform you have just built. For example, you might want to (1) change the search terms used to filter web documents into Elasticsearch and Kibana, and (2) perform basic maintenance in cases where the tool needs to be restarted.
Modifying the search-term filters requires logging on to the EC2 instance, stopping the message producers (Twitter and Webhose.io), then opening and modifying the producer files. So first log on to the EC2 instance, following the log-on instructions above.
$ cd twitter-streaming-firehose-nodejs
$ vim config.js
Vim is a classic, if somewhat archaic, editor, but it's the one built into Ubuntu. Scroll down to the bottom of the file and you will see the 'terms'. To make modifications:
hit 'i' key for insert
use arrow keys to navigate to the "terms" section
make changes
hit 'esc' then a colon ':'
then enter 'wq' to write to file and quit vim
(if no changes are desired enter 'q' instead of 'wq' to exit vim)
then hit return
$ cd webhose
$ vim confighose.py
You’ll find the terms in the ‘query_params’ dictionary at the top of the file. Set the query params to match the output of the webhose.io API playground’s integrate box for Python. Since we are now in vim:
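For reference, the ‘query_params’ dictionary is shaped like the Python snippet the playground generates. A hedged example (the search terms are placeholders; substitute your own):

```python
# Illustrative query_params for confighose.py, shaped like the Python snippet
# the webhose.io API playground's integrate box generates. The terms below
# are placeholders, not my actual filters.
query_params = {
    # Boolean query string: quoted phrase plus field filters
    "q": "\"machine learning\" language:english site_type:news",
    # Return results ordered by crawl time for near-real-time streaming
    "sort": "crawled",
}

print(query_params["q"])
```

Anything the playground supports (site filters, sentiment, thread fields) just becomes part of the `q` string, so you rarely need to touch other keys.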
hit 'i' key for insert
use arrow keys to navigate to the "query_params" at the top
make changes
hit 'esc' then a colon ':'
then enter 'wq' to write to file and quit vim
(if no changes are desired enter 'q' instead of 'wq' to exit vim)
then hit return
Log on to EC2 instance following directions above.
$ tmux list-sessions
$ tmux attach-session -t 0
ctrl 'b'
hit down arrow once
ctrl 'c'
type 'exit'
$ cd twitter-streaming-firehose-nodejs
$ node twitter_stream_producer_app
ctrl 'b'
then up arrow
and detach tmux:
$ tmux detach
$ tmux list-sessions
$ tmux attach-session -t 0
ctrl 'b'
hit down arrow twice
(my_env) $ fg %1
ctrl 'c'
(my_env) $ cd webhose
(my_env) $ python webhoseio_producer.py
ctrl 'b', then up arrow
and detach tmux:
$ tmux detach
***Note: the webhose.io producer runs from a Python 2.7 virtual environment. To activate the env:
$ source ~/environments/my_env/bin/activate
You will see (my_env) in the front of the command prompt when activated.