Friday, June 13, 2014

Behind the Scenes: A Look into Paper.li's Data Ingestion Pipeline

Today we're excited to unveil a new series of blog posts that describe how Paper.li works from a technological standpoint. These posts will cover the technology, components and processes that take place behind the scenes before a paper goes live. Although this series is technical in nature, we'll do our best to keep it understandable and interesting for all.

This series will be composed of multiple articles:
- How we ingest and process social content data (today's article, the Input Pipeline)
- How we generate paper editions
- How we serve papers to users and curators
- Which technologies we use for our key components: storage, search, messaging, configuration management

There is a lot of cool stuff happening behind the scenes before a new edition is published. If this series doesn't answer all of your questions or cover the topics you would like to learn more about, please feel free to make a suggestion!

Today we will cover how our backend works and more specifically how we process social data.

At Paper.li, we are driven by the data we extract and analyze. Every day, we analyze around 250 million posts on social media and extract the articles they reference, surfacing valuable information for our publishers.
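To give a flavor of that extraction step, here is a minimal sketch of pulling referenced article links out of a post's text. This is an illustration only, not our production code; the regex and function name are invented for this example, and a real pipeline would also resolve URL shorteners and deduplicate links.

```python
import re

# Naive pattern for http(s) links in post text. A production system
# would also unshorten t.co/bit.ly links before deduplicating.
URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')

def extract_article_links(post_text):
    """Return the candidate article URLs referenced in a social post."""
    return URL_PATTERN.findall(post_text)

post = "Great read on stream processing https://example.com/kafka-101 #data"
print(extract_article_links(post))  # ['https://example.com/kafka-101']
```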

The main goal of our backend is to allow our users to easily create daily automated content digests capturing the best of what is happening on the social web. Paper.li's data ingestion and publishing can be represented by two main pipelines, each of them event-driven thanks to the Apache Kafka distributed messaging system (a minimal sketch follows the list below):
- The first channel is the Input pipeline, whose role is to keep user-specified sources up to date and crawl the web for articles
- The Publishing pipeline, which publishes timely digests for our users, will be covered in a future article
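To make the event-driven part concrete, here is a minimal sketch of what a pipeline stage wired through Kafka can look like, using the kafka-python client. The topic names, consumer group, and message shape are hypothetical, chosen for illustration; they are not our actual topology.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical topic names for illustration only.
IN_TOPIC = 'raw-social-posts'
OUT_TOPIC = 'extracted-articles'

consumer = KafkaConsumer(
    IN_TOPIC,
    bootstrap_servers='localhost:9092',
    group_id='article-extractor',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# Each stage consumes one event, does its work, and emits new events
# for the next stage, so stages can scale and fail independently.
for message in consumer:
    post = message.value
    for url in post.get('urls', []):
        producer.send(OUT_TOPIC, {'url': url, 'source_post': post['id']})
```

Because every hand-off goes through a topic, a slow or failed stage doesn't block upstream producers, and new consumers can be added to a topic without touching the existing code.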

