
It started off as an attempt to analyse some data stored in Apache Kafka using R, and ended up becoming the start of an R package to interact with Confluent's REST Proxy API.

While rkafka already allows creating a producer and a consumer from R, writing some R functions interfacing with Confluent's REST Proxy was an interesting way to learn a bit more about Kafka's inner workings, and to demonstrate how easy it is to interact with any REST API from R thanks to httr.
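To give a flavour of how little httr code such an interaction takes, here is a minimal sketch that lists the topics known to a locally running REST Proxy, assuming it listens on its default port 8082:

library(httr)
library(jsonlite)

proxy <- "http://localhost:8082"

# List the topics known to the cluster.
resp <- GET(paste0(proxy, "/topics"),
            accept("application/vnd.kafka.v2+json"))
stop_for_status(resp)
fromJSON(content(resp, as = "text", encoding = "UTF-8"))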

The result is available to clone on my git server.

Posted Fri Sep 14 21:53:53 2018 Tags:

Working in analytics these days, I find the concept of big data firmly established. Smart engineers have been developing cool technology to work with it for a while now. The Apache Software Foundation has emerged as a hub for many of these projects - Ambari, Hadoop, Hive, Kafka, Nifi, Pig, Zookeeper - the list goes on.

While I'm mostly interested in improving business outcomes by applying analytics, I'm also excited to work with some of these tools to make that easier.

Over the past few weeks, I have been exploring some of these tools, installing them on my laptop or a server and giving them a spin. Thanks to Confluent, the company founded by Kafka's original developers, it is super easy to try out Kafka, Zookeeper, KSQL and their REST API. They all come in a pre-compiled tarball which just works on Arch Linux. (After trying to compile some of these myself, this is no luxury - these apps are very interestingly built...) Once unpacked, all it takes to get started is:

./bin/confluent start

I also spun up an instance of Nifi, which I used to monitor a (JSON-ised) apache2 webserver log. Every new line added to that log is sent as a message to Kafka.

Apache Nifi configuration

A processor monitoring a file (tailing) copies every new line over to another processor, which publishes it to a Kafka topic. The TailFile processor includes options for rolling filenames and for what delineates each message. I set it up to process a custom logfile from my webserver, defined to produce JSON messages instead of the standard logfile output, which is somewhat cumbersome to process (defined in apache2.conf, enabled in the webserver conf):

LogFormat "{ \"time\":\"%t\", \"remoteIP\":\"%a\", \"host\":\"%V\", \"request\":\"%U\", \"query\":\"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\", \"size\":\"%O\" }" leapache

All the hard work is being done by Nifi. (Something like

tail -F /var/log/apache2/access.log | kafka-console-producer.sh --broker-list localhost:9092 --topic accesslogapache

would probably be close to the CLI equivalent on a single-node system like my test setup, with the -F option ensuring log rotation doesn't break things. I'm not sure how the message demarcator would need to be configured.)

The above results in a Kafka message stream in which every request hitting my webserver is available in real time for further analysis.
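Reading those messages back does not require the Kafka client libraries either - a rough sketch using the REST Proxy consumer API from R, where the consumer group and instance names are made up for the example:

library(httr)
library(jsonlite)

proxy <- "http://localhost:8082"

# Create a consumer instance reading JSON messages from the beginning of the topic.
resp <- POST(paste0(proxy, "/consumers/accesslog_readers"),
             content_type("application/vnd.kafka.v2+json"),
             body = toJSON(list(name = "rconsumer",
                                format = "json",
                                `auto.offset.reset` = "earliest"),
                           auto_unbox = TRUE))
consumer <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Subscribe it to the topic Nifi publishes to.
POST(paste0(consumer$base_uri, "/subscription"),
     content_type("application/vnd.kafka.v2+json"),
     body = toJSON(list(topics = list("accesslogapache")), auto_unbox = TRUE))

# Fetch a batch of records, then clean up the consumer instance.
records <- GET(paste0(consumer$base_uri, "/records"),
               accept("application/vnd.kafka.json.v2+json"))
fromJSON(content(records, as = "text", encoding = "UTF-8"))
DELETE(consumer$base_uri, content_type("application/vnd.kafka.v2+json"))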

Posted Tue Sep 11 21:09:06 2018 Tags:

It appears to me the cross-industry standard process for data mining (CRISP-DM) is still, more than two decades after it was first formulated, a valuable framework to guide the management of a data science team. Start by building business understanding, follow with understanding the data and preparing it, move on to modelling to solve the problem and evaluating the model, and end by deploying it. The framework is iterative, and allows for back-and-forth between these steps based on what's learned in the later ones.

CRISP-DM

It doesn't put too great an emphasis on scheduling the activities, focusing instead on value creation.

The Observe-Orient-Decide-Act (OODA) loop from John Boyd seems to be an analogous concept. Competing businesses would then be advised to speed up their cycling through the CRISP-DM loop, as that's how Boyd stated advantage is obtained - by cycling through the OODA loop more quickly than one's opponent. Most interestingly, in both loops it's a common pitfall to skip the last step - deploying the model / acting.

OODA loop

(Image by Patrick Edwin Moran - Own work, CC BY 3.0)

Posted Tue Jun 19 20:51:36 2018 Tags:

I have been asked a few times recently about my management style. First, while applying for a position myself. Next, less expectedly, by a member of the org I joined, as well as by a candidate I interviewed for a position in the team.

My answer was not very concise, as I lacked the conceptual framework to make it so.

Today, I believe I have stumbled on a description of the style I practice (or certainly aim to) most often, on Adam Drake's blog. Its name? Mission Command. (The key alternative being detailed command.)

Now this is an interesting revelation for more than one reason. I consider it a positive thing that I can now more clearly articulate how I naturally tend to work as a team leader. Reviewing the key principles also makes clear what is important to me:

  • Build cohesive teams through mutual trust.
  • Create shared understanding.
  • Provide a clear commander’s intent.
  • Exercise disciplined initiative.
  • Use mission orders.
  • Accept prudent risk.

Reviewing these principles in detail makes clear this style of leadership should not be mistaken for laissez-faire. Providing clear commander's intent, creating shared understanding and using mission orders are very active principles for the leader. For the subordinate, the need to exercise disciplined initiative is clearly not a free-for-all either. The need for mutual trust for this to work cannot be emphasised enough.

Posted Wed Feb 14 15:38:33 2018 Tags:

Dries Buytaert wrote last week about intending to use social media less in 2018. As an entrepreneur developing a CMS, he has a vested interest in preventing the world from coming to see the internet as being just Facebook, Instagram or Twitter (or maybe in reversing that current state). Still, I believe he is genuinely concerned about the effect of using social media on our thinking, partly because I share the observation.

Despite having been an early adopter, I disabled my Facebook account a year or two ago already. I'm currently in doubt whether I should not do the same with Twitter. I notice it actually is not as good a source of news as classic news sites - headlines simply get repeated numerous times when major events happen, and other news is just as easily noticed browsing a traditional website. Fringe and mainstream thinkers alike in the space of management, R stats, computing hardware etc. are a different matter. While, as Dries notices, their micro-messages are typically not well worked out, they do make me aware of what they have blogged about - for those that actually still blog.

So is it a matter of trying to increase my Nextcloud newsreader use, maybe during dedicated reading time, no longer opening the Twitter homepage on my phone at random times throughout the day, and conceding that short statements without a more worked out piece of content behind them are not all that useful?

The above focuses on consuming others' content. To foster conversations, which arguably is also the intent of social media, we might need something like webmentions to pick up steam too.

Posted Mon Jan 8 21:04:09 2018 Tags:

The Internet Archive contains a dataset from the NYC Taxi and Limousine Commission, obtained under a FOIA request. It includes a listing of each taxi ride in 2013, its number of passengers, distance covered, start and stop locations and more.

The dataset is a whopping 3.9 GB compressed, or shy of 30 GB uncompressed. As such, it is quite unwieldy in R.

As I was interested in summarised data for my first analysis, I decided to load the CSV files into a SQLite database, query it using SQL and store the resulting output as CSV files again - far smaller ones though, as I only needed two columns for each day of the year of data.

The process went as follows.

First, extract the CSV file from the 7z-compressed archive:

7z e ../trip_data.7z trip_data_1.csv

and the same for the other months. (As I was running low on disk space, I could only do two months at a time.) Next, import it into a SQLite database:

echo -e '.mode csv \n.import trip_data_1.csv trips2013' | sqlite3 trips2013.db

Unfortunately the header row separates fields with ", ", so the column names end up starting with a space. This does not happen when running the import interactively in the sqlite3 command line - to be determined why. As a result, those column names need to be quoted in the query below.

Repeat this import for all the months - as mentioned, I did two at a time.

Save the output we need in temporary CSV files:

sqlite3 -header -csv trips2013.db 'select DATE(" pickup_datetime"), count(" passenger_count") AS rides, sum(" passenger_count") AS passengers from trips2013 GROUP BY DATE(" pickup_datetime");' > 01-02.csv

Remove the extracted files and the database, and repeat:

rm trip_data_?.csv
rm trips2013.db

Next, I moved on to the actual analysis work in R.
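Combining the exported files into a single daily dataframe might look something like this - a minimal sketch, assuming the summary CSVs produced above sit in the working directory:

library(tidyverse)

# Read all the two-monthly summary files and derive an occupancy rate per day.
daily <- list.files(pattern = "^[0-9]{2}-[0-9]{2}\\.csv$") %>%
  map_dfr(read_csv) %>%
  rename(date = 1) %>%                  # the first column is named DATE(" pickup_datetime")
  mutate(date = as.Date(date),
         occupancy = passengers / rides)

head(daily)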

Looking at the number of trips per day on a calendar heatmap reveals something odd - the first week of August has very few rides compared to any other week. While it's known people in NY tend to leave the city in August, this drop stands out.

Calendar heatmap of trips
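A calendar-style heatmap can be pieced together with ggplot2 and lubridate - a minimal sketch using the daily dataframe from above, not necessarily how the plot shown here was produced:

library(tidyverse)
library(lubridate)

# One tile per day, faceted by month, coloured by the number of trips.
daily %>%
  mutate(month = month(date, label = TRUE),
         wday  = wday(date, label = TRUE, week_start = 1),
         week  = isoweek(date)) %>%
  ggplot(aes(x = week, y = wday, fill = rides)) +
  geom_tile(colour = "white") +
  facet_wrap(~ month, scales = "free_x") +
  labs(x = "ISO week", y = NULL, fill = "Trips")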

Deciding to ignore August altogether, and zooming in on the occupancy rate of the taxis rather than the absolute number of rides, reveals an interesting insight - people travel together far more on weekends and on public holidays!

Occupancy heatmap

Just looking at the calendar heatmap, it's possible to determine 1 Jan 2013 was a Tuesday, and to point out Memorial Day as the last Monday of May, Labor Day in September, Thanksgiving Day and even Black Friday at the end of November, and of course the silly season at the end of the year!

The dataset contains even more interesting information in its geo-location columns, I imagine!

Posted Thu Nov 30 21:36:05 2017 Tags:

Trying to plot the income per capita in Australia on a map, I came across a perfectly good reason to use a spatial query in R.

I had to combine a shapefile of Australian SA3s (Statistical Area Level 3, a concept used under the Australian Statistical Geography Standard) with a dataset of income per postal code. I created a matrix of intersecting postal codes and SA3s, and obtained the desired income per capita by SA3 through a matrix multiplication. If the geographical areas aligned perfectly, using a function like st_contains would have been preferred. Instead I fell back on st_intersects, which can end up assigning a postal code to two different statistical areas. Alternative approaches are welcome in the comments!
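The idea in code looks roughly as follows - a sketch only, where the sf objects postcodes and sa3 and the column names total_income and persons are illustrative, not the names used in the actual project:

library(sf)

# Sparse intersection turned into a 0/1 postal-code-by-SA3 indicator matrix.
overlap <- st_intersects(postcodes, sa3, sparse = FALSE) * 1

# Roll postal-code level figures up to SA3 with a matrix multiplication,
# then divide to get income per capita per SA3.
income_by_sa3  <- t(overlap) %*% postcodes$total_income
persons_by_sa3 <- t(overlap) %*% postcodes$persons
sa3$income_per_capita <- as.numeric(income_by_sa3 / persons_by_sa3)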

As Australia is so vast, and the majority of its people earn a living in a big city, a full map does not show the differences in income per area at this level of detail. Instead, I opted to map some of the key cities in a single image.

Income distribution in major AU cities

The full code is available on my git server for you to clone using git clone git://git.vanrenterghem.biz/R/project-au-taxstats.git.

Posted Thu Nov 16 15:07:59 2017 Tags:

In the Northern hemisphere, it's commonly said women prefer to give birth around summer. It would appear this does not hold for Australia. The graph below actually suggests most babies are conceived over the summer months (December to February) down under!

seasonal subseries plot Australian births by month 1996-2014

In preparing the graph above (a "seasonal subseries plot"), I could not help but notice the spike in the numbers for each month around 2005. It turns out that was real - Australia did experience a temporary increase in its fertility rate. Whether that was thanks to government policy (baby bonus, tax subsidies) or other causes is not known.
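For the curious, a seasonal subseries plot can be produced with the forecast package along these lines - a minimal sketch assuming a dataframe births with one monthly count column starting in January 1996, not the exact code from my git server:

library(forecast)
library(ggplot2)

# Monthly births as a time series, then the seasonal subseries plot.
births_ts <- ts(births$count, start = c(1996, 1), frequency = 12)
ggsubseriesplot(births_ts) + ylab("Births per month")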

Full R code is on my git server. Check it out - there are a few more plots in there already. I might write about these later.

Posted Thu Oct 26 13:06:27 2017 Tags:

How cool is this? A map of Western Australia with all state roads marked in only 5 lines of R!

WARoads <- st_read(dsn = "data/", layer = "RoadNetworkMRWA_514", stringsAsFactors = FALSE)
WALocalities <- st_read(dsn = "data/", layer = "WA_LOCALITY_POLYGON_shp", stringsAsFactors = FALSE)
ggplot(WALocalities) +
  geom_sf() +
  geom_sf(data = dplyr::filter(WARoads, network_ty == "State Road"), colour = "red")

Map of WA state roads

Courtesy of the development version of ggplot2 - geom_sf is not yet available in the version on CRAN.

Posted Tue Oct 17 20:35:40 2017 Tags:

Turns out it is possible, thanks to the good folks at Stamen Design, to get fairly unobtrusive maps based on the OpenStreetMap data.

Combining this with a SLIP dataset from the Western Australian government on active schools in the state provided a good opportunity to check out the recently released sf (Simple Features) package in R.

Simple features are a standardized way to encode spatial vector data. Support in R was added in November 2016. For users of the tidyverse, this makes manipulating shapefiles and the like easier, as simple features in R are dataframes!
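Putting the tiles and the SLIP data together might look something like the sketch below - the layer name, bounding box (roughly Perth) and student-count column are illustrative rather than taken from the actual project:

library(ggmap)   # get_stamenmap() fetches the Stamen tiles
library(sf)
library(dplyr)

# Schools layer from SLIP; reproject to lon/lat to match the map tiles.
schools <- st_read(dsn = "data/", layer = "Current_Active_Schools",
                   stringsAsFactors = FALSE) %>%
  st_transform(4326)

# Being dataframes, sf objects are easy to turn into plain points for plotting.
schools_df <- schools %>%
  mutate(lon = st_coordinates(.)[, 1],
         lat = st_coordinates(.)[, 2]) %>%
  st_set_geometry(NULL)

basemap <- get_stamenmap(bbox = c(left = 115.6, bottom = -32.4,
                                  right = 116.2, top = -31.6),
                         zoom = 10, maptype = "toner-lite")

ggmap(basemap) +
  geom_point(data = schools_df,
             aes(x = lon, y = lat, size = total_students),
             colour = "red", alpha = 0.5)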

Plot of secondary schools by student population

Full details in git.

Posted Thu Oct 12 21:30:51 2017 Tags:

This blog is powered by ikiwiki.