Data Preparation for Process Mining — Part II: Timestamp Headaches and Cures
6 min read
This is a guest blog post originally written for Fluxicon and reprinted here. The original post can be seen on the fluxicon blog.
Timestamps are core to any process mining effort. However, complex real-world datasets frequently present a range of challenges in analyzing and interpreting timestamp data. Sloppy system implementations often create a real mess for a data scientist looking to analyze timestamps within event logs. Fortunately, a few simple techniques can tackle most of the common challenges one will face when handling such datasets.
In this post I’ll discuss a few key points relating to timestamps and process mining datasets, including:
- Reading timestamps with code
- Useful time functions (time shifts and timestamp arithmetic)
- Understanding the meaning of timestamps in your dataset
Note that in this post all code samples will be in Python, although the concepts and similar functions will apply across just about any programming language, including various flavors of SQL.
Reading timestamps with code
As a data type, timestamps present two distinct challenges:
- The same data can appear in many different formats
- Concepts like time zones and daylight savings time mean that the same point in real time can be represented by entirely different numbers
To a computer time is a continuous series. Subdivisions of time like hours, weeks, months and years are formatted representations of time displayed for human users. Many computers base their understanding of time on so called Unix time, which is simply the number of seconds elapsed since the 1st of January 1970. To a computer using Unix time, the timestamp of 10:34:35pm UTC April 7, 2015 is 1428446075. While you will occasionally see timestamps recorded in Unix time, it’s more common for a more human-readable format to be used.
Converting from this human readable format back into something that computers understand is occasionally tricky. Applications like Disco are often quite good at identifying common timestamp formats and accurately ingesting the data. However, if you work with event logs you will soon come across a situation where you’ll need to ingest and/or combine timestamps containing unusual formats. Such situations may include:
- Combining event logs from multiple sources
- Standardizing event logs onto one time zone
- Performing calculations based on event logs
- Ingesting data into another system (e.g., loading it into a database)
The following scenario is typical of what a data scientist might find when attempting to complete process mining on a complex dataset. In this example we are assembling a process log by combining logs from multiple systems. One system resides in New York City and the other in Phoenix, Arizona. Both systems record event logs in the local time. Two sample timestamps appear as follows:
System in New York City: 10APR2015 23.12.17:54
System in Phoenix Arizona: 10APR2015 20.12.18:72
Such a situation presents a few headaches for a data scientist looking to use such timestamps. Particular issues of concern are:
- The format of the timestamps is non-standard
- Both systems are run in local time rather that UTC
- The systems are in different time zones (US-Eastern and US-Mountain)
- New York uses daylight savings time whereas Arizona does not
You can see how this can all get quite complicated very quickly. In this example we may want to write a script that ingests both sets of logs and produces a combined event log for analysis (e.g., for import into Disco). Our primary challenge is to handle these timestamp entries.
Ideally all system admins would be good electronic citizens and run all their systems logging functions in UTC. Unfortunately, experience suggests that this is wishful thinking. However, with a bit of code it’s easy to quickly standardize this mess onto UTC and then move forward with any datetime analytics from a common and consistent reference point.
First we need to get the timestamps into a form recognized by our programming language. Most languages have some form of a ‘string to datetime’ function. Using such a function you provide a datetime string and format information to parse this string into its relevant datetime parts. In Python, one such function is strptime.
We start by using strptime to ingest these timestamp strings into a Python datetime format:
# WE IMPORT REQUIRED PYTHON MODULES (you may need to install these first) import pytz import datetime # WE IMPUT THE RAW TEXT FROM EACH TIMESTAMP ny_date_text="10APR2015 23.12.17:54" az_date_text="10APR2015 20.12.26:72" # WE CONVERT THE RAW TEXT INTO A NATIVE DATETIME # e.g., %d = day number and %S = seconds ny_date = datetime.datetime.strptime(ny_date_text, "%d%b%Y %H.%M.%S:%f") az_date = datetime.datetime.strptime(az_date_text, "%d%b%Y %H.%M.%S:%f") # WE CHECK THE OUTPUT, NOTE THAT FOR A NATIVE DATETIME NO TIMEZONE IS SPECIFIED print(ny_date) >>> 2015-04-10 23:12:17.540000
At this point we have the timestamp stored as a datetime value in Python; however, we still need to address the time zone issue. Currently our timestamps are stored as ‘native’ time, meaning that there is no time zone information stored. Next we will define a timezone for each timestamp and then convert them both to UTC:
# WE DEFINE THE TWO TIMZEONES FOR OUR DATATYPES # NOTE: ‘ARIZONA’ TIMEZONE IS ESSENTIALLY MOUNTAIN TIME WITHOUT DAYLIGHT SAVINGS TIME tz_eastern = pytz.timezone('US/Eastern') tz_mountain = pytz.timezone('US/Arizona') # WE CONVERT THE LOCAL TIMESTAMPS TO UTC ny_date_utc = tz_eastern.localize(ny_date, is_dst=True).astimezone(pytz.utc) az_date_utc = tz_mountain.localize(az_date, is_dst=False).astimezone(pytz.utc) # WE PRINT CHECK THE OUTPUT, NOTE THAT THE TIMEZONE OF +0 IS ALSO NOW RECORDED print(ny_date_utc) >>> 2015-04-11 03:12:17.540000+00:00 print(az_date_utc) >>> 2015-04-11 03:12:26.720000+00:00
Now we have both timestamps recorded in UTC. In this sample code we manually inputted the timestamps as text strings and then simply printed the results to a terminal screen. An example of a real-world application would be to leverage the functions above to read in raw data from a database for both logs, process the timestamps into UTC and then write the corrected log entries into a new table containing a combined event log. This combined log could then be subjected to further analytics.
Useful time functions
With timestamps successfully imported, there are several useful time functions that can be used to further analyze the data. Among the most useful are time arithmetic functions that can be used to measure the difference between two timestamps or add/subtract a defined period of time to a timestamp.
As an example, let’s find the time difference between the two timestamps imported above:
# WE COMPARE THE DIFFERENCE IN TIME BETWEEN THE TWO TIMESTAMPS timeDiff = (az_date_utc - ny_date_utc) print(timeDiff) >>> 0:00:09.180000 The raw output here reads a time difference of 9 seconds and 18 milliseconds. Python can also represent this in rounded integer form for a specified time measurement. For example: # WE OUTPUT THE ABOVE AS AN INTEGER IN SECONDS print(timeDiff.seconds) >>> 9
This shows us that the time difference between the two timestamps is 9 seconds. Such functions can be useful for quickly calculating the duration of events in an event log. For example, the total duration of a process could be quickly calculated by comparing the difference between the earliest and latest timestamp for a case within a dataset.
These date arithmetic functions can also be used to add or subtract defined periods of time to a timestamp. Such functions can be useful when manually adding events to an event log. For example, the event log may record the start time of an automated process, but not the end time. We may know that the step in question takes 147 seconds to complete (or this length may be recorded in a separate log). We can generate a timestamp for the end of the step by adding 147 seconds to the timestamp for the start of the step:
# WE ADD 147 SECONDS TO OUR TIMESTAMP AND THEN OUTPUT THE NEW RESULT az_date_utc_end = az_date_utc + datetime.timedelta(seconds=147) print(az_date_utc_end) >>> 2015-04-11 03:14:53.720000+00:00
Understanding the meaning of timestamps in your dataset
Having the data cleaned up and ready for analysis is clearly important, but equally important is understanding what data you have and what it means. Particularly for data sets that have a global geographic scope, it is crucial to first determine how timestamps have been represented in the data. Relative to timestamps in your event logs some key questions you should be asking are:
- How was my dataset generated? (e.g. has the data been pre-processed from multiple systems and what steps were taken to standardize it?)
- Are all timestamps standardized to a single time zone?
- What are the expected hours of activity for each geography and does the data confirm this?
- What triggers the creation of the timestamp in your log? (e.g., is it an automated process or is it triggered by a human pushing a button?)
- Does the timestamp represent the beginning, middle or end of a particular step?
- If the timestamp is recorded because of a human action, does that action always take place at the same point in the process? (e.g., do some users record data in a system at the beginning of a step while others wait until the end of that same step?)
- What does the time between two adjacent timestamps represent? (e.g., does this time represent work occurring, or a pause waiting for work to begin again?)
- Are there additional datasets that can be used to add additional detail to these gaps in the timestamps? (e.g., in the example above where a separate log contains the duration of a particular step)
While this piece was hardly an exhaustive look at programmatically handling timestamps, hopefully you’ve been able to see how some simple code is able to deal with the more common challenges faced by a data scientist working with timestamp data. By combining the concepts described above with a database it is possible to write an automated script to quickly ingest a range of complex event logs from different systems and output one standardized log in UTC. From there, the process mining opportunities are endless.