Using Chrome Traces to Automate Rendering Performance - aka how browser-perf works

Cross post from calendar.perfplanet.com

Ten years ago, increasing the performance of a website usually meant tweaking the server side code to spit out responses faster. Web Performance engineering has come a long way since then. We have discovered patterns and practices that make the (perceived) performance of websites faster for users just by changing the way the front end code is structured, or tweaking the order of elements on a HTML page. Majority of the experiments and knowledge has been around delivering content to the user as fast as possible.
Today, web sites have grown to become complex applications that offer the same fidelity as applications installed on computers. Thus, consumers have also started to compare the user experience of native apps to the web applications. Providing a rich and fluid experience as users navigate web applications has started to play a major role in the success of the web.
Most modern browsers have excellent tools that help measure the runtime performance of websites. The Chrome Devtools features a powerful Timeline panel that gives us the tracing information needed to diagnose performance problems while interacting with a website. Metrics like frame rates, paint and layout times present an aggregated state of the website’s runtime performance from these logs.
In this article, we will explore ways to collect and analyze some of those metrics using scripts. Automating this process can also help us integrate these metrics into the dashboards that we use to monitor the general health of our web properties. The process usually consists of two phases – getting the tracing information, and analyzing it to get the metrics we are interested in.

Collecting Tracing information

The first step to get the metrics would be to collect the performance trace from chrome while interacting with the website.

Manually recording a trace

The simplest way to record a trace would be to hit the “start recording” button in the timeline panel, performing interactions like scrolling the page or clicking buttons, and then finally hitting “stop recording”. Chrome process all this information and shows graphs with framerates. All this information can also be saved as a JSON file by right-clicking (or Alt-clicking) on the graph and selecting the option to save the timeline.

Using Selenium

While the above method is the most common way to collect tracing from Chrome, doing these repeatedly over multiple deployments or for different scenarios can be cumbersome. Selenium is a popular tool that allows us to write scripts that perform interactions like navigating webpages or clicking buttons and we could leverage such scripts to capture trace information.
To start performance logging, we just need to ensure that certain parameters are added to the existing set of capabilities.
Selenium scripts have binding for multiple programming languages and if you already have selenium tests, you could add the capabilities in the following examples to also get performance logs. Since we are only testing Chrome, running just Chromedriver instead of setting up Selenium and then executing the following node script gets the trace logs.

var wd = require('wd');
var b = wd.promiseRemote('http://localhost:9515');

b.init({
    browserName: "chrome",
    chromeOptions: {
        perfLoggingPrefs: {
            "traceCategories": "toplevel,disabled-by-default-devtools.timeline.frame,blink.console,disabled-by-default-devtools.timeline,benchmark"
        },
        args: ["--enable-gpu-benchmarking", "--enable-thread-composting"]
    },
    loggingPrefs: {
        performance: "ALL"
    }
}).then(function() {
    return b.get('http://calendar.perfplanet.com'); // Fetch the URL that you want
}).then(function() {
    // We only want to measure interaction, so getting a log once here
    // Flushes any pervious tracing logs we have
    return b.log('performance');
}).then(function() {
    // Perform custom actions like clicking buttons, etc
    return b.execute('chrome.gpuBenchmarking.smoothScrollBy(2000, function(){})');
}).then(function() {
    // Wait for the above action to complete. Ideally this should not be an arbitraty timeout but be a flag
    return b.sleep(5000);
}).then(function() {
    // Get all the trace logs since last time log('performance') was called
    return b.log('performance');
}).then(function(data) {
    // This data is the trace.json
    return require('fs').writeFileSync('trace.json', JSON.stringify(data.map(function(s) {
        return JSON.parse(s.message); // This is needed since Selenium spits out logs as strings
    })));
}).fin(function() {
    return b.quit();
}).done();

The script above tells selenium to start chrome with performance logs, and enables specific trace event categories. The “devtools.timeline” category is the one used by Chrome devtools for displaying its graphs about Paints, Layouts, etc. The flags “enable-gpu-benchmarking” expose a window.chrome.gpuBenchmarking object that has a method to smoothly scroll the page. Note that the scroll can be replaced by other selenium commands like clicking buttons, typing text, etc.

Using WebPageTest

WebPageTest also roughly relies on the performance log to display metrics like SpeedIndex. As shown in the image, the option to capture the timeline and trace json files has to be enabled.
Image of WebPage Test Interface
Custom actions can also be added to perform actions in a WebPageTest run scenario. Once WebPageTest run finishes, you can click download the trace files by clicking the appropriate links. This method could not only be used in WebPageTest, but also with tools like SpeedCurve that support custom metrics.

Analyzing the Trace Information

The trace file that we get using any of the above methods can be loaded in Chrome to view the page’s runtime performance. The trace event format defines that each record should contain a cat (category), a pid (processId), a name and other parameters that are specific to an event type. These records can be analyzed to arrive at individual metrics. Note that the data obtained using the Selenium Script above has some additional metadata for every record that may need to be scrubbed before it can be loaded in Chrome.

Framerates

To calculate the FrameRates, we could look at events by the name DrawFrame. This indicates that a frame was drawn on the screen and by calcuating the number of frames drawn, divided by the time for the test, we could arrive at the average time per frame. Since we also have benchmarking category enabled, we could look “BenchmarkInstrumentation:*” events that have timestamps associated with them. Chrome’s telemetry benchmarking system uses this data to calculate the average frame rates.

Paints, Styles, Layouts and other events

The events corresponding to ParseAuthorStyleSheet, UpdateLayoutTree and RecalculateStyles are usually attributed to the time spent in Styles. Similarly, the log also contains Paint and Layout events that could be useful. Javascript time can also be calculated using events like FunctionCall or EvaluateScript. We can also add the GC events to this list.

Network Metrics

The tracelogs also have information related to firstPaint etc. However, the ResourceTiming API and the NavigationTiming APIs make this information available on the web page anyway. It may be more accurate to collect these metrics directly from real user deployments using tools like Boomerang, or from WebPageTest.

Considerations

There are a few things to consider while running the tests using the methods mentioned above.

Getting Consistent metrics

Chrome is very smart at trying to draw the page as fast as possible and as a result, you could see variations in the trace logs that are gathered when running tests for a scenario twice. When recording the traces manually, human interaction may introduce differences in the way a mouse is moved or a button is clicked. Even when automating, running separate selenium scripts occurs over the network and executing these scripts is also recorded in the logs.
To get consistent results, all this noise needs to be reduced. It would be prudent to not combine scenarios when recording trace runs. Additionally, running all automation scripts as a single unit would also reduce any noise. Finally, comparing tests runs across multiple deploys is even better since both these runs would have the same test script overhead.

Trace file size

The trace file can take up a lot of space and could also fill up Chrome’s memory. Requesting for the logs from ChromeDriver could in a single, buffered request could also result in Node throwing out of memory exceptions. The recommendation here would be to use libraries like JSONStream to parse the records as a stream. Since we are only interested in aggregating the individual records, streaming can help consume scenarios that are longer and hence take more memory.

Browser Support

The performance logs are available on the recent versions of the Chrome browser, both on the desktop and on Android. I had worked on a commit to make similar trace logs available for mobile Safari. Hybrid applications using Apache Cordova use WebViews which are based on browsers. Hence, this method could also be applied to Apache Cordova apps for Android and iOS.
Note that these performance logs started as being Webkit specific, and hence are not readily available for Firefox or IE yet.

Conclusion

All the techniques described above are packaged together in the browser-perf package. Browser-perf only adds the extra logic to smooth out the edge cases like the scrolling timeout, or ability to pick up specific metrics, or processing the trace file as a stream. I invite you to check out the source code of browser-perf to understand these optimizations.
These metrics can be piped into performance dashboards and watching out for these metrics should give us an idea of general trends of how the web site has been working across multiple commits. This could be one small step to ensure that websites continue to deliver a smooth user experience.
 

React Custom Renderer using Web Workers

TL;DR; A custom renderer for ReactJS that uses Web Workers to run the expensive Virtual DOM diffing calculations - [link]

Update: A newer implementation now runs WebWorkers faster than normal DOM. Blog post  here.


React's fast reconciliation algorithm one of the key pieces that enables an entire web application to be re-rendered on every state change in a performant way. This "diff-ing" logic compares states between the original and changed virtual DOMs and results in an efficient minimal set of instructions to modify the actual DOM on the browser.
In an earlier post, I had run experiments to compare the rendering performance of different JavaScript frameworks. React was one of the faster JavaScript frameworks and the tests showed an interesting insight into how React works during a render cycle. While other frameworks are busy with paints/layouts or modifying the DOM nodes, React spends most of its time in Javascript events, possibly running the diffing algorithms. For a smooth website running at 60 frames per second, each frame needs to be computed in less than 16ms. Translating this into React's case, each reconciliation and subsequent DOM modification should happen in that short time frame.
Recently, ReactNative has also showed how separating the UI thread from the Virtual DOM computations can make applications fast. I wanted to see if we could achieve similar performance gains for web pages by making the web UI DOM updates backed by a parallel DOM-diffing computation set that happens in a separate thread (aka web worker). An open issue about this idea exists in the React issue tracker and I hope to present data that could validate (or invalidate) the hypothesis that web workers would make React even faster.
The resulting solution should also have the constraint that it could be a drop-in replacement in the usual react-dom that runs in the main thread.

Demo

Here is a demo of the DBMonster application rendered using Web Workers, compared to normal React rendering. Each case shows the frame rates with the number of DOM nodes being rendered increasing.


As the video shows, the effect of the web workers starts to show as the number of nodes increases. You can run the sample app using the instructions in the README file or look at the gh-pages to see it in action. 
The repository also contains the custom renderer that is based on react-blessed and react-titanium. This custom renderer can be tried out in other projects. Note that it does not yet implement events.

Observations

Batching 
The app tested may not be a typical ReactJS app, but it does stretch the Virtual DOM diffing to its limits. While it may seem that separating UI thread and the Virtual DOM computations should always make the process faster, there is a cost associated with the postMessage call that is used for communicating between the Web Worker and the UI thread. Infact, this cost is significant enough to make the process much slower if we called "postMessage" for every DOM mutation.
Batching these calls seems to help a lot, and I ran a couple of experiments to see how different batch sized may impact the speed of updates.
All the frame rates was collected using browser-perf.



In the graph, y axis represents frames per second, while x axis is the number of nodes rendered. Each line indicated the batch size - eg. batch-1000 means that every 1000 DOM modification messages were sent per postMessage call.

While calling postMessage on every call is slow, making the batch size too large also seems to be bad. Increasing the batch size to a very high number increases the number of DOM operations sent to the UI thread, and this high number of DOM operations slows down the render speed.
The best approach seems to be a hybrid one. We could create a feedback loop that would tell the worker, the time it took for a batch of DOM updates to run. This information would in turn be used by the worker to estimate the number of updates in the next batch, thus trying to achieve a situation where do not take more than 16ms per batch or leave any free time on the table.
This strategy could fail when the reconciliation algorithm generates more updates than the DOM can handle, creating a backlog. One way to fix this problem is to discard old render calls corresponding to previous states since React needs to render the app in its latest state only. 

Multiple Top Level Components
Most real applications have only one top level component. Out of theoretical curiosity, I ran this experiment with multiple top level components to see if more that 2 workers are better than one. The theory is that the VirtualDOM tree could be split and the "diffs" calculated concurrently.
The single UI thread to which all the DOM updates are sent eventually becomes the bottleneck and careful batching and discarding old renders seems to be the only way such a parallelization could help. Here is a graph that show how it all shaped up.



In the legend, Normal-2 indicated normal react-dom with 2 top level components while worker-3 indicated web workers with 3 top level components.
As shown in the graph, 3 top level worker-based updates are slower than 2 top level worker based updates. Also, 3 normal react updates are not as bad comparatively.

RoadMap

For the purposes of this experiment on performance, I have not yet implemented events. For any real app to use this renderer, we would need events. Events are tricky since they are not processed synchronously and thus cannot call methods like "preventDefault" or event propogation. Events will have to be treated like how they are with React Native.This renderer would definitely work better with project like react-native-web where the concepts of events is closed to what this project has.
I have only implemented a based ReactComponent and would also need to implement other more complex components that React-Dom implements. Finally, this renderer also needs a better way for distribution since there are now 2 separate files that have to be included in the application.

Conclusion

The experiments above show that while Web Workers have potential, they may only benefit the apps that have many nodes that need to be compared and result in a smaller number of nodes that need to be changed. I believe that converting React to workers may not benefit all classes of applications as much, but having a separate renderer that uses web workers would definitely cater to the classes of applications that have a lot of components.