Chatterbox Part 11 — Scraping memory dump in Chrome with Chrome Debugging Protocol
https://blog.adamfurmanek.pl/2022/03/05/chatterbox-part-11/
Sat, 05 Mar 2022 09:00:31 +0000

This is the eleventh part of the Chatterbox series. For your convenience, you can find the other parts in the table of contents in Part 1 – Origins.

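Today we are going to scrape a memory dump captured with the Chrome Debugging Protocol and Puppeteer. Some reading before moving on might be helpful.

Let’s start a Node REPL and load Puppeteer and heapsnapshot-parser. The snippets in this post assign their results to globals (b, p, c), so run them one at a time and let each promise resolve before typing the next line: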

const puppeteer = require('puppeteer');
const parser = require('heapsnapshot-parser');

puppeteer.launch({headless: false, devtools: true, userDataDir: "SomeProfileDirectory", ignoreDefaultArgs: ["--disable-extensions"], args: ["--enable-remote-extensions", "--disable-web-security"]}).then(br => b = br);
b.newPage().then(pa => p = pa);

Go to example.com (or any other page) and execute this snippet in the console window:

window.someObject = {
	someText: "Some Text here",
	someNumber: 12345678,
	someArray: ["Array element 1", "Array element 2"],
	someBoolean: true
};

We create an object with different properties. We now want to capture the memory dump, find the object and examine its content.

First, we create the Chrome Debugging Protocol session:

p.target().createCDPSession().then(cd => c = cd);

Now we need to take the dump. It’s delivered as a series of chunks so we need to join it manually on our end:

var d = [];
c.on('HeapProfiler.addHeapSnapshotChunk', (data) => {
	d.push(data);
});

c.send('HeapProfiler.takeHeapSnapshot', { reportProgress: false, treatGlobalObjectsAsRoots: false, captureNumericValue: true }).then(() => console.log("Done"));

The important option here is captureNumericValue: without it, the dump will not contain the numbers (integers, doubles).
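The chunk collection and the snapshot request can be wrapped into one helper. The following is only a sketch: takeHeapSnapshot and FakeSession are hypothetical names introduced here, and the helper assumes that all chunks are delivered before the takeHeapSnapshot call resolves, which is the same assumption the snippets above rely on. The fake session stands in for the real CDP session so the helper can be tried without a browser.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: gathers all snapshot chunks from a CDP-like session
// (anything exposing on/off/send) and returns them joined into one string.
async function takeHeapSnapshot(session) {
	const chunks = [];
	const onChunk = ({ chunk }) => chunks.push(chunk);
	session.on('HeapProfiler.addHeapSnapshotChunk', onChunk);
	try {
		// Same parameters as in the post; assumes chunks arrive before this resolves.
		await session.send('HeapProfiler.takeHeapSnapshot', {
			reportProgress: false,
			treatGlobalObjectsAsRoots: false,
			captureNumericValue: true
		});
	} finally {
		// Always detach the listener, even if the request fails.
		session.off('HeapProfiler.addHeapSnapshotChunk', onChunk);
	}
	return chunks.join('');
}

// Hypothetical stand-in for a real session: emits two chunks, then resolves.
class FakeSession extends EventEmitter {
	async send(method) {
		for (const chunk of ['{"snapshot":', '{}}']) {
			this.emit('HeapProfiler.addHeapSnapshotChunk', { chunk });
		}
	}
}

takeHeapSnapshot(new FakeSession()).then(text => console.log(text));
```

With the real session from above, the call would be takeHeapSnapshot(c), and the resulting string could be passed to parser.parse directly.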

We parse the dump after it’s done:

var snapshotFile = d.map(d => d.chunk).join("");
var snapshot = parser.parse(snapshotFile);
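For orientation, the snapshot text is JSON in which nodes and edges are stored as flat integer arrays, with a meta section describing the fields and a shared strings table. The sketch below decodes a tiny hand-written snapshot in a simplified subset of that layout. It is not the implementation of heapsnapshot-parser, the decode function and the toy data are made up for illustration, and real dumps have more fields and more node and edge types.

```javascript
// Toy snapshot in a simplified subset of the .heapsnapshot layout (made-up data).
const rawSnapshot = {
	snapshot: {
		meta: {
			node_fields: ['type', 'name', 'id', 'self_size', 'edge_count'],
			node_types: [['object', 'string'], 'string', 'number', 'number', 'number'],
			edge_fields: ['type', 'name_or_index', 'to_node'],
			edge_types: [['property', 'element'], 'string_or_number', 'node']
		}
	},
	// One row per node: type, name, id, self_size, edge_count
	nodes: [0, 0, 1, 16, 1,
	        1, 2, 3, 24, 0],
	// One row per edge: type, name_or_index, to_node (offset into `nodes`)
	edges: [0, 1, 5],
	strings: ['Object', 'someText', 'Some Text here']
};

function decode(raw) {
	const { node_fields, node_types, edge_fields, edge_types } = raw.snapshot.meta;
	const nf = node_fields.length, ef = edge_fields.length;
	const nType = node_fields.indexOf('type'), nName = node_fields.indexOf('name');
	const nId = node_fields.indexOf('id'), nEdges = node_fields.indexOf('edge_count');
	const eType = edge_fields.indexOf('type'), eName = edge_fields.indexOf('name_or_index');
	const eTo = edge_fields.indexOf('to_node');

	const nodes = [];
	for (let i = 0; i < raw.nodes.length; i += nf) {
		nodes.push({
			type: node_types[0][raw.nodes[i + nType]],
			name: raw.strings[raw.nodes[i + nName]],
			id: raw.nodes[i + nId],
			edgeCount: raw.nodes[i + nEdges]
		});
	}

	// Edges are listed in node order; each node owns its next edgeCount rows.
	const edges = [];
	let e = 0;
	for (const fromNode of nodes) {
		for (let k = 0; k < fromNode.edgeCount; k++, e += ef) {
			const type = edge_types[0][raw.edges[e + eType]];
			const nameOrIndex = raw.edges[e + eName];
			// Element and hidden edges carry a numeric index, the rest a string id.
			const isIndex = type === 'element' || type === 'hidden';
			edges.push({
				type,
				name_or_index: isIndex ? nameOrIndex : raw.strings[nameOrIndex],
				fromNode,
				toNode: nodes[raw.edges[e + eTo] / nf]
			});
		}
	}
	return { nodes, edges };
}

const decoded = decode(rawSnapshot);
console.log(decoded.edges[0].name_or_index, '->', decoded.edges[0].toNode.name);
```

The decoded shape (nodes with id, type and name; edges with fromNode, toNode and name_or_index) mirrors the fields the rest of this post reads from the parsed snapshot.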

What we have here is the raw dump of the object graph: a list of nodes and a list of edges between them. Now, we need to recreate JS objects from it:

var objectsById = {};
// First pass: create one placeholder object per heap node.
snapshot.nodes.forEach(node => {
	objectsById[node.id] = {};
	
	// String nodes carry their text in the node name.
	if(node.type === "string"){
		objectsById[node.id].__syntheticValue = node.name;
	}
	
	objectsById[node.id].__syntheticId = node.id;
	objectsById[node.id].__syntheticParents = [];
});

// Second pass: attach children under the edge name and record back-references.
snapshot.edges.forEach(edge => {
	objectsById[edge.fromNode.id][edge.name_or_index] = objectsById[edge.toNode.id];
	objectsById[edge.toNode.id].__syntheticParents.push({
		target: objectsById[edge.fromNode.id],
		edgeName: edge.name_or_index
	});
});

We end up with the objectsById collection, which holds all the objects. Notice that we extract the string value from the node name and store a couple of synthetic helper fields: the node id and the list of parents.
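To see the reconstruction in isolation, here is a small sketch that runs the same two passes over hand-made data. The toySnapshot contents and the buildObjects wrapper are hypothetical. They only mimic the fields used above (id, type, name, fromNode, toNode, name_or_index) and are not a real dump.

```javascript
// Hand-made nodes and edges in the shape read from the parsed snapshot (toy data).
const win = { id: 1, type: 'object', name: 'Window' };
const obj = { id: 3, type: 'object', name: 'Object' };
const text = { id: 5, type: 'string', name: 'Some Text here' };
const toySnapshot = {
	nodes: [win, obj, text],
	edges: [
		{ fromNode: win, toNode: obj, name_or_index: 'someObject' },
		{ fromNode: obj, toNode: text, name_or_index: 'someText' }
	]
};

// Same two passes as above, wrapped in a function for reuse.
function buildObjects(snapshot) {
	const byId = {};
	snapshot.nodes.forEach(node => {
		byId[node.id] = { __syntheticId: node.id, __syntheticParents: [] };
		if (node.type === 'string') byId[node.id].__syntheticValue = node.name;
	});
	snapshot.edges.forEach(edge => {
		const from = byId[edge.fromNode.id];
		const to = byId[edge.toNode.id];
		from[edge.name_or_index] = to;
		to.__syntheticParents.push({ target: from, edgeName: edge.name_or_index });
	});
	return byId;
}

const toyObjects = buildObjects(toySnapshot);
console.log(toyObjects[1].someObject.someText.__syntheticValue); // prints: Some Text here
console.log(toyObjects[5].__syntheticParents[0].edgeName);       // prints: someText
```

One limitation of this approach is that two edges with the same name leaving one node overwrite each other, and an edge named like one of the synthetic fields would clobber it. The __syntheticParents list still records every edge.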

Now, we want to traverse them and find the string. We provide a helper function:

var parentPathsAtHeight = (o, height, path) => {
	if(height == 0) return [{path: path.join(","), target: o}];
	return o.__syntheticParents.flatMap(p => parentPathsAtHeight(p.target, height-1, path.concat(p.edgeName)));
}

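This helper walks up the graph from a given object, following parent links for exactly the given number of steps. It returns every ancestor found at that height together with the edge names that lead to it, with the edge nearest to the starting object listed first. We now want to find the text Some Text here, and since we know it’s a direct child of the object we’re after, we just need to go one parent up: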

var oneLineText = "Some Text here";
var matchingStrings = Object.values(objectsById).filter(o => o.__syntheticValue == oneLineText).map(s => {return {
	o: s,
	parents: parentPathsAtHeight(s, 1, [])
}});
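To make the shape of the result concrete, the sketch below runs the helper on a hand-built graph in which one string is referenced from two places. The objects and the edge names other and label are made up for illustration.

```javascript
// Same helper as above, repeated so the example is self-contained.
const parentPathsAtHeight = (o, height, path) => {
	if (height == 0) return [{ path: path.join(","), target: o }];
	return o.__syntheticParents.flatMap(p =>
		parentPathsAtHeight(p.target, height - 1, path.concat(p.edgeName)));
};

// Toy graph: root.someObject.someText and other.label point at the same string.
const root = { __syntheticParents: [] };
const holder = { __syntheticParents: [{ target: root, edgeName: 'someObject' }] };
const other = { __syntheticParents: [] };
const str = {
	__syntheticValue: 'Some Text here',
	__syntheticParents: [
		{ target: holder, edgeName: 'someText' },
		{ target: other, edgeName: 'label' }
	]
};

const oneUp = parentPathsAtHeight(str, 1, []);
const twoUp = parentPathsAtHeight(str, 2, []);
console.log(oneUp.map(p => p.path)); // prints: [ 'someText', 'label' ]
console.log(twoUp.map(p => p.path)); // prints: [ 'someText,someObject' ]
```

Note that a path reads from the string upwards, so 'someText,someObject' means the string is stored under someText in an object that is itself stored under someObject. Parents without enough ancestors, like other at height 2, simply drop out of the result.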

Obviously, this one string may be held by multiple objects so we need to understand the structure of the parents to find the right one. We can now traverse this in any way, for instance like this:

var wantedObject = matchingStrings[0].parents.filter(p => p.path.startsWith("whateverProperty")).filter(p => p.path.indexOf("someProperty,someOtherProperty") >= 0)[0].target

Since this is a simple memory dump, we don’t need to do that. We just know that the first matching string is the one we need, and that in this run its second parent (index 1) is our object. Now, we can dump the values:

matchingStrings[0].parents[1].target
{
  __syntheticId: 5777,
  __syntheticParents: [
    { target: [Object], edgeName: '85 / DevTools console' },
    { target: [Object], edgeName: '86 / DevTools console' },
    { target: [Object], edgeName: 'someObject' },
    { target: [Object], edgeName: 'value' }
  ],
  someText: {
    __syntheticValue: 'Some Text here',
    __syntheticId: 15581,
    __syntheticParents: [ [Object], [Object], [Object] ],
    map: {
      __syntheticId: 91,
      __syntheticParents: [Array],
      dependent_code: [Object],
      map: [Object]
    }
  },
  someNumber: {
    __syntheticId: 49703,
    __syntheticParents: [ [Object] ],
    value: {
      __syntheticValue: '12345678',
      __syntheticId: 49705,
      __syntheticParents: [Array]
    }
  },
  someArray: {
    __syntheticId: 49707,
    __syntheticParents: [ [Object] ],
    '<dummy>': {
      __syntheticValue: 'Array element 1',
      __syntheticId: 14887,
      __syntheticParents: [Array],
      map: [Object]
    },
    '': {
      __syntheticValue: 'Array element 2',
      __syntheticId: 15879,
      __syntheticParents: [Array],
      map: [Object]
    },
    elements: {
      '0': [Object],
      '1': [Object],
      __syntheticId: 49711,
      __syntheticParents: [Array],
      map: [Object]
    },
    map: {
      __syntheticId: 34723,
      __syntheticParents: [Array],
      transitions: [Object],
      descriptors: [Object],
      prototype: [Object],
      back_pointer: [Object],
      dependent_code: [Object],
      map: [Object],
      '<dummy>': [Object]
    }
  },
  someBoolean: {
    __syntheticId: 71,
    __syntheticParents: [
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object], [Object],
      [Object], [Object], [Object]
    ],
    map: {
      __syntheticId: 267,
      __syntheticParents: [Array],
      dependent_code: [Object],
      map: [Object]
    },
    '<dummy>': {
      __syntheticValue: 'true',
      __syntheticId: 1101,
      __syntheticParents: [Array],
      map: [Object]
    },
    '': {
      __syntheticValue: 'boolean',
      __syntheticId: 619,
      __syntheticParents: [Array],
      map: [Object]
    }
  },
  map: {
    __syntheticId: 49709,
    __syntheticParents: [ [Object], [Object] ],
    descriptors: {
      '0': [Object],
      '3': [Object],
      '6': [Object],
      '9': [Object],
      __syntheticId: 49697,
      __syntheticParents: [Array],
      enum_cache: [Object],
      map: [Object]
    },
    prototype: {
      __syntheticId: 30425,
      __syntheticParents: [Array],
      constructor: [Object],
      __defineGetter__: [Object],
      __defineSetter__: [Object],
      hasOwnProperty: [Object],
      __lookupGetter__: [Object],
      __lookupSetter__: [Object],
      isPrototypeOf: [Object],
      propertyIsEnumerable: [Object],
      toString: [Object],
      valueOf: [Object],
      'get __proto__': [Object],
      'set __proto__': [Object],
      toLocaleString: [Object],
      properties: [Object],
      map: [Object]
    },
    back_pointer: {
      __syntheticId: 56833,
      __syntheticParents: [Array],
      transition: [Circular],
      descriptors: [Object],
      prototype: [Object],
      back_pointer: [Object],
      dependent_code: [Object],
      map: [Object],
      '<dummy>': [Object]
    },
    dependent_code: { __syntheticId: 315, __syntheticParents: [Array], map: [Object] },
    map: {
      __syntheticId: 77,
      __syntheticParents: [Array],
      dependent_code: [Object],
      map: [Circular]
    },
    '<dummy>': { __syntheticId: 1429, __syntheticParents: [Array] }
  }
}

Okay, we can see a lot here. We do see maps stored by V8 to handle object internals, we see properties, parents etc. The most important thing is:

> matchingStrings[0].parents[1].target.someText.__syntheticValue
'Some Text here'
> matchingStrings[0].parents[1].target.someNumber.value.__syntheticValue
'12345678'
> matchingStrings[0].parents[1].target.someBoolean["<dummy>"].__syntheticValue
'true'
> matchingStrings[0].parents[1].target.someArray.elements["0"].__syntheticValue
'Array element 1'
> matchingStrings[0].parents[1].target.someArray.elements["1"].__syntheticValue
'Array element 2'

So we can see that strings are extracted and stored in __syntheticValue. Numbers sit one level down, under a property named value. Booleans are stored in a property named <dummy>, whereas arrays have an additional property called elements. Note that every value comes back as a string, including the number and the boolean. Apart from that, we can get all values from the dump.

It should now be straightforward to analyze memory dumps automatically. The parsing logic is deliberately simple and can be adjusted to our needs.
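These observations can be folded into a small conversion helper. The sketch below is a heuristic only: toPlainValue is a hypothetical name, and the edge names it checks (value, <dummy>, the empty name, elements) are simply the ones seen in the dump above. They are V8 internals as rendered by the parser and may differ between browser versions.

```javascript
// Heuristic sketch: turn a reconstructed node into a plain JS value.
// Relies only on the edge names observed in the dump in this post.
function toPlainValue(o) {
	if (!o) return undefined;
	// Strings: the text was copied into __syntheticValue.
	if ('__syntheticValue' in o) return o.__syntheticValue;
	// Arrays: items live under numeric keys of `elements`.
	if (o.elements) {
		return Object.keys(o.elements)
			.filter(k => /^\d+$/.test(k))
			.map(k => toPlainValue(o.elements[k]));
	}
	// Booleans: the empty-named edge says 'boolean', '<dummy>' holds 'true'/'false'.
	if (o['']?.__syntheticValue === 'boolean') {
		return o['<dummy>']?.__syntheticValue === 'true';
	}
	// Numbers: the digits are stored as text one level down, under `value`.
	if (o.value && '__syntheticValue' in o.value) {
		return Number(o.value.__syntheticValue);
	}
	return undefined;
}

// Hand-made sample shaped like the output above (trimmed to the relevant keys).
const sample = {
	someText: { __syntheticValue: 'Some Text here' },
	someNumber: { value: { __syntheticValue: '12345678' } },
	someBoolean: {
		'<dummy>': { __syntheticValue: 'true' },
		'': { __syntheticValue: 'boolean' }
	},
	someArray: {
		elements: {
			'0': { __syntheticValue: 'Array element 1' },
			'1': { __syntheticValue: 'Array element 2' },
			__syntheticId: 49711
		}
	}
};

for (const key of Object.keys(sample)) {
	console.log(key, toPlainValue(sample[key]));
}
```

The array check runs before the boolean check on purpose, because the array node in the dump above also carries <dummy> and empty-named edges.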
