Investigating Low Quality Location Data #2 - Suspicious Activity Over Greenland - Lab Notes

Note: This is a companion post to Investigating Low Quality Location Data #2 - Suspicious Activity Over Greenland

Audience data validation is a crucial part of delivering accurate behavioral profiles. After seeing a suspiciously high number of mobile ads with locations over Greenland, the Arctic Circle, and the middle of the ocean, the curious engineers at Factual decided to figure out what these apps were up to.

Suspicious points are visible just by looking at the aggregated coordinates from all apps. Using nfu, we can plot latitude and longitude points from a file with %d specifying one dot per point:

$ nfu latlngs.gz -f10p %d

There appear to be two rectangles, one superimposed on the other. Each spans 90 degrees of latitude and longitude and appears to be uniformly distributed:

Time to do some digging.


Confirming sources of bogus data
This part is fairly easy: by geofencing, we can choose a rectangle that contains almost entirely invalid data. Let’s take longitudes in [-90, 0] and latitudes above 70 degrees (the Arctic Circle is everything north of about 66.5 degrees).

In the snippet below, “all-tuples.gz” contains latitude and longitude (fields 2 and 3) and app name (field 9). The output is apps with the most bogus data printed in descending order.

$ nfu all-tuples.gz -k '%2 >= 70 && %3 >= -90 && %3 <= 0' -f9gcO
498     app-752
438     app-453
344     app-833
339     app-528
290     app-431
274     app-38
252     app-561
233     app-181
212     app-789
...
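For readers who don't use nfu, the same geofence-and-count idea can be sketched in Python. This is only an illustration: the function names and the sample rows are hypothetical, and the rows are assumed to be already parsed into (app, latitude, longitude) tuples.

```python
from collections import Counter

def in_arctic_fence(lat, lng):
    """True for points at or north of 70 degrees with longitude in [-90, 0]."""
    return lat >= 70 and -90 <= lng <= 0

def count_fenced_points(rows):
    """rows: iterable of (app, lat, lng). Returns a Counter of fenced points per app."""
    counts = Counter()
    for app, lat, lng in rows:
        if in_arctic_fence(lat, lng):
            counts[app] += 1
    return counts

# Hypothetical sample rows, not taken from the real data set.
sample = [
    ("app-a", 75.2, -40.1),
    ("app-a", 81.9, -12.7),
    ("app-b", 40.7, -74.0),   # New York: outside the fence
    ("app-b", 72.3, -88.8),
    ("app-c", 71.0, 25.8),    # latitude passes, longitude fails
]

# Print apps in descending order of fenced points, like the nfu output above.
for app, n in count_fenced_points(sample).most_common():
    print(n, app)
```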

None of the app names looked familiar (they are anonymized for this blog post). Let’s see how much of the overall data they represent:

$ nfu all-tuples.gz -f9gcf10 | gzip > apps-by-count.gz
$ nfu apps-by-count.gz -f1sT+1
1013189         # total data points
$ nfu all-tuples.gz -k '%2 >= 70 && %3 >= -90 && %3 <= 0' \
      -f9gcf1.i0 apps-by-count.gz \
      -f1sT+1m '%0 / 0.2'
97670           # approximate number of total points in the rectangles
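As a quick sanity check on scale, the two figures printed above can be compared directly. The variable names here are just labels for those two numbers.

```python
total_points = 1013189        # total data points, from apps-by-count.gz
rectangle_estimate = 97670    # approximate number of points in the rectangles

# Share of the whole data set that the estimate represents.
share = rectangle_estimate / total_points
print(f"{share:.1%}")
```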

OK, so they collectively account for roughly 10% of all data points (about 97,670 of 1,013,189). But how do we know for sure that we can disregard all geocoordinates coming from these particular apps? We need to make sure we didn’t also catch any apps with legitimate snow-loving travelers, so let’s look for things outside the known-invalid rectangles:

$ nfu all-tuples.gz -f923 | gzip > latlngs-by-app.gz
$ nfu all-tuples.gz -k '%2 >= 70 && %3 >= -90 && %3 <= 0' -f9gcf1 \
    | gzip > rectangle-apps.gz
$ nfu rectangle-apps.gz -i0 latlngs-by-app.gz \
      -k '%1 < 0 || %2 > 0 || %2 < -100' -f0gcO
184     app-38
179     app-90

For app-38 and app-90, it appears two-thirds of their data points aren’t in either known-invalid rectangle. For now let’s exclude those from the file we’re investigating, since they do appear to be reporting some valid user data:

$ zgrep -Ev 'app-90|app-38' rectangle-apps.gz | gzip > real-rectangle-apps.gz
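The check-and-exclude step above can be sketched in Python as follows. The predicate mirrors the nfu filter (latitude below 0, longitude above 0, or longitude below -100); the function names, the zero-tolerance threshold, and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

def outside_known_rectangles(lat, lng):
    """Mirror of the nfu filter: lat < 0, or lng > 0, or lng < -100."""
    return lat < 0 or lng > 0 or lng < -100

def outside_fraction_by_app(rows):
    """rows: iterable of (app, lat, lng). Returns {app: share of points outside}."""
    totals = defaultdict(int)
    outside = defaultdict(int)
    for app, lat, lng in rows:
        totals[app] += 1
        if outside_known_rectangles(lat, lng):
            outside[app] += 1
    return {app: outside[app] / totals[app] for app in totals}

def apps_to_keep_investigating(rows, max_outside_fraction=0.0):
    """Apps whose points fall (almost) entirely inside the suspicious rectangles."""
    fractions = outside_fraction_by_app(rows)
    return sorted(app for app, f in fractions.items() if f <= max_outside_fraction)

# Hypothetical sample rows.
sample = [
    ("app-x", 75.0, -30.0),
    ("app-x", 12.5, -60.2),
    ("app-y", 80.1, -45.0),
    ("app-y", -33.9, 151.2),   # Sydney: clearly outside
    ("app-y", 48.9, 2.3),      # Paris: clearly outside
]

print(outside_fraction_by_app(sample))
print(apps_to_keep_investigating(sample))
```

An app like the hypothetical app-y, with most of its points outside the rectangles, would be set aside in the same way app-38 and app-90 are.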


How it was generated
For the remaining apps, the high user activity in Greenland is obviously bad data, but sometimes we can still get useful information by taking a closer look. Borrowing a couple of ideas from my last post, let’s start by looking for abnormal correlations that might tell us something about how the data was generated. In particular, here’s the latitude correlation between the integer part and the first two digits of the fractional part (-gcf1. is an nfu idiom to deduplicate, which we want to do here because devices tend to cache locations):

$ nfu real-rectangle-apps.gz -i0 latlngs-by-app.gz \
      -gcf1.m 'row $1, $2 if %1 =~ /(\d+)\.(\d{2})/' \
      -m 'row map $_ + rand(), @_' -p %d

Longitude:

Some things jump out right away:

  1. Both latitude and longitude appear to be generated the same way because the graphs look nearly identical.
  2. Fractional parts beginning with 0 are artificially uncommon; they seem to belong only to the rectangle extending north to 100 degrees (the “more north” pole).
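A rough Python equivalent of the extraction behind these plots: split each coordinate string into its integer part and the first two fractional digits, then tally the leading fractional digit. The regular expression is the one used in the nfu command; the function names and sample strings are hypothetical.

```python
import re
from collections import Counter

# Same pattern as the nfu snippet: integer part, dot, first two fractional digits.
COORD = re.compile(r"(\d+)\.(\d{2})")

def integer_and_fraction(coord_text):
    """Return (integer part, first two fractional digits) or None if no match."""
    match = COORD.search(coord_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def leading_fraction_digit_counts(coords):
    """Count how often each digit 0-9 leads the fractional part."""
    counts = Counter()
    for text in coords:
        pair = integer_and_fraction(text)
        if pair is not None:
            counts[pair[1] // 10] += 1
    return counts

# Hypothetical coordinate strings; the last two do not match the pattern.
sample = ["71.34521", "85.9102", "-45.0712", "100.5", "12"]
print(leading_fraction_digit_counts(sample))
```

On real data, a count for digit 0 that is far below one tenth of the total would correspond to the gap seen in the plots.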

I also attached the OS to see whether that was a causal factor (and because Android apps can easily be disassembled):

$ nfu real-rectangle-apps.gz -i0 latlngs-by-app.gz \
      -gcf1.A 'row $_, scalar(@{%1}), scalar grep /\.0/, @{%1}' -k '%2 == 0' \
      -i0 @[ all-tuples.gz -f96gcf1. ] -k '%3' -f1230

13      0       Android app-20
821     0       Android app-55
794     0       Android app-64
8       0       Android app-65
8       0       iOS     app-65
1019    0       iOS     app-181
2       0       iOS     app-329
1291    0       Android app-431
2109    0       iOS     app-453
19      0       Android app-462
1593    0       iOS     app-528
2       0       iOS     app-561
2118    0       Android app-752

I’m surprised to see apps on different operating systems all using the same strategy. Before getting into any code, let’s see if there’s an obvious reason someone would want to omit the zero digit prefix (e.g. to avoid numbers like 34.058). Here’s correlation between digit position and value:

$ nfu real-rectangle-apps.gz -i0 latlngs-by-app.gz \
      -gcf1.m 'my @ds = %1 =~ s/^.*\.//r =~ /(\d)/g;
               map row($_, $ds[$_]), 0..$#ds' \
      -m 'row map $_ + 0.9 * rand(), @_' -p %d

That looks about as we expect; the absence of zeroes in the sixth position is just because most number formatters drop trailing zeroes in the fraction.
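The position-versus-value tally can be sketched in Python like this. It is an approximation of the nfu pipeline rather than a translation: strings without a decimal point are simply skipped here, and the names and sample values are hypothetical.

```python
from collections import Counter

def fractional_digits(coord_text):
    """Return (position, digit) pairs for the digits after the decimal point."""
    _, _, fraction = coord_text.partition(".")
    digits = [ch for ch in fraction if ch in "0123456789"]
    return [(pos, int(ch)) for pos, ch in enumerate(digits)]

def digit_position_counts(coords):
    """Count occurrences of each (position, digit) pair across all coordinates."""
    counts = Counter()
    for text in coords:
        counts.update(fractional_digits(text))
    return counts

# Hypothetical coordinate strings.
sample = ["71.345210", "85.9102", "-45.0712", "12"]
counts = digit_position_counts(sample)
for (pos, digit), n in sorted(counts.items()):
    print(pos, digit, n)
```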

Separating app into the third dimension:

$ nfu real-rectangle-apps.gz -i0 latlngs-by-app.gz \
      -gcf1.m 'my @ds = %1 =~ s/^.*\.//r =~ /(\d)/g;
               map row(%0, $_, $ds[$_]), 0..$#ds' \
      --intify 0 -m 'row %0, %1 + 0.9*rand(), %2 + 0.9*rand()' \
      --splot %d

It’s difficult to visualize 3D data effectively, but the thing to notice here is that most of the high-data apps have exactly the same pattern.

Another thing to verify is that these aren’t all coming from the same device:

$ nfu real-rectangle-apps.gz -i0 latlngs-by-app.gz \
      -gcf1.A 'row $_ unless grep /\.0/, @{%1}' -k %0 \
      -i0 @[ all-tuples.gz -f98 ] \
      -A 'row scalar(@{%1}), scalar(uniq @{%1}), $_'

# N     devs    appname
16      13      app-20
824     821     app-55
795     794     app-64
289     8       app-65
1023    1019    app-181
129     2       app-329
1296    1291    app-431
2110    2109    app-453
19      19      app-462
1594    1593    app-528
597     2       app-561
2119    2118    app-752

Quite the opposite: most of the apps have just one data point per device. Likewise, only six of the 9788 devices were associated with more than one app:

$ nfu real-rectangle-apps.gz -i0 latlngs-by-app.gz \
      -gcf1.A 'row $_ unless grep /\.0/, @{%1}' -k %0 \
      -i0   @[ all-tuples.gz -f98 ] \
      -f1i0 @[ all-tuples.gz -f89 ] \
      -gA 'row $_, scalar(@{%1}), scalar uniq @{%1}' \
      --intify 0 -k '%2 > 1'

# device       N       unique apps
device-1071    2       2
device-8570    1250    3
device-8628    2750    3
device-8692    28224   2
device-8693    4900    3
device-8892    21540   3
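Both device checks above amount to grouping (app, device) pairs two different ways. A minimal Python sketch, with hypothetical function names and sample pairs:

```python
from collections import defaultdict

def per_app_device_stats(rows):
    """rows: (app, device) pairs, one per data point.
    Returns {app: (number of points, number of unique devices)}."""
    points = defaultdict(int)
    devices = defaultdict(set)
    for app, device in rows:
        points[app] += 1
        devices[app].add(device)
    return {app: (points[app], len(devices[app])) for app in points}

def devices_with_multiple_apps(rows):
    """Returns {device: number of distinct apps} for devices seen in more than one app."""
    apps_by_device = defaultdict(set)
    for app, device in rows:
        apps_by_device[device].add(app)
    return {d: len(apps) for d, apps in apps_by_device.items() if len(apps) > 1}

# Hypothetical sample: app-1 has one point per device, app-2 has one busy device.
sample = [
    ("app-1", "dev-a"), ("app-1", "dev-b"), ("app-1", "dev-c"),
    ("app-2", "dev-a"), ("app-2", "dev-a"), ("app-2", "dev-a"),
]
print(per_app_device_stats(sample))
print(devices_with_multiple_apps(sample))
```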


Back to the drawing board
I disassembled app-431 and looked for calls to Random methods like nextDouble, nextFloat, etc., but didn’t end up finding anything interesting. At this point my guess was that the apps themselves weren’t generating the bogus points; instead, some third party was fabricating app names, device IDs, locations, and even operating systems.

To gather more information I did a significant-terms analysis of the records with bad location data by splitting the unfiltered data into individual words and looking for any with an unusually high frequency. The first interesting terms, about 100 lines down, belonged to an app publisher:

$ nfu --run '++$::x{$_} for grep length, rl("real-rectangle-apps.gz")' \
      unfiltered.gz -k 'grep $::x{$_}, %0 =~ /"([^"]+)"/g' \
      -nm 'map row(%0, $_), split /\W+/, %1' \
      -f10gA 'row $_, scalar uniq @{%1}' -O1

...
pubnameword-851    4229
pubid-501          4229
pubnameword-320    4229
...

Terms are anonymized for this post, but these stood out because they were uncommon and fall outside the OpenRTB schema.
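One simple way to approximate a significant-terms analysis in Python is to compare how often each word appears in the suspect records against how often it appears overall. This is a simplified stand-in, not the nfu pipeline above: the scoring formula, the add-one smoothing, the function names, and the sample records are all assumptions.

```python
import re
from collections import Counter

def term_record_counts(records):
    """Count, for each word, the number of records that contain it."""
    counts = Counter()
    for record in records:
        counts.update(set(re.findall(r"\w+", record)))
    return counts

def overrepresented_terms(suspect, background, min_count=2):
    """Rank terms by (rate in suspect records) / (smoothed rate in all records)."""
    suspect_counts = term_record_counts(suspect)
    background_counts = term_record_counts(background)
    scored = []
    for term, n in suspect_counts.items():
        if n < min_count:
            continue
        suspect_rate = n / len(suspect)
        # Add-one smoothing keeps the ratio finite for very rare terms.
        background_rate = (background_counts[term] + 1) / (len(background) + 1)
        scored.append((suspect_rate / background_rate, term))
    return sorted(scored, reverse=True)

# Hypothetical records; "acme" plays the role of the unusual publisher term.
suspect = [
    '{"pub":"acme","os":"ios"}',
    '{"pub":"acme","os":"android"}',
    '{"pub":"acme","os":"ios"}',
]
background = suspect + [
    '{"pub":"zeta","os":"ios"}',
    '{"pub":"zeta","os":"ios"}',
    '{"pub":"yoyo","os":"android"}',
    '{"pub":"yoyo","os":"ios"}',
    '{"pub":"zeta","os":"android"}',
    '{"pub":"yoyo","os":"ios"}',
]
for score, term in overrepresented_terms(suspect, background):
    print(f"{score:.2f} {term}")
```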

I searched for that publisher through the unfiltered stream and looked at the geo:

$ nfu unfiltered.gz -k '%0 =~ /"pubid-501"/' | gzip > pub501.gz
$ nfu pub501.gz \
      -m 'my $j = jd(%0); row $j.payload.device.geo.lon,
                              $j.payload.device.geo.lat' -p %d

Here’s the difference when we remove this publisher (original on left):

This publisher is clearly contributing a significant amount of the bad data. No zeroes are ever generated in the most-significant fractional position, which could be a deliberate attempt to increase precision; just as likely, though, is that they started by generating two random integers and glued them together with string operations. PHP’s rand() function, for example, generates integers rather than floats – though that’s not much to go on.
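To illustrate the "glued integers" hypothesis, here is a small Python simulation. It is a guess at the mechanism, not the publisher's actual code: the ranges, function names, and seed are assumptions. A positive integer printed in decimal never starts with 0, so a fractional part built this way never has a leading zero, whereas a formatted random float does about one time in ten.

```python
import random

def glued_coordinate(rng, max_whole=89, max_fraction=999999):
    """Build a coordinate string from two random integers joined by a dot.
    The fraction starts at 1 (an assumption) so it is always a positive integer."""
    whole = rng.randint(0, max_whole)
    fraction = rng.randint(1, max_fraction)
    return f"{whole}.{fraction}"

def float_coordinate(rng, max_degrees=90.0):
    """Build a coordinate string by formatting a uniformly random float."""
    return f"{rng.uniform(0.0, max_degrees):.6f}"

def leading_zero_share(coords):
    """Share of coordinates whose fractional part begins with 0."""
    return sum(1 for c in coords if ".0" in c) / len(coords)

rng = random.Random(0)  # fixed seed so the run is repeatable
glued = [glued_coordinate(rng) for _ in range(10000)]
floats = [float_coordinate(rng) for _ in range(10000)]

print("glued integers:", leading_zero_share(glued))
print("random floats: ", leading_zero_share(floats))
```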


Other bogus publishers
Does any publisher that produces bogus data also produce any valid data? If not, then we can just blacklist all of them. Here I’m splitting the data into separate geo-planes, one per publisher:

$ nfu unfiltered.gz \
      -m 'my $j = jd(%0);
          row $j.payload.app.name,
              je($j.payload.app.publisher),
              $j.payload.device.geo.lat // "",
              $j.payload.device.geo.lon // ""' \
  | gzip > publatlng-by-app.gz

$ nfu --run '++$::p{$_} for
             rl row qw[sh:nfu real-rectangle-apps.gz
                              -i0 publatlng-by-app.gz
                              -f1m "eval {(jd(%0)//{}).id} // %0"
                              -k "length %0" -gcf1.]' \
      unfiltered.gz -k 'grep $::p{$_}, %0 =~ /\"([^"]+)\"/g' \
      -m 'my $j = jd(%0); row $j.payload.app.publisher.id // "",
                              $j.payload.app.publisher.name // "",
                              $j.payload.device.geo.lon // "",
                              $j.payload.device.geo.lat // ""' \
      -k '%2 && %3' \
  | gzip > suspect-tuples.gz

$ nfu suspect-tuples.gz -f023 --intify 0 --splot %d

Not bad; just one publisher has non-bogus data (and we can easily detect this false-positive because it has data outside the rectangles). This means that the app’s publisher is probably the predictive dimension. Let’s make sure we got everything:

$ nfu --run '++$::p{$_} for rl q{sh:nfu suspect-tuples.gz -f0};
             ++$::n{$_} for rl q{sh:nfu suspect-tuples.gz -f1}' \
      publatlng-by-app.gz \
      -K 'my $j = eval {jd(%1)};
          !$j || $::p{$j.id // ""} || $::n{$j.name // ""}' \
      -k '%2 && %3' -f32p %d

Looks like we did. Here’s the distribution without those bogus publishers:


Constructing the classifier
There isn’t any publisher or app metadata that reliably indicates when the data is suspect, but we have a lot of statistical leverage we can use. Starting with the base rate of bad publishers (well, for this particular mode of bad data anyway):

$ nfu unfiltered.gz \
      -m 'my $j = jd(%0).payload.app.publisher // {};
          row $j.id // "", $j.name // ""' \
      -gc \
  | gzip > publisher-distribution.gz

$ zgrep -Ec '$' publisher-distribution.gz
910
$ nfu suspect-tuples.gz -f0gcnf0T+1
31

About 3.4% of publishers (31 of 910) provide mostly or entirely bad data. Put another way, we need about 5 bits of evidence to make a bad publisher the most likely explanation. We gather that evidence cumulatively:

# bits per point in rectangle:
$ nfu all-tuples.gz -m '%2 >= 0 && %3 >= -90 && %3 <= 0 || 0' \
                    -a0T+1m '-log2(%0)'
1.27671079919326

# bits per nonzero lead fractional digit in both lat and lng:
$ nfu all-tuples.gz -m '%2 !~ /\.0/ && %3 !~ /\.0/ || 0' -a0T+1m '-log2(%0)'
0.342563687691395

Put together, that’s 1.62 bits of evidence per positive observation; so after observing 10 data points we can conclude with 99.9% accuracy that a publisher is supplying wholly inaccurate data. In fact, we can actually do much better because we have population models for geo areas. We can use that to significantly increase the information content of most of the rectangular area while reducing the probability of false-positives.
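The arithmetic behind these figures can be worked through in Python. This sketch assumes each positive observation is independent and treats the per-point bits as log-likelihood ratios, which is a simplification; the prior is expressed as odds (31 bad publishers to 879 others), and the function name is hypothetical.

```python
import math

bad_publishers = 31
all_publishers = 910

# Prior evidence needed: log2 of the odds against a publisher being bad.
prior_bits = math.log2((all_publishers - bad_publishers) / bad_publishers)

# Per-observation evidence, copied from the two nfu outputs above.
bits_rectangle = 1.27671079919326
bits_digits = 0.342563687691395
bits_per_point = bits_rectangle + bits_digits

def posterior_bad(n_points):
    """Probability that a publisher is bad after n positive observations,
    assuming independent observations (a simplifying assumption)."""
    net_bits = n_points * bits_per_point - prior_bits
    odds = 2.0 ** net_bits
    return odds / (1.0 + odds)

print(f"prior bits needed:  {prior_bits:.2f}")
print(f"bits per point:     {bits_per_point:.2f}")
print(f"after 10 points:    {posterior_bad(10):.4f}")
```

Under these assumptions, roughly three positive observations are enough to reach even odds, and ten push the posterior above 99.9%.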

- Spencer Tipping, Software Engineer
