Privacy/Reviews/KPI Backend: Difference between revisions

From MozillaWiki
(→‎Type of data stored:: Adding timestamp rounding note)
 
(33 intermediate revisions by 3 users not shown)
Line 8: Line 8:
|'''Product Champions:''' || Austin King
|'''Product Champions:''' || Austin King
|-
|-
|'''Privacy Champions:''' || TBD
|'''Privacy Champions:''' || Sid Stamm
|-
|-
|'''Security Contact:''' || TBD
|'''Security Contact:''' || Curtis Koenig
|-
|-
|'''Document State:''' || <section begin='status'/>{{ok|ready for review?}}<section end='status'/>
|'''Document State:''' || <section begin='status'/>{{ok|CfC closed}}<section end='status'/>
|}
|}


Line 19: Line 19:


{|
{|
|'''Architectural Overview:''' || TBD
|'''Architectural Overview:''' || July-2012
|-
|-
|'''Recommendation Meeting:''' || TBD
|'''Recommendation Meeting:''' || TBD
|-
|-
|'''Wrap-up Meeting:''' || (if necessary)
|'''Review Complete ETA:''' || <section begin='revieweta' />July-2012<section end='revieweta' />
|}
|}


Line 44: Line 44:
Describe any major components in the system and how they interact.  Also include any third-party APIs (those Mozilla does not control) and what type of data is sent or received via those APIs.
Describe any major components in the system and how they interact.  Also include any third-party APIs (those Mozilla does not control) and what type of data is sent or received via those APIs.


== Client Component ==
[[File:Kpi_backend_components.png]]
The client portion of the KPI Dashboard feature is the HTML/Javascript that runs in a user's browser when they sign into a website using Persona Sign-In on a browser without native support.  The dialog that is displayed records interactions and timing information, building a JSON data structure during interaction with the dialog.  This JSON data structure is then sent to Persona Sign-In servers at the end of the interaction.


=== Type of data stored: ===
=== Client Component ===
During the user's interaction with the dialog, we capture the following information:
The client portion of the KPI feature is the HTML/Javascript that runs in a user's browser when they sign into a website using Persona Sign-In on a browser without native support.  The dialog that is displayed records interactions and timing information, building a JSON data structure during interaction with the dialog.  This JSON data structure is then sent to Persona Sign-In servers at the end of the interaction.
* '''timestamp:''' the time that the interaction started, rounded off to a coarser-grained level to reduce traceability
* '''event_stream:''' interesting events that occurred during the user's interaction, including both events initiated by the user (mouse clicks) as well as events originating from running javascript code (keypair generation).  Each event is uniquely named and includes a time offset for when it occurred measured from when the dialog was displayed
* '''email_type:''' In the event that the interaction results in the user selecting an email address to use to sign in, we include the type of email used: "primary" is an email address from a domain that has browserid support, "secondary" is an email address from a domain that does not directly support browserid
* '''number_emails:''' When an interaction proceeds to the point where the user authenticates to the personaid service we include the number of emails that this user has verified with BrowserID
* '''new_account:''' If during an interaction a new browserid account is created, this property is true (as opposed to an interaction which represents sign in using an existing account)
* '''language:''' the language that was displayed to this user during the interaction
* '''number_sites_logged_in:''' If the user is authenticated to the Persona servers at any point during the interaction, we include the number of distinct sites that the user has logged into recently using browserid
* '''screen_size:''' The screen dimensions of the device used by the user, determined programmatically with javascript
* '''sample_rate:''' Rate at which the server is sampling client data for KPI messages. 0.1 would be a 10% sample rate. We plan on shipping with 100% traffic, so 1.0
* '''user_agent:''' A generalized version of the user agent which includes coarse-grained details for Operating System, Browser, and Browser version. Does not contain the original user agent string. Implementation will prefer non-fingerprintable values over technical accuracy (okay to encode 'unknown' for Operating System or 'Windows' if reg doesn't match a given version of Windows NT). No Operating System version is collected. Browser version will be as coarse-grained as possible, while remaining useful. Examples: Firefox 13, Chrome 13, Safari 5.1 - no build number or other fine-grained value is recorded.


=== Example data:===
=== Server Component ===
<pre>
{
    "timestamp": 1333046104322,
    "event_stream": [
        [ "picker", 732 ],
        [ "picker::change", 1700 ],
        [ "picker::signin", 2300 ],
        [ "assertion_generation", 2500 ],
        [ "certified", 3300 ],
        [ "assertion_generated", 4500 ],
        [ "complete", 4777 ]
    ],
    "email_type": "secondary",
    "number_emails": 3,
    "new_account": false,
    "language": "en_US",
    "number_sites_logged_in": 1,
    "screen_size": { "width": 640, "height": 480 },
    "sample_rate": 1.0,
    "user_agent": {
      "os": "iOS",
      "browser": "Safari",
      "version": "5.1" 
    }
}
</pre>
 
== Server Component ==
Persona ID is currently implemented in two data centers with six "webheads", frontline web servers receiving requests from client devices.  For this feature each webhead will expose a new API that accepts JSON data and forwards it to data storage servers.
Persona ID is currently implemented in two data centers with six "webheads", frontline web servers receiving requests from client devices.  For this feature each webhead will expose a new API that accepts JSON data and forwards it to data storage servers.
Data will be retained forever or purged based on resource usage. Historical data will be valuable for guiding the team's design decisions.
Data will be retained forever or purged based on resource usage. Historical data will be valuable for guiding the team's design decisions.
Line 98: Line 59:
The API requires an <code>HTTP POST</code> with a <code>CSRF</code> token.  The JSON document described above is the payload.  The server returns a 200 on successful storage, and a non-200 otherwise.  In the event of failure, the client may store the blob in <code>localStorage</code> and retry transmission at a later point.
The API requires an <code>HTTP POST</code> with a <code>CSRF</code> token.  The JSON document described above is the payload.  The server returns a 200 on successful storage, and a non-200 otherwise.  In the event of failure, the client may store the blob in <code>localStorage</code> and retry transmission at a later point.


== Data Storage Component ==
=== Data Storage Component ===
 
Persona Sign-In webheads (Server Component) serve as simple forwarders to this component, whose primary purpose is to store the data.  The Server Component does some input validation and then POSTs the data to a small number of servers (the Data Storage Component) that store it.  These servers expose a similar API for receipt of the data as the Server Component.  Additionally, these servers have APIs to allow read access to the data based on a date range, supporting streaming or pagination as desired.
 
Access to this data may be highly restricted initially, with a ''goal of opening up access as much as is feasible to allow for transparency, community involvement, and a high level of decoupling between the systems that store and the systems that analyze the data'' to answer meaningful questions about project health and usability.
 
'''Stored Data:'''


Persona Sign-In webheads serve as simple forwarders for this feature.  They may do some input validation and then POST the data to a small number of servers that store it. These servers expose a similar API for receipt of the data. Additionally, these servers have APIs to allow access to the data based on a date range, supporting streaming or pagination as desired.
{| class="wikitable"
|-
! What
! Where
|-
| JSON file
| Transmitted to a CouchDB database (this Server Component) behind our firewall, once each time a user uses the BrowserID dialog
|}
 
'''Communication with Server Component'''
 
{| class="wikitable"
|-
! Direction
! Field
! Data Sample
! Notes
|-
| ''In:''
| timestamp
| 13330461000000
| Unix timestamp rounded to 10 minute intervals
|-
|
| event_stream
| [ picker, 732 ], [ picker::change, 1700 ], [ picker::signin, 2300 ]
| A list of UI Events and time offsets in milliseconds
|-
|
| email_type
| secondary
| Which type of email was used? Primary or Secondary.
|-
|
| number_emails
| 3
| The number of emails a user has associated in their account
|-
|
| new_account
| false
| Is the user brand new in the last 24 hours?
|-
|
| language
| en_US
| i18n language code
|-
|
| number_sites_logged_in
| 1
| Number of websites which user used BrowserID on in last 24 hours
|-
|
| screen_size
| { width: 640, height: 480 }
| Device screen size
|-
|
| sample_rate
| 1.0
| Controls sampling of data submission client side
|-
|
| user_agent
| { os: iOS, browser: Safari, version: 5.1 }
| Coarse-grained user agent (not the same as User Agent string)
|-
| Out:
|
|
| Same data available behind our firewall
|}


Access to this data may be highly restricted initially, with a goal of opening up access as much as is feasible to allow for transparency, community involvement, and a high level of decoupling between the systems that store and the systems that analyze the data to answer meaningful questions about project health and usability.
[https://wiki.mozilla.org/Identity/BrowserID/KPI_Dashboard#Data_Glossary Data Glossary] for deeper details of each Data element.


= User Data Risk Minimization =
= User Data Risk Minimization =
In this section, the privacy champion will identify areas of user data risk and recommendations for minimizing the risk.  
In this section, the privacy champion will identify areas of user data risk and recommendations for minimizing the risk.
 
=== Data Access ===
 
It's an open question who will have access to this data and for what purposes.  While users may be willing to let us collect this data to monitor the quality of service and improve it, they may not be okay with us publishing the data in its raw form.
 
''The Risk'' is that the scope of data sharing and use will be unclear and we will creep from allowing unlimited access to a core group of engineers to allowing anyone to see and use the raw data.
 
''Requirement:'' Work with legal/privacy policy folks to ensure the PersonaID privacy policy documents this data collection and our intended use of the data.  Make sure our data collection practices are disclosed publicly and as clearly as possible.
 
''Requirement:'' ensure proper access controls are in place to limit access to the stored data.  List the employed controls in this resolution.
 
(ozten) We've worked with the user data safety committee to limit the scope of the data and air our plan publicly. We've worked with the legal team to revise the privacy policy and terms of use for Persona Beta. We plan on deploying this behind Corporate LDAP to limit access to current employees.
 
{{ResolutionBox|{{resolved|will be deployed behind corporate LDAP}}}}


= Alignment with Privacy Operating Principles =
= Alignment with Privacy Operating Principles =
Line 112: Line 165:


See Also: [[Privacy/Roadmap_2011#Operating_Principles:]]
See Also: [[Privacy/Roadmap_2011#Operating_Principles:]]
====Principle: Transparency / No Surprises====
This feature is invisible to users, so it may not be obvious to them that we know more than their email addresses.
''The Risk'' is that users may not know we are collecting this data.
''Recommendations'': document in the relevant privacy policies and wherever disclosures happen (e.g., on enrollment) that we collect non-personal statistics about how the system is used (and why we collect it).
{{ResolutionBox|{{resolved|updated privacy policy for Persona Beta to cover this}}}}
====Principle: Real Choice====
Can users opt out of this data collection?
''Recommendations'': Provide a way for users to opt-out of this data collection for their PersonaID profile.
(ozten) While working with the user data safety group, we worked hard with UX to try to find a way to provide an opt-out. We were unable to find a solution. We're collecting less data than the web analytics packages used on many Mozilla web properties; these packages also offer no opt-out mechanism.
{{ResolutionBox|{{done|Todo: enable opt-out via http cookie or similar mechanism [https://github.com/mozilla/browserid/issues/2412 Issue #2412]}}}}
Much of the work and design that has gone into the BrowserID protocol is to provide an "Opt Out" at a fundamental level; Users can use the decentralized mode and no KPI data is sent.
KPI data is only sent from our shim.
We cannot provide a quality shim nor fallback IdP without this KPI data.
This is a short-term need to get analytics for quality of service. As this grows and as the BrowserID spec is adopted by others, this will become unnecessary and in fact will not exist, because we won't be the provider. This is a short-term need to make sure this works as designed at scale.
====Principle: Sensible Defaults====
This feature collects reasonably innocuous data, so the risk of collecting it by default for all PersonaID interactions is fairly minimal.
''The Risk'' is that this data will be obtained by third parties and used in correlation with other data sets, including the server logs for the rest of PersonaID interactions.  If we store all the data with the same credentials (or in the same system), a security breach could result in more valuable data than a breach that only obtains one of the server log or KPI log.
''Recommendations'':  Ensure that this data is always kept on separate systems from the rest of PersonaID logs to minimize the effect of a data breach.
(ozten) Excellent. We'll make sure that happens.
{{ResolutionBox|{{ok|{{bug|773407}} deployment to separate KPI logs and other PersonaID logs}}}}
====Principle: Limited Data====
User traffic on the system is potentially logged more than necessary.  We should ensure that this data collection only happens at appropriately spaced intervals (not hundreds of times per login, for example), that we are collecting the minimum amount of data required, and that it is retained for as short a term as possible.
''The Risk'' is that we may end up with too much data that never gets used for the clear value proposed to users of the system.
''Requirement'': Work with Security Assurance and our Legal/Privacy Policy folks to minimize logging, minimize retention window, deploy a secure data storage infrastructure, and document and publish a data collection and retention policy.
(ozten) Great, we'd love to make sure this happens. We have several bugs open with these teams. What are next steps on these fronts as this fits with our current plans?
{{ResolutionBox|{{ok|Using {{Bug|742796}} and {{Bug|746245}} to track where this work is being done}}}}
{{new| (ozten) It's not clear to me who makes this happen. Will delegate to Tauni to find out who "project team" is and what needs to happen to get these links here.}}


= Follow-up Tasks and tracking  =
= Follow-up Tasks and tracking  =
Line 121: Line 227:
! Bug  
! Bug  
! Details
! Details
|-
|{{done|[https://groups.google.com/forum/?fromgroups#!topic/mozilla.dev.planning/vyK9Pa-bjt8 Call for Comments]}}
|
|
|2012.07.19
|-
| {{new|File bug and implement an opt-out.}}
| Project team
| [https://github.com/mozilla/browserid/issues/2412 Issue #2412]
|
|- 
| {{new|Add links to wiki page where requested}}
| Project team
|
|
|}
|}


[[Category:Privacy/Reviews|Template]]
[[Category:Privacy/Reviews|KPI Backend]]

Latest revision as of 16:41, 10 September 2012

Document Overview

Feature/Product: KPI Backend
Projected Feature Freeze Date: End of Q2
Product Champions: Austin King
Privacy Champions: Sid Stamm
Security Contact: Curtis Koenig
Document State: [ON TRACK] CfC closed


Timeline:

Architectural Overview: July-2012
Recommendation Meeting: TBD
Review Complete ETA: July-2012

Architecture

In this section, the product's architecture is described. Any individual components or actors are identified, their "knowledge" or what data they store is identified, and data flow between components and external entities is described.

The main objective of this feature/product is: to allow the BrowserID product team to assess how well changes to the service are meeting key performance indicators (KPI). UX will design a feature change, engineering will build it, and a KPI Dashboard will give us feedback on how successful the change is with real users.

KPI Backend must be built before we build the KPI Dashboard, which will be built next quarter and have its own privacy review. KPI Backend stores the raw data described below.

Design Documents: Link to any design or architectural documents here.

Components

Describe any major components in the system and how they interact. Also include any third-party APIs (those Mozilla does not control) and what type of data is sent or received via those APIs.

Kpi backend components.png

Client Component

The client portion of the KPI feature is the HTML/Javascript that runs in a user's browser when they sign into a website using Persona Sign-In on a browser without native support. The dialog that is displayed records interactions and timing information, building a JSON data structure during interaction with the dialog. This JSON data structure is then sent to Persona Sign-In servers at the end of the interaction.
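
As a rough illustration of how the dialog might build this structure, here is a minimal sketch. The class and method names (`InteractionRecorder`, `record`, `toPayload`) are hypothetical and not taken from the BrowserID source; only the `event_stream` shape of `[name, offset]` pairs follows the data description on this page.

```javascript
// Hypothetical sketch: names are illustrative, not from the BrowserID source.
class InteractionRecorder {
  // `now` is injectable so the recorder can be exercised without a real clock.
  constructor(now = () => Date.now()) {
    this.now = now;
    this.start = now();     // moment the dialog was displayed
    this.eventStream = [];  // list of [eventName, offsetMs] pairs
  }

  // Record a named event with its offset in milliseconds from dialog display.
  record(eventName) {
    this.eventStream.push([eventName, this.now() - this.start]);
  }

  // Build the JSON-serializable structure sent at the end of the interaction.
  toPayload(extra = {}) {
    return { event_stream: this.eventStream.slice(), ...extra };
  }
}

// Usage with a fake clock standing in for real user interaction.
let fakeTime = 1000;
const recorder = new InteractionRecorder(() => fakeTime);
fakeTime = 1732;
recorder.record("picker");
fakeTime = 2700;
recorder.record("picker::change");
const payload = recorder.toPayload({ language: "en_US" });
console.log(JSON.stringify(payload));
```

Storing offsets rather than absolute times keeps each event relative to the dialog display, which matches the description of the event stream.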

Server Component

Persona ID is currently implemented in two data centers with six "webheads", frontline web servers receiving requests from client devices. For this feature each webhead will expose a new API that accepts JSON data and forwards it to data storage servers. Data will be retained forever or purged based on resource usage. Historical data will be valuable for guiding the team's design decisions.

We reserve the right to sample data, but will start with 100% intake.
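
One way a sampling decision could be expressed is sketched below. This page does not say exactly where the decision is made or how, so the function name `shouldSubmit` and the injectable `random` parameter are assumptions for illustration only.

```javascript
// Hypothetical sketch: decide whether one interaction's data is submitted.
// `sampleRate` is in [0, 1]; 1.0 means 100% intake.
// `random` is injectable so the decision can be tested deterministically.
function shouldSubmit(sampleRate, random = Math.random) {
  if (!(sampleRate >= 0 && sampleRate <= 1)) {
    throw new RangeError("sampleRate must be between 0 and 1");
  }
  // Math.random() returns a value in [0, 1), so a rate of 1.0 always
  // submits and a rate of 0 never does.
  return random() < sampleRate;
}

console.log(shouldSubmit(1.0)); // always true at 100% intake
```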

The client accessible API is: /wsapi/interaction_data

The API requires an HTTP POST with a CSRF token. The JSON document described above is the payload. The server returns a 200 on successful storage, and a non-200 otherwise. In the event of failure, the client may store the blob in localStorage and retry transmission at a later point.
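
The store-and-retry behaviour could look roughly like the sketch below. It is only an illustration: the storage key `kpi_pending` and all function names are hypothetical, the HTTP transport is left out, and the code assumes that any status other than 200 counts as a failure. The `storage` argument is any object with the `getItem`/`setItem` shape of `localStorage`; an in-memory stand-in is used here so the sketch runs outside a browser.

```javascript
// Hypothetical storage key; not taken from the BrowserID source.
const QUEUE_KEY = "kpi_pending";

// In-memory stand-in with the same getItem/setItem shape as localStorage.
function makeMemoryStorage() {
  const map = new Map();
  return {
    getItem: (key) => (map.has(key) ? map.get(key) : null),
    setItem: (key, value) => { map.set(key, String(value)); },
  };
}

// Read the list of blobs waiting to be retried.
function loadQueue(storage) {
  return JSON.parse(storage.getItem(QUEUE_KEY) || "[]");
}

// Handle the server's answer for one blob: 200 means stored,
// anything else (assumed failure) queues the blob for a later retry.
function handleResponse(status, blob, storage) {
  if (status === 200) return true;
  const queue = loadQueue(storage);
  queue.push(blob);
  storage.setItem(QUEUE_KEY, JSON.stringify(queue));
  return false;
}

// Remove and return all queued blobs so the caller can try sending them again.
function takeQueued(storage) {
  const queue = loadQueue(storage);
  storage.setItem(QUEUE_KEY, "[]");
  return queue;
}

// Usage: one failed submission is queued, then drained for retry.
const demoStorage = makeMemoryStorage();
handleResponse(503, { language: "en_US" }, demoStorage);
console.log(takeQueued(demoStorage).length);
```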

Data Storage Component

Persona Sign-In webheads (Server Component) serve as simple forwarders to this component, whose primary purpose is to store the data. The Server Component does some input validation and then POSTs the data to a small number of servers (the Data Storage Component) that store it. These servers expose a similar API for receipt of the data as the Server Component. Additionally, these servers have APIs to allow read access to the data based on a date range, supporting streaming or pagination as desired.
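
To make the date-range read access concrete, the following sketch filters and paginates records held in memory. The real storage API is not specified on this page, so the function name, the half-open range `[startMs, endMs)`, and the `offset`/`limit` parameters are all assumptions.

```javascript
// Hypothetical sketch: return one page of records whose timestamp falls
// in the half-open range [startMs, endMs), ordered by timestamp.
function queryByDateRange(records, startMs, endMs, offset = 0, limit = 100) {
  // filter() returns a new array, so sorting does not mutate the input.
  const matches = records
    .filter((r) => r.timestamp >= startMs && r.timestamp < endMs)
    .sort((a, b) => a.timestamp - b.timestamp);
  const page = matches.slice(offset, offset + limit);
  const consumed = offset + page.length;
  // nextOffset is null when there are no further pages.
  const nextOffset = consumed < matches.length ? consumed : null;
  return { page, nextOffset };
}

// Usage: fetch the first page of two records from a small sample set.
const sampleRecords = [
  { timestamp: 3000 }, { timestamp: 1000 }, { timestamp: 2000 }, { timestamp: 9000 },
];
const firstPage = queryByDateRange(sampleRecords, 1000, 5000, 0, 2);
console.log(firstPage.page.length, firstPage.nextOffset);
```

A caller would keep requesting pages with the returned `nextOffset` until it is `null`.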

Access to this data may be highly restricted initially, with a goal of opening up access as much as is feasible to allow for transparency, community involvement, and a high level of decoupling between the systems that store and the systems that analyze the data to answer meaningful questions about project health and usability.

Stored Data:

What | Where
JSON file | Transmitted to a CouchDB database (this Server Component) behind our firewall, once each time a user uses the BrowserID dialog

Communication with Server Component

Direction | Field | Data Sample | Notes
In: | timestamp | 13330461000000 | Unix timestamp rounded to 10-minute intervals
 | event_stream | [ picker, 732 ], [ picker::change, 1700 ], [ picker::signin, 2300 ] | A list of UI Events and time offsets in milliseconds
 | email_type | secondary | Which type of email was used? Primary or Secondary.
 | number_emails | 3 | The number of emails a user has associated in their account
 | new_account | false | Is the user brand new in the last 24 hours?
 | language | en_US | i18n language code
 | number_sites_logged_in | 1 | Number of websites which user used BrowserID on in last 24 hours
 | screen_size | { width: 640, height: 480 } | Device screen size
 | sample_rate | 1.0 | Controls sampling of data submission client side
 | user_agent | { os: iOS, browser: Safari, version: 5.1 } | Coarse-grained user agent (not the same as User Agent string)
Out: | | | Same data available behind our firewall
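
The timestamp coarsening noted in the table could be done as in the sketch below. The function name is hypothetical, and rounding down to the start of the interval is an assumption; the page only says the value is rounded to 10-minute intervals.

```javascript
const TEN_MINUTES_MS = 10 * 60 * 1000;

// Hypothetical sketch: round a millisecond Unix timestamp down to the
// start of its 10-minute interval (rounding down is an assumption).
function coarsenTimestamp(ms, intervalMs = TEN_MINUTES_MS) {
  return Math.floor(ms / intervalMs) * intervalMs;
}

console.log(coarsenTimestamp(1333046104322)); // prints 1333045800000
```

All interactions that start within the same 10-minute window share one timestamp value, which is what reduces traceability.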

Data Glossary for deeper details of each Data element.

User Data Risk Minimization

In this section, the privacy champion will identify areas of user data risk and recommendations for minimizing the risk.

Data Access

It's an open question who will have access to this data and for what purposes. While users may be willing to let us collect this data to monitor the quality of service and improve it, they may not be okay with us publishing the data in its raw form.

The Risk is that the scope of data sharing and use will be unclear and we will creep from allowing unlimited access to a core group of engineers to allowing anyone to see and use the raw data.

Requirement: Work with legal/privacy policy folks to ensure the PersonaID privacy policy documents this data collection and our intended use of the data. Make sure our data collection practices are disclosed publicly and as clearly as possible.

Requirement: ensure proper access controls are in place to limit access to the stored data. List the employed controls in this resolution.

(ozten) We've worked with the user data safety committee to limit the scope of the data and air our plan publicly. We've worked with the legal team to revise the privacy policy and terms of use for Persona Beta. We plan on deploying this behind Corporate LDAP to limit access to current employees.

Resolution:
[RESOLVED] will be deployed behind corporate LDAP

Alignment with Privacy Operating Principles

In this section, the privacy champion will identify how the feature lines up with Mozilla's privacy operating principles.

See Also: Privacy/Roadmap_2011#Operating_Principles:

Principle: Transparency / No Surprises

This feature is invisible to users, so it may not be obvious to them that we know more than their email addresses.

The Risk is that users may not know we are collecting this data.

Recommendations: document in the relevant privacy policies and wherever disclosures happen (e.g., on enrollment) that we collect non-personal statistics about how the system is used (and why we collect it).

Resolution:
[RESOLVED] updated privacy policy for Persona Beta to cover this

Principle: Real Choice

Can users opt out of this data collection?

Recommendations: Provide a way for users to opt-out of this data collection for their PersonaID profile.

(ozten) While working with the user data safety group, we worked hard with UX to try to find a way to provide an opt-out. We were unable to find a solution. We're collecting less data than the web analytics packages used on many Mozilla web properties; these packages also offer no opt-out mechanism.

Resolution:
[DONE] Todo: enable opt-out via http cookie or similar mechanism Issue #2412

Much of the work and design that has gone into the BrowserID protocol is to provide an "Opt Out" at a fundamental level; users can use the decentralized mode and no KPI data is sent.

KPI data is only sent from our shim.

We cannot provide a quality shim nor fallback IdP without this KPI data.

This is a short-term need to get analytics for quality of service. As this grows and as the BrowserID spec is adopted by others, this will become unnecessary and in fact will not exist, because we won't be the provider. This is a short-term need to make sure this works as designed at scale.

Principle: Sensible Defaults

This feature collects reasonably innocuous data, so the risk of collecting it by default for all PersonaID interactions is fairly minimal.

The Risk is that this data will be obtained by third parties and used in correlation with other data sets, including the server logs for the rest of PersonaID interactions. If we store all the data with the same credentials (or in the same system), a security breach could result in more valuable data than a breach that only obtains one of the server log or KPI log.

Recommendations: Ensure that this data is always kept on separate systems from the rest of PersonaID logs to minimize the effect of a data breach.

(ozten) Excellent. We'll make sure that happens.

Resolution:
[ON TRACK] bug 773407 deployment to separate KPI logs and other PersonaID logs

Principle: Limited Data

User traffic on the system is potentially logged more than necessary. We should ensure that this data collection only happens at appropriately spaced intervals (not hundreds of times per login, for example), that we are collecting the minimum amount of data required, and that it is retained for as short a term as possible.

The Risk is that we may end up with too much data that never gets used for the clear value proposed to users of the system.

Requirement: Work with Security Assurance and our Legal/Privacy Policy folks to minimize logging, minimize retention window, deploy a secure data storage infrastructure, and document and publish a data collection and retention policy.

(ozten) Great, we'd love to make sure this happens. We have several bugs open with these teams. What are next steps on these fronts as this fits with our current plans?

Resolution:
[ON TRACK] Using bug 742796 and bug 746245 to track where this work is being done

[NEW] (ozten) It's not clear to me who makes this happen. Will delegate to Tauni to find out who "project team" is and what needs to happen to get these links here.

Follow-up Tasks and tracking

What | Who | Bug | Details
[DONE] Call for Comments | | | 2012.07.19
[NEW] File bug and implement an opt-out. | Project team | Issue #2412 |
[NEW] Add links to wiki page where requested | Project team | |