Data sources and processes

Tidelift delivers a continuously curated stream of human-researched and maintainer-verified data on open source packages and their licenses, releases, vulnerabilities, and development practices. This article provides technical details about these data types and how they compare to public data sources.

Automated, structured, and centralized data via Libraries.io, NIST, and other sources

Tidelift uses the Libraries.io open source project to scan data from upstream package manager ecosystems and from upstream source repositories. This data is easily accessible in one centralized Tidelift location, saving customers the time and resources required to find key information on public open source packages. 

This scraped information includes:

  • Lists of releases and release dates from the package manager
  • Upstream license information from the package manager and/or source repository
  • Upstream source repository location
  • Per-release dependencies, as specified in package manager metadata
  • Source repository maintenance information from GitHub (last commit date, contributions, issues, and pull requests over the past year)

Tidelift also pulls vulnerability data straight from the NIST National Vulnerability Database (NVD), and scorecard information from OpenSSF Scorecard.
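As an illustration of what this vulnerability data looks like, the sketch below parses a record modeled loosely on the NVD API 2.0 JSON response shape and pulls out CVE identifiers and severities. The sample record and the `summarize` helper are illustrative assumptions, not Tidelift's actual ingestion code.

```python
import json

# Simplified record modeled on the shape of an NVD API 2.0 response
# (illustrative sample only; real responses carry many more fields).
sample = json.loads("""
{
  "vulnerabilities": [
    {
      "cve": {
        "id": "CVE-2021-44228",
        "metrics": {
          "cvssMetricV31": [
            {"cvssData": {"baseScore": 10.0, "baseSeverity": "CRITICAL"}}
          ]
        }
      }
    }
  ]
}
""")

def summarize(feed):
    """Extract (CVE id, severity) pairs from an NVD-style feed."""
    out = []
    for item in feed.get("vulnerabilities", []):
        cve = item["cve"]
        metrics = cve.get("metrics", {}).get("cvssMetricV31", [])
        severity = metrics[0]["cvssData"]["baseSeverity"] if metrics else "UNKNOWN"
        out.append((cve["id"], severity))
    return out

print(summarize(sample))  # [('CVE-2021-44228', 'CRITICAL')]
```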

Tidelift human-researched data

Tidelift researches data on open source software when it cannot be scraped reliably.

Normalized data

When license information is unclear or nonstandard upstream, Tidelift works to normalize it. When license information is missing, Tidelift researches it so that our customers can have confidence. Tidelift has normalized and researched licenses for over one million software releases, and this data is only available with a Tidelift Subscription.

Mapped data

Tidelift researches new vulnerabilities as they are described by NIST, maps each vulnerability to the specific affected packages, and identifies which releases of each package are and are not vulnerable.
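The core of this mapping step can be sketched as a version-range check: given the range of releases a vulnerability affects, classify each known release as vulnerable or not. The dotted-numeric version parser and the example range below are simplifying assumptions; real ecosystems require scheme-aware version parsing.

```python
def parse_version(v):
    # Naive dotted-numeric parser; real ecosystems (semver, PEP 440, etc.)
    # need scheme-aware parsing.
    return tuple(int(p) for p in v.split("."))

def affected(release, introduced, fixed):
    """True if `release` falls in the vulnerable range [introduced, fixed)."""
    r = parse_version(release)
    return parse_version(introduced) <= r < parse_version(fixed)

# Hypothetical mapping: a vulnerability affecting releases >=1.2.0 and <1.4.2.
releases = ["1.1.9", "1.2.0", "1.3.5", "1.4.2"]
status = {v: affected(v, "1.2.0", "1.4.2") for v in releases}
print(status)  # {'1.1.9': False, '1.2.0': True, '1.3.5': True, '1.4.2': False}
```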

Analyzed and researched data

Tidelift uses the raw scraped data, analyzes it for patterns, and performs research to provide conclusions to our customers.

For example

Tidelift analyzes the contribution statistics that have been scraped to determine whether a package might be unmaintained. If it appears unmaintained, then Tidelift researches a number of criteria (maintainer activity elsewhere, documentation and repository markers, public statements) to determine whether the package is actually unmaintained, and makes that information available to Tidelift subscribers.
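A minimal sketch of the first, automated half of that process might look like the heuristic below: flag a package for human review when its scraped activity statistics look stale. The thresholds and field names here are hypothetical, not Tidelift's actual criteria.

```python
from datetime import date

def looks_unmaintained(last_commit, open_prs_past_year, commits_past_year,
                       today=None, stale_after_days=365):
    """Flag a package for human review using hypothetical thresholds.

    These criteria are illustrative only; the actual determination
    involves human research (maintainer activity elsewhere,
    repository markers, public statements)."""
    today = today or date.today()
    stale = (today - last_commit).days > stale_after_days
    # No commits despite open pull requests suggests nobody is merging.
    inactive = commits_past_year == 0 and open_prs_past_year > 0
    return stale or inactive

print(looks_unmaintained(date(2020, 1, 1), open_prs_past_year=12,
                         commits_past_year=0, today=date(2024, 1, 1)))  # True
```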

Tidelift then uses this information to analyze releases, assessing each one for suitability against a number of criteria beyond vulnerabilities alone.

Tidelift then combines this information with information on the releases’ dependencies, to determine whether any of those dependencies have any of these issues. Tidelift consolidates this into a recommendation field that lets you know whether using a release will bring any issues into your environment, either directly or indirectly through transitive dependencies. If it will, Tidelift tells you what those issues are.
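The consolidation described above can be sketched as a dependency-graph walk that collects known issues from both direct and transitive dependencies. The graph, the issue labels, and the shape of the recommendation result below are all illustrative assumptions.

```python
# Hypothetical dependency graph and per-release issue lists (illustrative data).
deps = {
    "app-lib 2.0": ["parser 1.3", "logger 0.9"],
    "parser 1.3": ["logger 0.9"],
    "logger 0.9": [],
}
issues = {
    "app-lib 2.0": [],
    "parser 1.3": [],
    "logger 0.9": ["unmaintained"],
}

def recommendation(release):
    """Walk direct and transitive dependencies, collecting any known issues."""
    found, seen, stack = {}, set(), [release]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if issues.get(node):
            found[node] = issues[node]
        stack.extend(deps.get(node, []))
    return {"recommended": not found, "issues": found}

print(recommendation("app-lib 2.0"))
# {'recommended': False, 'issues': {'logger 0.9': ['unmaintained']}}
```

Here the top-level release has no issues itself, but the walk surfaces an unmaintained transitive dependency and reports it alongside the negative recommendation.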

Tidelift also provides further analyzed quality checks relating to security, development practices, and long-term outlook.

For more details, see How Tidelift evaluates packages and How Tidelift evaluates releases.

All normalized, analyzed, and researched data is only available with a Tidelift subscription.

First party maintainer data

Tidelift works directly with open source maintainers to get expert information on the packages they maintain, including their development practices, and issues that affect the packages. Tidelift also pays those maintainers to improve their packages’ development practices and security posture.

Among the data Tidelift provides directly from maintainers:

  • Reviews of who has publishing rights on upstream package managers, to ensure that only those who should push releases can
  • Assertion of multi-factor authentication for both contributing code and publishing releases
  • Detailed recommendations on vulnerability handling, including:
    • Available workarounds
    • Specific affected methods and access patterns (such as whether it affects usage in development and testing, or only production)
    • Whether issues are false positives, and why

This information from maintainers is only available with a Tidelift subscription.

How often is this data collected and refreshed?

Tidelift pulls information directly from upstream package managers (such as PyPI or NPM), source code hosting platforms (such as GitHub), and vulnerability trackers (the NIST National Vulnerability Database).

For our supported ecosystems, Tidelift subscribes to a feed of new releases to pull information on new packages and new releases of existing packages as they are published. Additionally, we re-check existing packages for any metadata updates over the course of a two-week cycle, and directly when there is a significant change (such as a new vulnerability disclosed).

What steps are used to assess data quality / how does Tidelift verify authenticity of the data?

As Tidelift pulls information, it checks it for completeness and agreement. If, for example, there is a discrepancy in licensing information between a package’s record in the upstream package manager and the upstream source code repository, Tidelift researchers perform a review. Tidelift also performs a review when specific data points are missing (such as licensing information, or upstream source location).
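A simplified version of that agreement check can be sketched as normalizing each source's license string toward an SPDX-style identifier and flagging the package for review when the sources disagree or data is missing. The alias table and helper names below are hypothetical, not Tidelift's actual normalization logic.

```python
def normalize(license_id):
    # Crude normalization toward SPDX-style identifiers
    # (illustrative aliases only; a real table is far larger).
    aliases = {"apache 2.0": "Apache-2.0", "apache-2.0": "Apache-2.0",
               "mit": "MIT", "mit license": "MIT"}
    return aliases.get((license_id or "").strip().lower())

def needs_review(pkg_manager_license, repo_license):
    """Queue for human review when sources disagree or data is missing."""
    a, b = normalize(pkg_manager_license), normalize(repo_license)
    return a is None or b is None or a != b

print(needs_review("MIT License", "Apache 2.0"))  # True: sources disagree
print(needs_review("mit", "MIT License"))         # False: both normalize to MIT
```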

Tidelift also performs out-of-band checks directly with our upstream data sources to ensure we have correct and up-to-date information.

What is the process for requesting insights on new packages?

All packages in our supported ecosystems that our customers either:

  • Include in a manifest that they are evaluating
  • Request information for via our APIs

are queued for evaluation by Tidelift. The automated parts of this evaluation are done under an SLA of 40 minutes. Manual research can take up to 3 business days, depending on the amount required.


For a detailed description of the data available, see our Tidelift API documentation.
