Delivering AI on Code

source{d} builds the open-source components that enable large-scale code analysis and machine learning on source code.

Our powerful tools can ingest all of the world’s public git repositories, turning code into ASTs ready for machine learning and other analyses, all exposed through a flexible and friendly API.

We are paving the way for the future of the software development life cycle, where code becomes analyzable data powering the next generation of developer tools.

Examples

engine

The source{d} engine is a unified, scalable code analysis pipeline running on Apache Spark™, available through a friendly and flexible API: a single entry point to all of our tools.

It crawls, retrieves, stores, filters, and gives access to anything from a single repository to all of the world’s public git repositories, identifies their languages, and generates from the source code a dataset of universal ASTs ready to be analyzed or fed into machine learning tools and models.

Excited to try it? Request a live demo.

# import the source{d} engine
from sourced.spark import API as Engine
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# start a new session
spark = SparkSession.builder \
        .master("local[*]").appName("Examples") \
        .getOrCreate()

engine = Engine(spark, "/repositories")

# get identifiers of all Python files
idents = engine.repositories.filter("is_fork = false") \
         .references \
         .head_ref.commits.first_reference_commit \
         .files \
         .classify_languages() \
         .extract_uasts() \
         .query_uast('//*[@roleIdentifier and not(@roleIncomplete)]') \
         .filter("is_binary = false") \
         .filter("lang = 'Python'") \
         .select("file_hash", "result").distinct()

# get and show the tokens from the identifiers
tokens = idents.extract_tokens()
tokens.limit(10).show()
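
The resulting DataFrames plug directly into Spark’s machine learning libraries. Below is a minimal sketch, not part of the engine’s documented API, that assumes extract_tokens() exposes each file’s tokens in an array-of-strings column named "tokens"; under that assumption, the identifiers can be turned into bag-of-words feature vectors with Spark ML:

# a minimal sketch (the "tokens" column name is an assumption): turn each
# file's identifier tokens into sparse bag-of-words feature vectors
from pyspark.ml.feature import CountVectorizer

vectorizer = CountVectorizer(inputCol="tokens", outputCol="features",
                             vocabSize=10000)
model = vectorizer.fit(tokens)
features = model.transform(tokens)

# the vectors are ready for any Spark ML estimator, e.g. clustering files
# by the identifiers they contain
features.select("file_hash", "features").show(5, truncate=False)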
go to project

Projects

Our roadmap

source{d} is building the tech stack for machine learning on source code (MLoSC). Through the powerful source{d} engine and our machine learning tools, code becomes a first-class analyzable asset, whether in a single repository or across tens of millions of them.

With access to every open source and public git repository online today, developers and organizations can understand their code as part of the complex web of dependencies it really belongs to. We envision every organization running a data pipeline over its software development life cycle, where source code becomes a unique, actionable dataset that can be analyzed and fed into machine learning models.
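
As a hedged illustration of what one step of such a pipeline might look like, the engine example above can be reused to compute simple portfolio-wide statistics, for instance counting files per detected language across all non-fork repositories (only the API calls already shown above are used; the "lang" column name comes from that same example):

# a minimal sketch reusing the API chain from the engine example above:
# simple development analytics, counting files per detected language
files = engine.repositories.filter("is_fork = false") \
        .references \
        .head_ref.commits.first_reference_commit \
        .files \
        .classify_languages()

files.groupBy("lang").count().orderBy("count", ascending=False).show()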

These are the building blocks for the next generation of impactful developer tools and systems that will change how we learn programming and how we write and review code.

What developers say


The folks at @srcd_ are doing neat deep learning on source code history; and of course it's in @docker containers. 🐳

Jérôme Petazzoni @jpetazzo

💭ing @srcd_ & @47deg are among the freaking best/coolest/opensource-devoted companies in Spain atm, they just don't say it, they do it 🙌🤜🤛

Anler @anler

"Our engineers dedicate at least 10% of their working hours towards any open source projects of their choosing." \m/

Josh Dzielak @dzello

source{d} are putting out so much OSS

Joe Nash @jna_sh

I think this is my favorite thing I've read all year: blog.sourced.tech/post/lapjv/

Eric Smith @omfgcoffee

Blog