model_config.yaml contains Orion's configuration parameters. Its purpose is to increase Orion's transparency and give you an easy way to modify its data pipeline.
model_config.yaml does not contain API keys, database configurations or other sensitive information. These are kept in a .env file and are loaded as environment variables.
Example of a
We explain the contents of model_config.yaml using the example below.
Naming the database
Creates a database named misinformation_deployment and instantiates the required tables.
Querying Microsoft Academic Knowledge API
Orion creates a composite query that will retrieve papers published (entity_name) in bioRxiv or medRxiv (query_values). It will only collect papers with a DOI (with_doi) that were published between 01-01-2000 (mag_start_date) and today, whenever that is (mag_end_date). Orion will query each year separately, split it into two periods (intervals_in_a_year) and, for every paper, collect its id, title and source (
How to query MAG with a conference name?
entity_name. For example, the snippet below will collect papers presented at the ACL.
The journal, conference and field of study names must match the Microsoft Academic Graph format. To find the right format, start typing your query in Microsoft Academic and it will suggest the appropriate format.
How to query MAG with a field of study?
entity_name. For example, the snippet below will collect papers with machine learning or deep learning as a field of study.
query_values can take multiple inputs, as in the first example. Note that this works as an OR query: Orion will retrieve papers published in bioRxiv, in medRxiv, or in both.
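The query behaviour described in this section can be sketched in Python. This is a minimal illustration, not Orion's actual code: the helper names `year_intervals` and `build_expression` are hypothetical, the `J.JN` attribute and the `Or`/`And`/`Composite`/`D=[...]` expression form are assumed from the Academic Knowledge API query syntax, and the `with_doi` filter is left out.

```python
from datetime import date, timedelta


def year_intervals(year, intervals_in_a_year):
    """Split a calendar year into contiguous (start, end) date ranges."""
    start = date(year, 1, 1)
    total_days = (date(year, 12, 31) - start).days + 1
    # Day offsets at which each period begins, plus the end of the year.
    bounds = [
        i * total_days // intervals_in_a_year
        for i in range(intervals_in_a_year + 1)
    ]
    return [
        (start + timedelta(days=lo), start + timedelta(days=hi - 1))
        for lo, hi in zip(bounds, bounds[1:])
    ]


def build_expression(entity_name, query_values, start, end):
    """Join the query values with OR, then restrict by publication date."""
    clauses = [f"Composite({entity_name}=='{v}')" for v in query_values]
    values = clauses[0] if len(clauses) == 1 else f"Or({', '.join(clauses)})"
    return f"And({values}, D=['{start.isoformat()}', '{end.isoformat()}'])"


# One expression per period of the year 2000 (assumed attribute: J.JN).
periods = year_intervals(2000, 2)
expressions = [
    build_expression("J.JN", ["biorxiv", "medrxiv"], s, e) for s, e in periods
]
```

Splitting each year into several periods keeps every request small, which is one plausible reason for exposing intervals_in_a_year as a parameter.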
Collecting World Bank indicators
Collects data on the NY.GDP.MKTP.CD World Bank indicator (
indicators) for every country (
country) and year till 2019 (
end_year). It stores the data in a table named
table_names). You can collect multiple indicators by extending the
You have to create an ORM and create the table for each new indicator.
By default, Orion fetches the following indicators and creates their corresponding tables:
Creating and populating the Elasticsearch index
Erases the existing Elasticsearch index when erase_index is set to True. Set it to False if you want to add documents to the existing index.
Transforming text to vectors with Sentence Transformers
Uses a pretrained sentence-DistilBERT (bert_model) to encode batches of 1,000 documents (batch_size). You can choose any transformer model from this list.
Reduces the dimensionality of the text embeddings with UMAP. You can provide a list of paper IDs to exclude before fitting the model. Leave the list empty if you want to fit UMAP on all the documents (recommended for the first run).
Transforming text to vectors with TFIDF and SVD
Vectorises documents with TFIDF and reduces their dimensionality with Singular Value Decomposition. You can change the length of the TFIDF and SVD vector by modifying the
n_components parameters respectively.
Creating a country collaboration network
Considers academic documents published after 2013 (
Selecting topics for the research indicators and the visualisations
Orion leverages MAG's Field of Study taxonomy to create a set of topics that are granular enough to make meaningful comparisons and broad enough to capture the diversity of the research topics in the data. In the snippet above, it considers only the level one topics and the Fields of Study with a frequency in the 75th percentile.
You can also select topics from multiple levels and percentiles. For example, this snippet would consider the levels one and three and the Fields of Study with a frequency in the 75th and the 50th percentiles respectively.
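The selection by level and percentile can be sketched as follows. This is an illustration under stated assumptions rather than Orion's implementation: `select_topics` and `percentile_threshold` are hypothetical helpers, the percentile uses the nearest-rank method, "in the 75th percentile" is read as "at or above the 75th percentile cut-off", and the counts are made up.

```python
import math


def percentile_threshold(counts, percentile):
    """Nearest-rank percentile of a list of counts."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def select_topics(fields_of_study, levels, percentiles):
    """Keep fields of study at the given levels whose frequency is at or
    above the percentile cut-off paired with that level."""
    selected = []
    for level, percentile in zip(levels, percentiles):
        at_level = {
            name: count
            for name, (lvl, count) in fields_of_study.items()
            if lvl == level
        }
        if not at_level:
            continue
        cutoff = percentile_threshold(list(at_level.values()), percentile)
        selected += sorted(n for n, c in at_level.items() if c >= cutoff)
    return selected


# Hypothetical data: name -> (MAG level, number of papers tagged with it).
fields_of_study = {
    "Machine learning": (1, 400),
    "Virology": (1, 250),
    "Statistics": (1, 90),
    "Epidemiology": (1, 30),
    "Transfer learning": (3, 40),
    "Word embedding": (3, 12),
    "Sentiment analysis": (3, 5),
}

# Levels one and three, with the 75th and 50th percentiles respectively.
topics = select_topics(fields_of_study, levels=[1, 3], percentiles=[75, 50])
```

Pairing each level with its own percentile lets a deeper, sparser level use a lower cut-off than a broad one.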
Removing low accuracy name-to-gender matches
Considers name-to-gender matches with an accuracy higher than 0.75 (threshold). We advise keeping this threshold relatively high and monitoring how it affects the metrics.
Selecting research indicators parameters
Considers academic documents published after 2013 (year) and filters countries prior to calculating the research indicators:
- Research specialisation: Filters countries with fewer than 10 publications in a year (
- Gender diversity: Filters countries with fewer than 10 publications in a year (
- Research diversity: Filters countries with fewer than 15 academic documents containing the field of study that the indicator is measured for (
Use thresholds to remove any countries with a low number of academic documents to avoid producing misleading indicators.
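The effect of a publication threshold can be illustrated with a short Python sketch. It assumes papers are available as (country, publication year) pairs; the function name `filter_country_years` and the sample data are hypothetical and not taken from Orion.

```python
from collections import Counter


def filter_country_years(papers, year, threshold):
    """Count papers per (country, publication year) for papers published
    after `year`, and keep only the pairs that reach `threshold`."""
    counts = Counter(
        (country, pub_year) for country, pub_year in papers if pub_year > year
    )
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical input: one (country, publication year) pair per paper.
papers = [("UK", 2015)] * 12 + [("UK", 2012)] * 20 + [("Malta", 2015)] * 3

kept = filter_country_years(papers, year=2013, threshold=10)
```

Here the 2012 papers are dropped by the year filter, and Malta is dropped because three papers in a year are too few to support a reliable indicator.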
Creating S3 buckets
Creates four S3 buckets that are used in Orion's pipeline:
- mag-data-bucket contains the raw response from MAG API.
- names-batches contains batched author names that Orion queries GenderAPI with.
- document-vectors contains the Faiss index that is used in the search engine. When using the TFIDF+SVD approach to vectorise documents, Orion stores the TFIDF vectors in this bucket too.
- mag-topics contains the filtered topics that are used to produce the research indicators and the visualisations.
You can rename these buckets.
Creating prefixes for output files
Adds a prefix to files before storing them on S3. These are intermediate outputs used in downstream tasks. You don't need to change them.
Inferring authors' gender
Orion creates four batches (parallel_tasks) with 20,000 author names each (batch_size) before sending them to GenderAPI, a name-to-gender inference service. If you are running Orion for a new data collection, set batch_size high enough that the batches cover all author names. Otherwise, some names will not be sent to GenderAPI. This is not a problem: Orion's ETL will label the remaining names once you rerun that part of the pipeline.
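The relationship between parallel_tasks, batch_size and the names that are left over can be sketched as below. This is a simplified illustration; `make_batches` is a hypothetical helper and does not reproduce Orion's ETL.

```python
def make_batches(names, parallel_tasks, batch_size):
    """Split names into at most `parallel_tasks` batches of `batch_size`.

    Returns the non-empty batches and the names that did not fit in any
    batch, which would only be labelled on a later run.
    """
    batches = [
        names[i * batch_size:(i + 1) * batch_size]
        for i in range(parallel_tasks)
    ]
    leftover = names[parallel_tasks * batch_size:]
    return [batch for batch in batches if batch], leftover


# Ten hypothetical author names, four tasks, two names per batch.
names = [f"author_{i}" for i in range(10)]
batches, leftover = make_batches(names, parallel_tasks=4, batch_size=2)
```

With these numbers only eight of the ten names are batched, so the last two stay unlabelled until the step runs again or batch_size is raised.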
Consider how many cores your machine has before choosing the number of
Mapping country names
Homogenises country names between the Google Places API and the World Bank (
google2wb) and the Google Places API and restcountries API (
model_config.yaml contains the full mapping and you do not have to make any changes.
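As a rough illustration of what such a mapping does, the sketch below renames a country from one source's convention to another and leaves unmapped names unchanged. The three entries are illustrative examples only and are not copied from Orion's model_config.yaml; `homogenise` is a hypothetical helper.

```python
# Illustrative entries only: Google Places name -> World Bank name.
google2wb = {
    "Russia": "Russian Federation",
    "South Korea": "Korea, Rep.",
    "Egypt": "Egypt, Arab Rep.",
}


def homogenise(country, mapping):
    """Return the mapped country name, or the input if no mapping exists."""
    return mapping.get(country, country)


renamed = [homogenise(c, google2wb) for c in ["Russia", "France"]]
```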
Identifying non-industry affiliations
Uses a hand-crafted list of terms or names to tag an author affiliation as non-industry or industry. You can find and modify the full list in Orion's
Identifying open access publications
Uses a hand-crafted list of terms or journal names to tag an academic publication as open access. You can find and modify the full list in Orion's