IMDb Non-Commercial Datasets
from: https://developer.imdb.com/non-commercial-datasets/
Official doc
Those are the 7 tables.
title.akas.tsv.gz
- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title. Those have more than 150 titles:
- tt0088814
- tt0168366
- tt0407304
- tt1077274
- tt15837206
- tt0099785
- tt2872750
- region (string) - the region for this version of the title
- language (string) - the language of the title. Eg 5M of jp/JP, 4.8M of hi/IN, fr/FR, es/ES, 51K of cmn/CN etc.
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following:
- “alternative”
- “dvd”
- “festival”
- “tv”
- “video”
- “working”
- “original”
- “imdbDisplay”
- attributes (array) - Additional terms to describe this alternative title, not enumerated. One or none of the following:
- transliterated title
- alternative spelling
- new title
- literal English title
- complete title
- literal title
- short title
- series title
- isOriginalTitle (boolean) – 0: not original title; 1: original title. Of the 52M titles, 12M are original title, and 6M have 0 or 1 variant title.
title.basics.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title
title.crew.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- directors (array of nconsts) - director(s) of the given title
- writers (array of nconsts) – writer(s) of the given title
title.episode.tsv.gz
- tconst (string) - alphanumeric identifier of episode
- parentTconst (string) - alphanumeric identifier of the parent TV Series
- seasonNumber (integer) – season number the episode belongs to
- episodeNumber (integer) – episode number of the tconst in the TV series
title.principals.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else ‘\N’
- characters (string) - the name of the character played if applicable, else ‘\N’
title.ratings.tsv.gz
- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received
name.basics.tsv.gz
- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string)– name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else ‘\N’
- primaryProfession (array of strings)– the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
Insert Order
title_basics
- Reason: This is the root table. Many other tables depend on it via foreign keys (
tconst
). It must be populated first. - Dependencies: None (it’s the foundation).
- Tables that depend on it:
title_akas
title_crew
title_episode
title_principals
title_ratings
- Reason: This is the root table. Many other tables depend on it via foreign keys (
name_basics
- Reason:
title_principals
depends on it via thenconst
foreign key. - Dependencies: None (except the implicit dependency that
tconst
inknown_for_titles
should exist intitle_basics
, but this is not a strict foreign key constraint). - Tables that depend on it:
title_principals
- Reason:
title_crew
- Reason: Depends on
title_basics
via thetconst
foreign key. - Dependencies:
title_basics
- Tables that depend on it: None
- Reason: Depends on
title_episode
- Reason: Depends on
title_basics
via bothtconst
andparent_tconst
foreign keys. - Dependencies:
title_basics
- Tables that depend on it: None
- Reason: Depends on
title_ratings
- Reason: Depends on
title_basics
via thetconst
foreign key. - Dependencies:
title_basics
- Tables that depend on it: None
- Reason: Depends on
title_principals
- Reason: Depends on both
title_basics
(viatconst
) andname_basics
(vianconst
). - Dependencies:
title_basics
name_basics
- Tables that depend on it: None
- Reason: Depends on both
title_akas
- Reason: Depends on
title_basics
via thetitle_id
foreign key. - Dependencies:
title_basics
- Tables that depend on it: None
- Reason: Depends on
Detailed Instructions
1 | gzip -d name.basics.tsv.gz \ |
replace \N characters with empty values in order to import them as NULLs, according to a blog post from metis
1 | sed 's/\\N//g' name.basics.tsv > name.basics.tsv2 \ |
In psql:
1 | psql -d postgres |
Last, copy data:
1 | \copy imdb.title_basics FROM '/database/title.basics.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER; |
If use docker on Mac:
1 | docker volume create imdb_data |
Now you’ve created Postgres with username/password imdb/1234
, and db name imdb_db
Import data dump into Postgres
Seems like there’s slight issue with the genres of title_basics, thus need to convert the comma-separated list in the last field to a PostgreSQL array format
Run this:
1 | awk -F'\t' -v OFS='\t' '{if ($9 != "") $9 = "{"$9"}"; print}' title.basics.tsv2 > title.basics.tsv3 |
Great it works!
2nd Table
1 | awk -F'\t' -v OFS='\t' '{if ($5 != "") $5 = "{"$5"}"; if ($6 != "") $6 = "{"$6"}"; print}' name.basics.tsv2 > name.basics.tsv3 |
Small problem with this person, thus line 938151 is empty, we need to remove it.
1 | sed -n '938151p' name.basics.tsv3 # Read this line |
After this I somehow I need to run
1 | insert into imdb.name_basics values ('nm17199025', 'Kay Masten') |
to avoid a fk_title_principals_name_basics error (in later stage).
Alternatively: ALTER TABLE imdb.name_basics ALTER COLUMN primary_name DROP NOT NULL;
does the trick well!
3nd to 7th Tables
1 | docker exec -i imdb-postgres psql -U imdb -d imdb_db -c "\copy imdb.title_crew FROM '/data/imdb/title.crew.tsv2' DELIMITER E'\t' QUOTE E'\b' CSV HEADER;" |
SQL
1 | -- Schema: imdb |
Updated schema (live data)
Type changes:
directors VARCHAR(20)[]
writers VARCHAR(20)[]
Remove checks:
CHECK (season_number > 0),
CHECK (episode_number > 0)
CHECK (birth_year > 1800),
CHECK (death_year >= birth_year)
Remove some unavailable titles from principals
table.
1 | DELETE FROM imdb.title_principals |
Updated Table as of Apr 20th, 2025:
1 | -- imdb.name_basics definition |
Below 3291 people are present in title_principals, but missing from name_basics.
That’s why I have to drop the fk_title_principals_name_basics
constraint.
1 | SELECT tp.nconst , COUNT(*) as count |
Also, need to convert text format of directors, writers column of title_crew
to list format.