Chapter 5: Social Media Intelligence¶

Learning Objectives¶

By the end of this chapter, you will be able to: - Apply systematic SOCMINT methodology across major platforms - Identify and cross-reference accounts belonging to the same individual - Analyze social networks to identify key relationships and influence patterns - Extract and verify geolocation data from social media content - Build automated social media monitoring workflows - Navigate platform API restrictions with appropriate alternatives - Apply AI tools to accelerate social media analysis

Social media has fundamentally changed the OSINT landscape by inducing people to voluntarily publish extraordinary quantities of personal information. The average social media user:

Posts photographs that may include geolocation metadata, identifiable locations in the background, and images of associates
Publishes biographical information — employment, education, relationships, interests
Documents life events — travel, purchases, milestones, conflicts
Reveals behavioral patterns — posting times, activity cycles, response patterns to events
Creates network graphs through follows, connections, and interactions

What was previously invisible — the social network, daily activities, opinions, relationships — is now publicly documented by its subjects. This is intelligence gold.

But social media data comes with significant complications: access restrictions, content manipulation, fake accounts, selective presentation, and the aggregation problem. Working effectively with social media requires understanding these limitations alongside its remarkable capabilities.

5.2 Platform-Specific Intelligence Methodology¶

Facebook¶

Facebook contains the most comprehensive social data for the largest number of users, but direct investigative access has been severely restricted since the Cambridge Analytica scandal triggered API restrictions in 2018-2019.

What is accessible: - Public profile pages (name, profile photo, cover photo, public posts) - Public pages and groups - Facebook Marketplace listings - Facebook Events (public) - Facebook Live streams (public) - Check-ins (when made public)

Investigative techniques:

Graph Search alternatives: Facebook's internal Graph Search was a powerful investigative tool that was largely disabled. Limited manual equivalents remain through URL manipulation and specific search queries.

Profile enumeration: Facebook user IDs in profile URLs can be used to construct direct profile links that bypass some search restrictions. Old Graph Search syntax still partially works for specific query types.

Photo analysis: Public profile photos and posts often contain geolocatable backgrounds, tagged individuals, and identifiable items. Reverse image search on profile photos can identify whether the same image appears elsewhere.

Group and page monitoring: Public groups and pages often contain rich information about their members' activities, associations, and opinions. Regular monitoring of relevant groups can yield significant intelligence.

Archived content: The Wayback Machine archives public Facebook profiles. Facebook content deleted from the live platform may persist in web archives, news articles that embedded the content, and third-party sharing sites.

Twitter/X¶

Twitter/X has been the most important real-time text intelligence source in OSINT, though API restrictions have significantly complicated programmatic access.

What is accessible: - Public tweets and replies - Public profile information - Follower/following lists (partially) - Media attachments on public posts - Spaces (public audio)

Investigative techniques:

Advanced search: Twitter's search operators support powerful querying: from:username, to:username, near:location, date ranges (since:2023-01-01 until:2023-06-01), language filtering, and keyword combinations.

Username history: People change usernames but their tweet history often reveals former usernames through auto-generated phrases ("@oldname retweeted...") and third-party archiving services.

Deleted tweet recovery: Specialized archiving services (Pushshift before it was shut down, Wayback Machine, various Twitter archive sites) preserve deleted tweets.

Geolocation from content: Tweets with location data enabled include coordinates. Tweets without GPS data can be geolocated from image metadata, visible landmarks, and context clues.

Network analysis: Follower/following relationships, retweet patterns, and reply patterns reveal social network structure and potentially undisclosed relationships.

API access: The Twitter API v2 has multiple access tiers. Free tier is extremely limited. Basic tier enables limited access. The research track for academic access was eliminated. Commercial access is expensive. Many previously available research capabilities have been removed.

Alternative access: Third-party Twitter search tools (Nitter instances, while they lasted), web scraping (subject to ToS), and the Twitter web interface provide partial alternatives for manual research.

LinkedIn¶

LinkedIn is the most reliable source of professional biographical data because professional reputation depends on accuracy. The platform is particularly valuable for corporate investigations, background checks, and due diligence.

What is accessible: - Public profiles (name, photo, headline, summary, experience, education, skills, endorsements) - Company pages (size, location, industry, employees) - Public posts and articles - Job listings (which reveal organizational structure and technology stack)

Investigative techniques:

Profile gap analysis: Employment gaps, inconsistencies between claimed tenure and company history, and unverifiable credentials are identifiable through profile analysis combined with public records verification.

Company analysis via employees: By analyzing the LinkedIn profiles of employees at a company, investigators can map organizational structure, identify key personnel, understand technology stacks (from skills sections), and track personnel changes.

Job posting analysis: A company's open job postings reveal technology decisions, expansion plans, financial health indicators, and organizational structure. Archived job postings (via LinkedIn History, Google cache, or job aggregators) show historical priorities.

Relationship mapping: Shared connections and recommendations reveal professional relationships.

Anti-scraping measures: LinkedIn aggressively resists automated collection. It employs sophisticated bot detection, legal action against scrapers, and profile view tracking that alerts subjects when they receive unusual view patterns from investigators.

Manual collection discipline: For LinkedIn, manual collection with thorough documentation is often preferable to automated approaches that risk account bans and are legally contested.

Instagram¶

Instagram's visual focus makes it particularly valuable for geolocation, lifestyle analysis, and identifying associates.

What is accessible (public accounts only): - Photos and videos with captions - Location tags on posts - Stories saved as Highlights - Follower/following counts and lists (partially)

Investigative techniques:

Geolocation from tagged locations: Instagram allows users to tag specific locations on posts. Aggregating tagged locations reveals patterns of activity, confirmed locations, and social venues.

Background analysis: Photographs often contain geolocatable elements in the background — street signs, storefronts, architectural features, terrain — even when no explicit geolocation tag is present.

EXIF data: Some Instagram uploads retain EXIF metadata, including GPS coordinates from the original photograph. Availability varies and Instagram strips metadata in many cases.

Reverse image search: Profile photos and posted images can be reverse-searched to find the same images elsewhere, revealing alternative accounts, dating profiles, or contextual information.

Story archiving: Instagram Stories disappear after 24 hours but may be preserved by third-party story-saving services or archiving at time of collection.

TikTok¶

TikTok's video-centric format and younger user demographic make it valuable for specific investigation types.

What is accessible: - Public video content - Comments on public videos - Profile information - Hashtag-based content discovery

Investigative techniques:

Video metadata analysis: Video upload timestamps, device information sometimes embedded in metadata, and editing artifact analysis.

Sound and audio analysis: Background sounds in TikTok videos can be geolocatable — recognizable languages, accents, ambient sounds, or identifiable locations.

Text in video analysis: OCR processing of text visible in videos can extract useful data.

Hashtag monitoring: TikTok hashtags aggregate content around topics, events, or communities that can be monitored systematically.

Telegram¶

Telegram's large public channels have made it a critical source for threat intelligence, extremism research, and monitoring of specific communities.

What is accessible: - Public channel content - Public group messages - Bot interactions - Public channel member lists (in some configurations)

Investigative techniques:

Channel monitoring: Using Telegram client applications or the Telegram API, public channels can be monitored and archived systematically.

Cross-channel analysis: Messages forwarded between channels reveal network relationships and coordination patterns.

Username-based pivot: Telegram usernames can be searched and linked to accounts on other platforms.

Bot enumeration: Public Telegram bots often reveal backend service information and operational patterns.

Legal and ethical caution: Many Telegram channels contain illegal content, extremist material, and harmful information. Accessing this content for legitimate intelligence purposes requires clear authorization, careful content management policies, and psychological support for analysts exposed to harmful material.

5.3 Account Discovery and Cross-Platform Correlation¶

Identifying that a known account on one platform belongs to the same individual as an account on another platform is a core OSINT capability.

Username-Based Discovery¶

Many people use consistent usernames across platforms. Once a username is identified on one platform, systematic search across other platforms often yields additional accounts.

Tools: - Sherlock: Python tool that searches 300+ websites for a given username - Maigret: Extended Sherlock alternative with richer results - WhatsMyName: Curated list of sites searchable by username with a web interface and CLI tool - Namecheckr: Commercial username search across major platforms

# Sherlock usage example
# Install: pip install sherlock-project
# CLI: sherlock username

# Python integration via subprocess
# Note: --json writes results to {username}.json, not stdout
import subprocess
import json
import os

def search_username(username):
    subprocess.run(
        ['sherlock', '--print-found', '--json', username],
        capture_output=True, text=True
    )
    # Sherlock writes JSON output to {username}.json in the working directory
    json_file = f'{username}.json'
    if os.path.exists(json_file):
        with open(json_file) as f:
            return json.load(f)
    return {}

findings = search_username('targetusername')
for platform, data in findings.items():
    if data['status']['status'] == 'Claimed':
        print(f"Found on {platform}: {data['url_user']}")

Email-Based Discovery¶

Email addresses are powerful pivots for account discovery:

HaveIBeenPwned API: Identifies data breaches containing a given email address
Hunter.io: Email validation and corporate email pattern discovery
EmailRep.io: Email reputation and associated account information
Holehe: Python tool that checks whether an email is registered on 120+ websites

# Holehe usage for email-to-account discovery
# Install: pip install holehe
# CLI: holehe email@example.com

# Python integration via subprocess (holehe's programmatic API changes frequently;
# the CLI is the most stable interface)
import subprocess
import re

def check_email_registrations(email):
    result = subprocess.run(
        ['holehe', '--only-used', email],
        capture_output=True, text=True
    )
    results = []
    for line in result.stdout.splitlines():
        # holehe marks found accounts with [+]
        if '[+]' in line:
            platform = re.sub(r'.*\[\+\]\s*', '', line).strip()
            results.append({'platform': platform, 'registered': True})
    return results

email_results = check_email_registrations('target@example.com')

Profile Photo Cross-Platform Matching¶

The same profile photo used across multiple platforms provides strong cross-platform identity linkage.

Reverse image search services: - Google Reverse Image Search: Best general-purpose coverage - TinEye: Largest photographic index, strong for tracking image reuse - Yandex Image Search: Particularly effective for face matching - Bing Visual Search: Alternative index with different coverage

Face-matching tools: PimEyes (commercial face search across indexed web images) and similar services provide facial similarity search. These have significant privacy and ethical implications — use requires legitimate purpose and authorization.

Behavioral Pattern Correlation¶

When direct identity linkage is unavailable, behavioral patterns can establish correlation:

Writing style analysis: Consistent vocabulary, grammatical patterns, error patterns, and stylistic choices persist across pseudonymous accounts
Activity timing patterns: Posting times correlated with timezone and daily schedule patterns
Topic pattern: Consistent interest patterns across accounts
Image analysis: Consistent photography style, locations appearing in images

These behavioral correlations produce probabilistic rather than definitive linkages and should be described with appropriate uncertainty.

Geolocating social media content — determining where a photograph or video was taken — is one of the most powerful and legally complex OSINT capabilities.

EXIF Metadata Analysis¶

Many photographs contain EXIF metadata including GPS coordinates from the device's location services. When this metadata is present (not stripped by the platform), it provides precise geolocation.

Tools: - ExifTool: The standard tool for EXIF metadata extraction - Jeffrey's Exif Viewer: Web-based EXIF viewer - Jimpl.com: Online image metadata viewer

# ExifTool examples
# Install: brew install exiftool (macOS) or apt install libimage-exiftool-perl

# Extract all EXIF data
exiftool image.jpg

# Extract GPS coordinates only
exiftool -GPSLatitude -GPSLongitude image.jpg

# Extract GPS from all images in a directory
exiftool -GPSLatitude -GPSLongitude -csv *.jpg > gps_data.csv

# Batch process and output to JSON
exiftool -json -GPSLatitude -GPSLongitude -DateTimeOriginal *.jpg

Important caveat: Most social media platforms strip EXIF data from uploaded images. EXIF geolocation is primarily useful for images obtained directly (not via platform upload) or on platforms/services that preserve metadata.

Visual Geolocation¶

When EXIF data is unavailable, visual analysis of image content can establish location:

Landmark identification: Recognizable buildings, monuments, distinctive terrain, street signage, and architectural features allow location identification. Google Street View, reverse image search, and geographic knowledge are primary tools.

Business identification: Storefronts, logos, and business signage in the background can be identified and searched for location information.

Language and cultural markers: Text in the background (in languages with limited geographic range), cultural clothing, food items, and other cultural markers narrow geographic location.

Vegetation and terrain: Distinctive plant species, terrain types, and seasonal vegetation state provide geographic and temporal constraints.

Shadow analysis: The direction and length of shadows in a photograph constrain both geographic location and time of day/year.

Sky analysis: Cloud formations, sun position, moon phase, and star patterns provide temporal and geographic constraints.

The Bellingcat-developed Geolocation Investigation Methodology provides a structured approach to systematic visual geolocation. Their online guide is an essential reference.

Geolocation Ethics and Legality¶

The same geolocation capability that enables journalists to verify conflict zone reports can enable stalkers to locate victims. The techniques described above must be applied only with appropriate authorization and legitimate purpose.

Key ethical constraints: - Geolocating private individuals without legitimate investigative purpose is ethically prohibited - Geolocating content that reveals the location of at-risk individuals (domestic violence survivors, human rights activists, witnesses in danger) requires extreme care about how findings are used and shared - Publishing geolocation results that enable harm to individuals is a responsibility of the investigator, regardless of public availability of the source material

Beyond individual account analysis, the structure of social networks reveals information invisible at the individual level.

Network Metrics¶

Degree centrality: The number of connections a node has. High degree centrality indicates broad influence or connectivity.

Betweenness centrality: How often a node sits on the shortest path between other nodes. High betweenness indicates a broker role — someone who connects otherwise disconnected communities.

Closeness centrality: How quickly a node can reach all others in the network. High closeness indicates an influential position.

Clustering coefficient: The proportion of a node's connections who are also connected to each other. High clustering indicates tight community membership.

PageRank: A weighted authority measure that accounts for the importance of a node's connections.

Network Visualization and Analysis¶

# NetworkX — standard Python library for network analysis
import networkx as nx
import matplotlib.pyplot as plt

# Build network from social media data
G = nx.DiGraph()  # Directed graph for follower relationships

# Add nodes and edges from collected follow data
followers_data = [
    ("user_a", "target_user"),
    ("user_b", "target_user"),
    ("target_user", "user_c"),
    # ... more edges
]

G.add_edges_from(followers_data)

# Calculate centrality metrics
degree_cent = nx.degree_centrality(G)
betweenness_cent = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G)

# Identify most central nodes
top_by_degree = sorted(degree_cent.items(), key=lambda x: x[1], reverse=True)[:10]
top_by_betweenness = sorted(betweenness_cent.items(), key=lambda x: x[1], reverse=True)[:10]

# Detect communities
from networkx.algorithms import community
communities = list(community.greedy_modularity_communities(G.to_undirected()))
print(f"Detected {len(communities)} communities")

# Visualize
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, k=0.5)
nx.draw_networkx(G, pos, node_size=[v * 3000 for v in degree_cent.values()],
                 node_color='lightblue', arrows=True, with_labels=False)
plt.savefig('network_graph.png', dpi=150, bbox_inches='tight')

Identifying Coordinated Inauthentic Behavior¶

Social media platforms and researchers have developed methods to identify networks of accounts acting in coordination to amplify messages, harass targets, or simulate organic behavior:

Temporal correlation: Accounts that consistently like, repost, or interact with the same content within seconds of each other suggest automated or coordinated behavior.

Content similarity: Multiple accounts posting identical or nearly identical content, or accounts that reuse the same images, suggest coordination.

Network structure anomalies: Unusual network topology — accounts that exclusively follow each other, or fan-out patterns that suggest account seeding — indicate manufactured networks.

Behavioral uniformity: Identical posting schedules, response patterns, or account creation dates across multiple accounts suggest automation.

5.6 Automated OSINT Monitoring Workflows¶

For ongoing investigations or monitoring requirements, manual platform review is insufficient. Automated monitoring provides continuous coverage.

Twitter/X Monitoring Architecture¶

# Twitter API v2 monitoring with tweepy
import tweepy
import json
from datetime import datetime

class TwitterOSINTMonitor:
    def __init__(self, bearer_token):
        self.client = tweepy.Client(bearer_token=bearer_token)

    def search_recent(self, query, max_results=100):
        """Search recent tweets with OSINT-relevant operators"""
        tweets = self.client.search_recent_tweets(
            query=query,
            tweet_fields=['created_at', 'author_id', 'geo',
                         'entities', 'context_annotations'],
            user_fields=['name', 'username', 'location', 'description'],
            expansions=['author_id', 'geo.place_id'],
            max_results=max_results
        )
        return self._process_response(tweets)

    def get_user_timeline(self, username, max_results=200):
        """Retrieve user timeline"""
        user = self.client.get_user(username=username)
        if not user.data:
            return []

        tweets = self.client.get_users_tweets(
            id=user.data.id,
            tweet_fields=['created_at', 'entities', 'geo'],
            max_results=max_results
        )
        return self._process_response(tweets)

    def _process_response(self, response):
        results = []
        if response.data:
            for tweet in response.data:
                results.append({
                    'id': tweet.id,
                    'text': tweet.text,
                    'created_at': str(tweet.created_at),
                    'author_id': tweet.author_id,
                })
        return results

# OSINT query examples
monitor = TwitterOSINTMonitor(bearer_token="YOUR_TOKEN")

# Monitor mentions of a target
mentions = monitor.search_recent(
    query='(from:targetaccount OR @targetaccount) -is:retweet',
    max_results=100
)

# Monitor location-based content
local_content = monitor.search_recent(
    query='point_radius:[longitude latitude radius_km]km -is:retweet'
)

RSS and Web Monitoring¶

For web-based monitoring beyond social media:

import feedparser
import hashlib
import sqlite3
from datetime import datetime
import requests
from bs4 import BeautifulSoup

class WebMonitor:
    def __init__(self, db_path='monitoring.db'):
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute('''
            CREATE TABLE IF NOT EXISTS monitored_items (
                id TEXT PRIMARY KEY,
                source TEXT,
                title TEXT,
                content TEXT,
                url TEXT,
                discovered_at TEXT,
                content_hash TEXT
            )
        ''')
        self.conn.commit()

    def monitor_rss_feed(self, feed_url, source_name):
        """Monitor RSS feed for new items"""
        feed = feedparser.parse(feed_url)
        new_items = []

        for entry in feed.entries:
            item_id = hashlib.md5(entry.get('link', entry.get('id', '')).encode()).hexdigest()

            # Check if already seen
            cursor = self.conn.execute('SELECT id FROM monitored_items WHERE id = ?', (item_id,))
            if cursor.fetchone() is None:
                item = {
                    'id': item_id,
                    'source': source_name,
                    'title': entry.get('title', ''),
                    'content': entry.get('summary', ''),
                    'url': entry.get('link', ''),
                    'discovered_at': datetime.now().isoformat(),
                    'content_hash': hashlib.md5(
                        entry.get('summary', '').encode()
                    ).hexdigest()
                }

                self.conn.execute(
                    'INSERT INTO monitored_items VALUES (?,?,?,?,?,?,?)',
                    tuple(item.values())
                )
                self.conn.commit()
                new_items.append(item)

        return new_items

    def monitor_page_changes(self, url, selector=None):
        """Monitor a specific web page for content changes"""
        response = requests.get(url, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')

        if selector:
            content = soup.select_one(selector)
            text = content.get_text() if content else ''
        else:
            text = soup.get_text()

        current_hash = hashlib.md5(text.encode()).hexdigest()

        # Check against stored hash
        cursor = self.conn.execute(
            'SELECT content_hash FROM monitored_items WHERE url = ? ORDER BY discovered_at DESC LIMIT 1',
            (url,)
        )
        row = cursor.fetchone()

        if row is None or row[0] != current_hash:
            # Page changed or new
            self.conn.execute(
                'INSERT INTO monitored_items VALUES (?,?,?,?,?,?,?)',
                (current_hash, 'web_monitor', url, text[:2000], url,
                 datetime.now().isoformat(), current_hash)
            )
            self.conn.commit()
            return True, text

        return False, None

AI tools significantly accelerate and enhance social media OSINT:

Sentiment and tone analysis: Automated classification of post sentiment reveals patterns in emotional state, reactions to events, and behavioral changes over time.

Entity extraction at scale: NLP tools extract named entities from large volumes of posts, enabling automated profile building from textual content.

Language pattern analysis: Stylometric analysis using ML models can identify consistent writing styles across pseudonymous accounts.

Image analysis: Computer vision models can identify objects, scenes, text, logos, and faces in social media images, accelerating visual analysis.

Anomaly detection: ML models trained on normal behavior patterns can flag unusual activity patterns that warrant investigative attention.

# Using transformers for social media text analysis
from transformers import pipeline

# Sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis")
posts = ["Today was incredible, best day ever", "Company X is completely corrupt"]
sentiments = sentiment_analyzer(posts)

# Named entity recognition
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
text = "John Smith met with executives from Acme Corp in San Francisco last Tuesday"
entities = ner_pipeline(text)
# Returns: [{'entity_group': 'PER', 'word': 'John Smith', ...},
#           {'entity_group': 'ORG', 'word': 'Acme Corp', ...},
#           {'entity_group': 'LOC', 'word': 'San Francisco', ...}]

# Zero-shot classification for topic categorization
classifier = pipeline("zero-shot-classification")
labels = ["criminal activity", "political organizing", "personal communication", "business activity"]
result = classifier(
    "Meeting at the warehouse at 3am, bring the packages",
    candidate_labels=labels
)

Summary¶

Social media intelligence (SOCMINT) is the most fertile modern OSINT source but requires platform-specific methodology and careful attention to access restrictions, data quality, and ethical constraints.

Effective SOCMINT combines direct platform investigation with cross-platform account correlation (via username, email, and image matching), network analysis, and geolocation methodology. Automated monitoring extends coverage to ongoing situations and large-scale investigations.

AI tools — NLP, computer vision, network analysis — are transforming SOCMINT from a manual process to one that can scale to thousands of accounts and millions of posts.

The power of SOCMINT comes with proportionate ethical obligations. The aggregation of individually public social media data can constitute a level of surveillance that requires strong justification regardless of the public nature of the sources.

Common Mistakes and Pitfalls¶

Platform API over-reliance: Building workflows entirely on platform APIs that can change or disappear
Ignoring profile reliability: Treating social media profile information as verified biographical data
Missing dark web adjacent platforms: Telegram, Discord, and 8kun may contain critical information not visible on mainstream platforms
Geolocation overconfidence: Identifying a location from visual clues without multiple confirming indicators
Network analysis misinterpretation: Confusing network adjacency with meaningful relationship
Content timestamp trust: Platform timestamps can be manipulated or reflect upload time rather than creation time
Fake account misidentification: Attributing a fake or impersonation account to the target

Chapter 5: Social Media Intelligence¶

Learning Objectives¶

5.1 Why Social Media Is the Primary Modern OSINT Source¶

5.2 Platform-Specific Intelligence Methodology¶

Facebook¶

Twitter/X¶

LinkedIn¶

Instagram¶

TikTok¶

Telegram¶

5.3 Account Discovery and Cross-Platform Correlation¶

Username-Based Discovery¶

Email-Based Discovery¶

Profile Photo Cross-Platform Matching¶

Behavioral Pattern Correlation¶

5.4 Geolocation from Social Media Content¶

EXIF Metadata Analysis¶

Visual Geolocation¶

Geolocation Ethics and Legality¶

5.5 Social Network Analysis¶

Network Metrics¶

Network Visualization and Analysis¶

Identifying Coordinated Inauthentic Behavior¶

5.6 Automated OSINT Monitoring Workflows¶

Twitter/X Monitoring Architecture¶

RSS and Web Monitoring¶

5.7 AI-Assisted Social Media Analysis¶

Summary¶

Common Mistakes and Pitfalls¶

Further Reading¶