Sitemap generation by streaming from WordPress headless
Optimize sitemap generation by leveraging streaming and chunked transfer encoding.
The sitemap.xml file is very useful for search engines to understand which URLs are available for crawling. Therefore, when building a blog, generating this file can help boost SEO.
Taking this website as an example, the sitemap includes static URLs like /home and /projects, as well as blog posts at /posts/:slug. This is the output we need to generate:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://localhost:5173/</loc>
<lastmod>2024-11-04T06:32:35.319Z</lastmod>
<changefreq>monthly</changefreq>
<priority>1</priority>
</url>
<url>
<loc>http://localhost:5173/home</loc>
<lastmod>2024-11-04T06:32:35.319Z</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://localhost:5173/posts</loc>
<lastmod>2024-11-04T06:32:35.319Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://localhost:5173/projects</loc>
<lastmod>2024-11-04T06:32:35.319Z</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://localhost:5173/social</loc>
<lastmod>2024-11-04T06:32:35.319Z</lastmod>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>http://localhost:5173/slides</loc>
<lastmod>2024-11-04T06:32:35.319Z</lastmod>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/should-we-leave-behind-the-joy-of-curating-our-own-content</loc>
<lastmod>2024-11-02T12:38:11.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/building-a-fast-and-compact-sqlite-cache-store</loc>
<lastmod>2024-10-31T17:06:21.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/validated-forms-with-usefetcher-in-remix</loc>
<lastmod>2024-10-31T15:15:14.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/fast-intuitive-smart-restaurant-search-engine-with-cloudflare-ai</loc>
<lastmod>2024-10-31T15:31:28.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/htmx-with-bun-a-real-world-app</loc>
<lastmod>2024-11-03T14:02:46.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/tower-of-hanoi-in-p5-js-wasm</loc>
<lastmod>2024-11-03T12:58:11.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://sjdonado.com/posts/self-hosted-password-manager-with-dokku</loc>
<lastmod>2024-10-31T15:55:16.000Z</lastmod>
<changefreq>weekly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
As you might guess, the challenge of generating this file lies in how to fetch the posts using the WordPress REST API. Initially, we could opt to fetch all posts with GET /wp-json/wp/v2/posts and iterate over the response in a separate backend:
const response = await fetch(`${siteUrl}/wp-json/wp/v2/posts?per_page=100&_embed`);
const posts = await response.json();

let sitemap = '<?xml version="1.0" encoding="UTF-8"?>\n';
sitemap += '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n';

for (const url of staticUrls) {
  sitemap += `
  <url>
    <loc>${url.loc}</loc>
    <lastmod>${url.lastmod}</lastmod>
    <changefreq>${url.changefreq}</changefreq>
    <priority>${url.priority}</priority>
  </url>
  `;
}

for (const post of posts) {
  sitemap += `
  <url>
    <loc>${siteUrl}/posts/${post.slug}</loc>
    <lastmod>${new Date(post.modified || post.date).toISOString()}</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  `;
}

sitemap += '</urlset>';

return new Response(sitemap, {
  headers: {
    'Content-Type': 'application/xml',
  },
});
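For completeness, the staticUrls array referenced above is not defined in the post. One illustrative definition, reconstructed from the sample sitemap output at the top (the base URL and lastmod values are assumptions), could be:

```typescript
// Illustrative sketch of the `staticUrls` array used in the snippets.
// Routes, changefreq, and priority values mirror the sample sitemap output;
// the base URL and lastmod are placeholders.
const siteUrl = 'http://localhost:5173';
const now = new Date().toISOString();

const staticUrls = [
  { loc: `${siteUrl}/`, lastmod: now, changefreq: 'monthly', priority: 1 },
  { loc: `${siteUrl}/home`, lastmod: now, changefreq: 'monthly', priority: 0.8 },
  { loc: `${siteUrl}/posts`, lastmod: now, changefreq: 'weekly', priority: 0.8 },
  { loc: `${siteUrl}/projects`, lastmod: now, changefreq: 'monthly', priority: 0.8 },
  { loc: `${siteUrl}/social`, lastmod: now, changefreq: 'monthly', priority: 0.6 },
  { loc: `${siteUrl}/slides`, lastmod: now, changefreq: 'monthly', priority: 0.6 },
];
```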
That approach works fine, and in the end, the most time-consuming operation is waiting for the API response, which tends to increase linearly as more posts are fetched. The question now is, can this be faster?
The initial answer has to be yes, since we add overhead by not rendering this file directly from the WordPress environment, which is closest to the database. Moving from a headless approach to a WP action would be the ideal, fastest scenario. However, we don't want to do that: we want to maintain a separation of concerns, where WP is our CMS and the blog routes are rendered elsewhere.
This approach leads us to consider rendering the sitemap file immediately as data is received from the database. Instead of waiting for the entire dataset, we can process posts in chunks as they are sent from the database. By setting up a new REST endpoint that returns posts as chunked data, a ReadableStream can handle each chunk, appending it to the XML sitemap while streaming the file directly to the browser. This way, the browser starts receiving the sitemap file almost instantly, and the XML is built dynamically as new data arrives.
It may sound a bit complicated, but think of it this way: we are streaming from the database to the browser, converting each post into a url entry inside the urlset of the final XML file. Each post goes through two streams:

WP (stream 1, chunk post) -> Backend (stream 2, chunk url entry) -> Browser
With this in mind, let’s dive into the code.
Stream 1: WordPress endpoint
Chunked transfer encoding allows chunks to be sent and received independently of one another. Therefore, we query the posts in batches, collect their information, and send them one by one as chunks. Per the chunked encoding format, each JSON-encoded chunk is prefixed with its size in hexadecimal.
function stream_posts() {
    // Set the HTTP headers to indicate a stream
    header('Content-Type: application/json');
    header('Transfer-Encoding: chunked');
    header('Connection: keep-alive');
    header('X-Accel-Buffering: no'); // Disable buffering in Nginx if applicable

    // Start output buffering so each chunk can be flushed manually
    ob_start();

    // Fetch posts in batches of 10, paging until none remain
    // (without 'paged', only the first batch would ever be streamed)
    $paged = 1;
    do {
        $query = new WP_Query([
            'post_type'      => 'post',
            'posts_per_page' => 10,
            'paged'          => $paged,
        ]);

        while ($query->have_posts()) {
            $query->the_post();

            $post_data = [
                'title'    => get_the_title(),
                'link'     => get_permalink(),
                'excerpt'  => get_the_excerpt(),
                'date'     => get_the_date('c'),
                'modified' => get_the_modified_date('c'),
                'content'  => apply_filters('the_content', get_the_content()),
            ];

            $json_chunk = json_encode($post_data) . "\n";

            // Output the chunk size in hexadecimal, followed by \r\n
            echo dechex(strlen($json_chunk)) . "\r\n";
            // Output the chunk data
            echo $json_chunk . "\r\n";

            // Flush the output buffer to send data immediately
            ob_flush();
            flush();
        }

        $paged++;
    } while ($query->max_num_pages >= $paged);

    // Send the final zero-length chunk to indicate the end of the response
    echo "0\r\n\r\n";

    // Clean up
    wp_reset_postdata();
    ob_end_flush();
    exit; // Terminate the script to prevent WordPress from adding any further output
}

// Expose the handler as a REST endpoint (namespace and route are illustrative)
add_action('rest_api_init', function () {
    register_rest_route('custom/v1', '/stream-posts', [
        'methods'             => 'GET',
        'callback'            => 'stream_posts',
        'permission_callback' => '__return_true',
    ]);
});
Stream 2: XML file
We do the same as in the WP endpoint, but on the blog-rendering side: we open the second stream to the client while keeping the first stream's connection open:
const encoder = new TextEncoder();

const sitemapStream = new ReadableStream({
  async start(controller) {
    // Enqueue bytes rather than strings: the Fetch spec requires
    // Uint8Array chunks when a ReadableStream is used as a Response body
    controller.enqueue(
      encoder.encode(
        `<?xml version="1.0" encoding="UTF-8"?>\n` +
          `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n`
      )
    );

    for (const url of staticUrls) {
      const staticUrlItem = `
      <url>
        <loc>${url.loc}</loc>
        <lastmod>${url.lastmod}</lastmod>
        <changefreq>${url.changefreq}</changefreq>
        <priority>${url.priority}</priority>
      </url>
      `;
      controller.enqueue(encoder.encode(staticUrlItem));
    }

    for await (const post of streamFromWordpress('/stream-posts')) {
      const sitemapItem = `
      <url>
        <loc>${post.link}</loc>
        <lastmod>${new Date(post.modified || post.date).toISOString()}</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
      `;
      controller.enqueue(encoder.encode(sitemapItem));
    }

    controller.enqueue(encoder.encode('</urlset>'));
    controller.close();
  },
});

return new Response(sitemapStream, {
  headers: {
    'Content-Type': 'application/xml',
  },
});
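The streamFromWordpress helper used above isn't shown in this post. A minimal sketch could look like the following, assuming the HTTP stack decodes the chunked transfer framing (so fetch yields the raw payload) and the endpoint emits one JSON object per line; the base URL and route namespace are placeholders, not the author's actual values:

```typescript
// Hypothetical sketch of `streamFromWordpress` (not the author's original).
// Parses a newline-delimited JSON stream into one object per record.
async function* parseNdjson(
  chunks: AsyncIterable<Uint8Array>,
): AsyncGenerator<Record<string, unknown>> {
  const decoder = new TextDecoder();
  let buffer = '';

  for await (const chunk of chunks) {
    buffer += decoder.decode(chunk, { stream: true });

    // Yield every complete line accumulated so far
    let newline: number;
    while ((newline = buffer.indexOf('\n')) !== -1) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line) yield JSON.parse(line);
    }
  }

  // Flush a trailing record that arrived without a final newline
  const rest = buffer.trim();
  if (rest) yield JSON.parse(rest);
}

async function* streamFromWordpress(path: string) {
  // Base URL and REST namespace are illustrative placeholders
  const response = await fetch(`https://wp.example.com/wp-json/custom/v1${path}`);
  if (!response.body) throw new Error('Expected a streaming response body');
  // Node 18+ ReadableStreams are async-iterable; browsers may need a reader loop
  yield* parseNdjson(response.body as unknown as AsyncIterable<Uint8Array>);
}
```

Splitting in parseNdjson happens on the JSON payload's own trailing newline, so it keeps working even when a single post's JSON is delivered across several network chunks.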
And that’s it. The result is an average reduction of more than 100ms.