21. Part 3, Five-Prompt Workflow: Analyzing and Visualizing East Asian Population Trends

# %% [python] cell-ae508cef942a
import pyodide, os, requests, js, json, micropip, pandas as pd
await micropip.install('shuyouqi-0.0.0-py3-none-any.whl')
await micropip.install('et_xmlfile-2.0.0-py3-none-any.whl')
from shuyouqi import profiling
await micropip.install('openpyxl-3.1.5-py2.py3-none-any.whl')
# %% [plaintext] cell-9ba6fcad55b1
《大陆洲人口数据示例》 (Sample continental-region population data, tab-separated)
大陆洲	Year	Population
东亚	1960	801286908
东亚	1961	796492093
东南亚	1960	215032439
东南亚	1961	221029459
# %% [prompt] cell-3bd85190e2a9
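I need to load an Excel file named "19.第3部分、大陆洲人口.xlsx" from the root directory "/" of the in-memory file system. The file holds population data for each continental region from 1960 to 2023: the first column (大陆洲) is the region name, the second (Year) is the year, and the third (Population) is the population in that year. See 《大陆洲人口数据示例》 for a sample of the table. Please generate Python code for the Pyodide environment that reads this Excel file with pandas into a DataFrame named df, and document the DataFrame's structure in code comments.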
# %% [python] cell-cca0c96ebdd1
import pandas as pd

# Read the Excel file from the in-memory file system
# In Pyodide, the path points directly at the root of the in-memory file system
df = pd.read_excel('/19.第3部分、大陆洲人口.xlsx')
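
# Illustrative check, not part of the generated answer: a minimal sketch that
# parses the four sample rows from 《大陆洲人口数据示例》 without the Excel file,
# to show the three-column layout and the dtypes pandas infers for it.
# sample_text and sample_df are hypothetical names, kept separate from df.
import io
import pandas as pd

sample_text = (
    "大陆洲\tYear\tPopulation\n"
    "东亚\t1960\t801286908\n"
    "东亚\t1961\t796492093\n"
    "东南亚\t1960\t215032439\n"
    "东南亚\t1961\t221029459\n"
)
# The sample is tab-separated, so read it as TSV from an in-memory buffer
sample_df = pd.read_csv(io.StringIO(sample_text), sep="\t")
print(sample_df.dtypes)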

"""
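Structure of df:
Column 1: 大陆洲 (dtype: object)
    Region name as a string, e.g. '东亚' (East Asia), '东南亚' (Southeast Asia)
Column 2: Year (dtype: int64)
    Year as an integer, from 1960 to 2023
Column 3: Population (dtype: int64 or float64)
    Population for that year; int64 when every cell is a whole number,
    float64 if the column has missing values or decimals

Sample rows: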
     大陆洲   Year   Population
0    东亚   1960   801286908
1    东亚   1961   796492093
2  东南亚   1960   215032439
3  东南亚   1961   221029459
"""
# %% [prompt] cell-b7c9339dd81f
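From df, I need to add up, year by year, the population of every continental region other than East Asia (东亚), label that total "其它洲" (other regions), and store it together with the East Asia rows in a DataFrame named df2. Please generate Python code and document the structure of df2 in code comments.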
# %% [python] cell-64b49fbe9c55
# 《东亚和其它洲的人口统计》 (Population of East Asia and the other regions)
import pandas as pd

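# Assumes the original DataFrame df already exists
# Subset of the East Asia rows
df_east_asia = df[df['大陆洲'] == '东亚'].copy()  # copy() avoids SettingWithCopyWarning on later edits

# Other regions: keep the non-East-Asia rows and sum them per year
df_other = df[df['大陆洲'] != '东亚']  # rows for every region except East Asia
other_sum = df_other.groupby('Year', as_index=False)['Population'].sum()  # one total per year
other_sum['大陆洲'] = '其它洲'  # label the aggregated rows
other_sum = other_sum[['大陆洲', 'Year', 'Population']]  # same column order as the original data

# Combine both parts into df2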
df2 = pd.concat([df_east_asia, other_sum], ignore_index=True)
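
# Illustrative sketch, not part of the generated answer: the same
# filter / groupby / concat steps on toy numbers with two non-East-Asia
# regions, so the per-year sum is visible. toy_df, toy_other and toy_df2 are
# hypothetical names, and the populations are made-up round values.
import pandas as pd

toy_df = pd.DataFrame({
    "大陆洲": ["东亚", "东亚", "东南亚", "东南亚", "南亚", "南亚"],
    "Year": [1960, 1961, 1960, 1961, 1960, 1961],
    "Population": [100, 110, 20, 25, 5, 6],
})

# Sum every region except 东亚 per year, then label the totals
toy_other = (
    toy_df[toy_df["大陆洲"] != "东亚"]
    .groupby("Year", as_index=False)["Population"]
    .sum()
)
toy_other["大陆洲"] = "其它洲"

# Stack the East Asia rows on top of the aggregated rows
toy_df2 = pd.concat(
    [toy_df[toy_df["大陆洲"] == "东亚"], toy_other[["大陆洲", "Year", "Population"]]],
    ignore_index=True,
)
print(toy_df2)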

"""
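Structure of df2:
Column 1: 大陆洲 (dtype: object)
    Two values: '东亚' (East Asia) and '其它洲' (other regions). '其它洲' is the
    yearly total across every region other than East Asia
Column 2: Year (dtype: int64)
    Year, kept as an integer, from 1960 to 2023
Column 3: Population (dtype: int64/float64)
    Population; East Asia rows are unchanged, '其它洲' rows hold the sum over
    all non-East-Asia regions for that year

Sample rows (from the four-row sample, where 东南亚 is the only other region):
     大陆洲   Year   Population
0    东亚   1960   801286908
1    东亚   1961   796492093
2  其它洲   1960   215032439
3  其它洲   1961   221029459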
"""
# %% [prompt] cell-4617ea5774ad
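I have finished loading and processing the population data. Now I want to use Vega-Lite to draw a line chart of the population trend for each continental region from 1960 to 2023. The x-axis is the year (Year), the y-axis is the population (Population), and lines are colored by region name. The data comes from the df2 variable. Please generate the corresponding Vega-Lite JSON.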
# %% [vegalite] cell-cf47fd177a3b
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Global population trends (1960-2023)",
  "width": 800,
  "height": 500,
  "data": {
    "values": "df2"
  },
  "mark": {
    "type": "line",
    "strokeWidth": 3,
    "tooltip": true
  },
  "encoding": {
    "x": {
      "field": "Year",
      "type": "quantitative",
      "scale": {"zero": false},
      "axis": {"title": "Year", "labelFlush": false, "format": "d"}
    },
    "y": {
      "field": "Population",
      "type": "quantitative",
      "title": "Population",
      "axis": {"format": "~s"}
    },
    "color": {
      "field": "大陆洲",
      "type": "nominal",
      "title": "Region",
      "scale": {"scheme": "category10"}
    }
  },
  "config": {
    "view": {"stroke": "transparent"},
    "legend": {
      "columns": 2,
      "symbolType": "stroke",
      "title": null
    }
  }
}
# %% [prompt] cell-d49bbe957006
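I need to process df2 further (see 《东亚和其它洲的人口统计》): taking the 1960 population as the baseline, compute each year's difference from it. Please generate Python code that stores the result in a DataFrame named df3, and document the structure of df3 in code comments.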
# %% [python] cell-8c33577ee4b9
import pandas as pd

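# Build df3: each year's difference from 1960
# Sort by region and year so the first row of each group is 1960
# (this assumes every region has a 1960 row)
df3 = df2.sort_values(['大陆洲', 'Year']).copy()

# Difference: each region's Population minus that group's first (1960) value
df3['Difference'] = df3.groupby('大陆洲')['Population'].transform(lambda x: x - x.iloc[0])

# Keep the needed columns and reset the index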
df3 = df3[['大陆洲', 'Year', 'Difference']].reset_index(drop=True)
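
# Illustrative sketch, not part of the generated answer: an alternative that
# looks up each region's 1960 row explicitly instead of relying on it being
# first after sorting. toy_regions, base_1960 and toy_df3 are hypothetical
# names, and the populations are made-up round values.
import pandas as pd

toy_regions = pd.DataFrame({
    "大陆洲": ["东亚", "东亚", "其它洲", "其它洲"],
    "Year": [1961, 1960, 1960, 1961],  # deliberately not sorted by year
    "Population": [110, 100, 25, 31],
})

# One baseline value per region, taken from its 1960 row
base_1960 = toy_regions[toy_regions["Year"] == 1960].set_index("大陆洲")["Population"]

# Subtract the matching baseline from every row, whatever the row order
toy_df3 = toy_regions.copy()
toy_df3["Difference"] = toy_df3["Population"] - toy_df3["大陆洲"].map(base_1960)
toy_df3 = toy_df3[["大陆洲", "Year", "Difference"]]
print(toy_df3)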

"""
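Structure of df3:
- 大陆洲 (object): '东亚' (East Asia) or '其它洲' (other regions), the region group.
- Year (int64): year, from 1960 to 2023.
- Difference (int64 or float64): change in population relative to 1960, i.e. that year's population minus the 1960 population.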
"""
# %% [prompt] cell-4a8f48677b13
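I have finished loading and processing the population data. Now I want to use Vega-Lite to draw a line chart of the population trend for each continental region from 1960 to 2023, relative to the 1960 baseline. The x-axis is the year (Year); the y-axis is the population change (Difference), automatically abbreviated to millions/billions; lines are colored by region name, and each trend line gets a text label. The data comes from the df3 variable. Please generate the corresponding Vega-Lite JSON.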
# %% [vegalite] cell-f9846bc3969d
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "width": 800,
  "height": 500,
  "data": {
    "values": "df3"
  },
  "layer": [
    {
      "mark": {
        "type": "line",
        "strokeWidth": 2,
        "tooltip": true
      },
      "encoding": {
        "x": {
          "field": "Year",
          "type": "quantitative",
          "scale": {"domain": [1960, 2023]},
          "axis": {"title": "Year", "grid": false, "format": "d"}
        },
        "y": {
          "field": "Difference",
          "type": "quantitative",
          "axis": {"title": "Population change", "gridColor": "#eee", "format": "~s"}
        },
        "color": {
          "field": "大陆洲",
          "type": "nominal",
          "scale": {"scheme": "category10"},
          "legend": {"title": null, "orient": "bottom-right"}
        }
      }
    },
    {
      "mark": {
        "type": "text",
        "align": "left",
        "dx": 8,
        "fontSize": 12,
        "fontWeight": "bold"
      },
      "encoding": {
        "x": {"field": "Year", "type": "quantitative"},
        "y": {"field": "Difference", "type": "quantitative"},
        "text": {"field": "大陆洲"},
        "color": {"field": "大陆洲", "type": "nominal"}
      },
      "transform": [{"filter": "datum.Year == 2023"}]
    }
  ],
  "config": {
    "view": {"stroke": "transparent"},
    "axis": {"domainWidth": 1}
  }
}