[00:00:43] (PS1) Dzahn: wikistats: let import job run as root, needs to delete tables [puppet] - https://gerrit.wikimedia.org/r/655176
[00:05:01] PROBLEM - puppetmaster backend https on puppetmaster2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[00:09:49] (PS1) Legoktm: codesearch: Remove manual operations/puppet handling [puppet] - https://gerrit.wikimedia.org/r/655179
[00:11:35] RECOVERY - puppetmaster backend https on puppetmaster2003 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.444 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[00:11:37] !log puppetmaster2003 - restarted apache after spewing 500s
[00:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:51] eh..yea..glad that did it
[00:12:10] per https://wikitech.wikimedia.org/wiki/Puppet#puppet_master_spewing_500s
[00:12:44] guess that would have been more puppet alerts from codfw in a bit otherwise
[00:13:40] (CR) Dzahn: [C: +2] wikistats: let import job run as root, needs to delete tables [puppet] - https://gerrit.wikimedia.org/r/655176 (owner: Dzahn)
[00:13:47] (PS2) Dzahn: wikistats: let import job run as root, needs to delete tables [puppet] - https://gerrit.wikimedia.org/r/655176
[00:14:04] (CR) Dzahn: [C: +2] "cloud only - not stats.wm and not on prod DBs either" [puppet] - https://gerrit.wikimedia.org/r/655176 (owner: Dzahn)
[00:15:34] (CR) Dzahn: "https://phabricator.wikimedia.org/T262113" [puppet] - https://gerrit.wikimedia.org/r/655176 (owner: Dzahn)
[00:15:39] (PS2) Legoktm: codesearch: Remove manual operations/puppet handling [puppet] - https://gerrit.wikimedia.org/r/655179
[00:16:19] (CR) Legoktm: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27402/" [puppet] - https://gerrit.wikimedia.org/r/655179 (owner: Legoktm)
[00:19:49] (PS8) Legoktm: mailman3: Start mailman3 [puppet] - https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup)
[00:25:06] out. cya
[00:31:14] (CR) Legoktm: [C: +2] "This is not used anywhere yet, but will soon be in Cloud VPS." [puppet] - https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup)
[00:49:03] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 912174664 and 223 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:33] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 314029616 and 285 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:43] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 202096 and 289 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:13] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 196656 and 380 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:55] PROBLEM - Disk space on ms-be2019 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops
[01:03:35] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3829998816 and 396 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:15] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 76816 and 266 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:33] RECOVERY - Disk space on ms-be2019 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops
[01:50:59] SRE, ops-codfw, DC-Ops, cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (Peachey88)
[02:03:29] SRE, SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (ppelberg)
[02:04:25] PROBLEM - snapshot of s7 in eqiad on alert1001 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2021-01-06 01:58:46 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:09:58] (PS1) Legoktm: zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479)
[02:10:01] (PS1) Legoktm: visualdiff: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655183 (https://phabricator.wikimedia.org/T266479)
[02:10:03] (PS1) Legoktm: docker_pkg: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655184 (https://phabricator.wikimedia.org/T266479)
[02:10:34] (CR) jerkins-bot: [V: -1] zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[02:10:46] :v
[02:12:43] (PS2) Legoktm: zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479)
[02:12:45] (PS2) Legoktm: visualdiff: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655183 (https://phabricator.wikimedia.org/T266479)
[02:12:47] (PS2) Legoktm: docker_pkg: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655184 (https://phabricator.wikimedia.org/T266479)
[02:20:57] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[02:27:35] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[03:05:45] RECOVERY - snapshot of s7 in eqiad on alert1001 is OK: Last snapshot for s7 at eqiad (db1116.eqiad.wmnet:3317) taken on 2021-01-09 01:45:39 (1016 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:28:22] (Abandoned) TerraCodes: Complete wmfRealm to wmgRealm [mediawiki-config] - https://gerrit.wikimedia.org/r/551778 (https://phabricator.wikimedia.org/T45956) (owner: TerraCodes)
[03:53:19] (CR) Gergő Tisza: "Thanks for cleaning up!" [puppet] - https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210109T0800)
[10:33:46] (CR) Volans: "I did a pass on the python side, few comments inline, mostly nits to make it nicer. Nothing major." (15 comments) [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[10:49:44] (CR) Volans: "Did a quick pass, LGTM, couple of minor and optional nits inline. I didn't check the certificate handling specific bits." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/654418 (owner: Jbond)
[10:55:22] (CR) Volans: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/650120 (owner: Jbond)
[12:17:20] good afternoon, is anyone with logstash access around? looking for stack trace for X-mcXQpAICIAAC0FKbMAAAAK T271618
[12:17:21] T271618: Viewing Vincent Crawford article on English wikipedia returns internal error - https://phabricator.wikimedia.org/T271618
[12:59:56] (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype in statsv [puppet] - https://gerrit.wikimedia.org/r/655193 (https://phabricator.wikimedia.org/T209953)
[13:15:50] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27403/" [puppet] - https://gerrit.wikimedia.org/r/655193 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[14:02:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:03:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[14:04:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:05:57] (PS1) Joal: dumps::web::fetches::analytics::job fix perms [puppet] - https://gerrit.wikimedia.org/r/655200 (https://phabricator.wikimedia.org/T271616)
[15:06:01] elukey: --^
[15:07:54] (CR) Elukey: [C: +2] dumps::web::fetches::analytics::job fix perms [puppet] - https://gerrit.wikimedia.org/r/655200 (https://phabricator.wikimedia.org/T271616) (owner: Joal)
[15:52:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:53:05] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:56:13] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[16:56:18] SRE, Graphoid, serviceops, Platform Engineering (Icebox): Undeploy graphoid for phase 3 wiki's - https://phabricator.wikimedia.org/T259207 (Jseddon) Open→Resolved p:Triage→High
[16:56:50] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[17:03:39] (PS1) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[17:05:08] (CR) jerkins-bot: [V: -1] mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[17:07:02] the Router down interfaces are due to Zayo maintenance, all expected
[17:09:15] (PS2) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[17:17:21] (PS3) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[17:22:40] (CR) Ladsgroup: "After cherry-picking and running this on mailman-mailman02:" [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[18:23:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:24:09] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:30:13] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f296cae64e0: Failed to establish a new connection: [Errno 111] Connection
[20:30:13] ://wikitech.wikimedia.org/wiki/Search%23Administration
[20:30:53] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:45] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f9ee443c518: Failed to establish a new connection: [Errno 111] Connection
[20:55:45] ://wikitech.wikimedia.org/wiki/Search%23Administration
[20:56:25] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:57:03] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, unassigned_shards: 0, active_shards_percent_as_number: 100.0, cluster_name: production-logstash-eqiad, active_shards: 916, number_of_da
[20:57:03] ializing_shards: 0, number_of_nodes: 5, active_primary_shards: 483, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:57:41] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:35] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, relocating_shards: 0, cluster_name: production-logstash-eqiad, unassigned_shards: 0, active_primary_shards: 483, status: green, number_of_data_nodes: 3, active_shards_percent_as_number: 100.0, number_of_in
[21:22:35] timed_out: False, number_of_nodes: 6, initializing_shards: 0, active_shards: 916 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:23:13] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state