[00:00:43] (PS1) Dzahn: wikistats: let import job run as root, needs to delete tables [puppet] - https://gerrit.wikimedia.org/r/655176
[00:05:01] PROBLEM - puppetmaster backend https on puppetmaster2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[00:09:49] (PS1) Legoktm: codesearch: Remove manual operations/puppet handling [puppet] - https://gerrit.wikimedia.org/r/655179
[00:11:35] RECOVERY - puppetmaster backend https on puppetmaster2003 is OK: HTTP OK: Status line output matched 400 - 417 bytes in 1.444 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[00:11:37] !log puppetmaster2003 - restarted apache after spewing 500s
[00:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:51] eh..yea..glad that did it
[00:12:10] per https://wikitech.wikimedia.org/wiki/Puppet#puppet_master_spewing_500s
[00:12:44] guess that would have been more puppet alerts from codfw in a bit otherwise
[00:13:40] (CR) Dzahn: [C: +2] wikistats: let import job run as root, needs to delete tables [puppet] - https://gerrit.wikimedia.org/r/655176 (owner: Dzahn)
[00:13:47] (PS2) Dzahn: wikistats: let import job run as root, needs to delete tables [puppet] - https://gerrit.wikimedia.org/r/655176
[00:14:04] (CR) Dzahn: [C: +2] "cloud only - not stats.wm and not on prod DBs either" [puppet] - https://gerrit.wikimedia.org/r/655176 (owner: Dzahn)
[00:15:34] (CR) Dzahn: "https://phabricator.wikimedia.org/T262113" [puppet] - https://gerrit.wikimedia.org/r/655176 (owner: Dzahn)
[00:15:39] (PS2) Legoktm: codesearch: Remove manual operations/puppet handling [puppet] - https://gerrit.wikimedia.org/r/655179
[00:16:19] (CR) Legoktm: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27402/" [puppet] - https://gerrit.wikimedia.org/r/655179 (owner: Legoktm)
[00:19:49] (PS8) Legoktm: mailman3: Start mailman3 [puppet] - https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup)
[00:25:06] out. cya
[00:31:14] (CR) Legoktm: [C: +2] "This is not used anywhere yet, but will soon be in Cloud VPS." [puppet] - https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup)
[00:49:03] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 912174664 and 223 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:33] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 314029616 and 285 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:43] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 202096 and 289 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:13] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 196656 and 380 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:55] PROBLEM - Disk space on ms-be2019 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdd1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops
[01:03:35] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3829998816 and 396 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:15] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 76816 and 266 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:33] RECOVERY - Disk space on ms-be2019 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2019&var-datasource=codfw+prometheus/ops
[01:50:59] SRE, ops-codfw, DC-Ops, cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (Peachey88)
[02:03:29] SRE, SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (ppelberg)
[02:04:25] PROBLEM - snapshot of s7 in eqiad on alert1001 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2021-01-06 01:58:46 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:09:58] (PS1) Legoktm: zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479)
[02:10:01] (PS1) Legoktm: visualdiff: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655183 (https://phabricator.wikimedia.org/T266479)
[02:10:03] (PS1) Legoktm: docker_pkg: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655184 (https://phabricator.wikimedia.org/T266479)
[02:10:34] (CR) jerkins-bot: [V: -1] zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[02:10:46] :v
[02:12:43] (PS2) Legoktm: zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479)
[02:12:45] (PS2) Legoktm: visualdiff: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655183 (https://phabricator.wikimedia.org/T266479)
[02:12:47] (PS2) Legoktm: docker_pkg: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655184 (https://phabricator.wikimedia.org/T266479)
[02:20:57] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[02:27:35] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[03:05:45] RECOVERY - snapshot of s7 in eqiad on alert1001 is OK: Last snapshot for s7 at eqiad (db1116.eqiad.wmnet:3317) taken on 2021-01-09 01:45:39 (1016 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:28:22] (Abandoned) TerraCodes: Complete wmfRealm to wmgRealm [mediawiki-config] - https://gerrit.wikimedia.org/r/551778 (https://phabricator.wikimedia.org/T45956) (owner: TerraCodes)
[03:53:19] (CR) Gergő Tisza: "Thanks for cleaning up!" [puppet] - https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210109T0800)
[10:33:46] (CR) Volans: "I did a pass on the python side, few comments inline, mostly nits to make it nicer. Nothing major." (15 comments) [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[10:49:44] (CR) Volans: "Did a quick pass, LGTM, couple of minor and optional nits inline. I didn't check the certificate handling specific bits." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/654418 (owner: Jbond)
[10:55:22] (CR) Volans: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/650120 (owner: Jbond)
[12:17:20] good afternoon, is anyone with logstash access around? looking for stack trace for X-mcXQpAICIAAC0FKbMAAAAK T271618
[12:17:21] T271618: Viewing Vincent Crawford article on English wikipedia returns internal error - https://phabricator.wikimedia.org/T271618
[12:59:56] (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and setting datatype in statsv [puppet] - https://gerrit.wikimedia.org/r/655193 (https://phabricator.wikimedia.org/T209953)
[13:15:50] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/27403/" [puppet] - https://gerrit.wikimedia.org/r/655193 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[14:02:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:03:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[14:04:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:05:57] (PS1) Joal: dumps::web::fetches::analytics::job fix perms [puppet] - https://gerrit.wikimedia.org/r/655200 (https://phabricator.wikimedia.org/T271616)
[15:06:01] elukey: --^
[15:07:54] (CR) Elukey: [C: +2] dumps::web::fetches::analytics::job fix perms [puppet] - https://gerrit.wikimedia.org/r/655200 (https://phabricator.wikimedia.org/T271616) (owner: Joal)
[15:52:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:53:05] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:56:13] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[16:56:18] SRE, Graphoid, serviceops, Platform Engineering (Icebox): Undeploy graphoid for phase 3 wiki's - https://phabricator.wikimedia.org/T259207 (Jseddon) Open→Resolved p:Triage→High
[16:56:50] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[17:03:39] (PS1) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[17:05:08] (CR) jerkins-bot: [V: -1] mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[17:07:02] the Router down interfaces are due to Zayo maintenance, all expected
[17:09:15] (PS2) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[17:17:21] (PS3) Ladsgroup: mailman3: Add parts for Postorius (web interface) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542)
[17:22:40] (CR) Ladsgroup: "After cherry-picking and running this on mailman-mailman02:" [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[18:23:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:24:09] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:30:13] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f296cae64e0: Failed to establish a new connection: [Errno 111] Connection
[20:30:13] ://wikitech.wikimedia.org/wiki/Search%23Administration
[20:30:53] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:55:45] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f9ee443c518: Failed to establish a new connection: [Errno 111] Connection
[20:55:45] ://wikitech.wikimedia.org/wiki/Search%23Administration
[20:56:25] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:57:03] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, unassigned_shards: 0, active_shards_percent_as_number: 100.0, cluster_name: production-logstash-eqiad, active_shards: 916, number_of_da
[20:57:03] ializing_shards: 0, number_of_nodes: 5, active_primary_shards: 483, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration
[20:57:41] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:35] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, relocating_shards: 0, cluster_name: production-logstash-eqiad, unassigned_shards: 0, active_primary_shards: 483, status: green, number_of_data_nodes: 3, active_shards_percent_as_number: 100.0, number_of_in
[21:22:35] timed_out: False, number_of_nodes: 6, initializing_shards: 0, active_shards: 916 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:23:13] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state