[00:25:31] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={2,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logg [00:25:31] ic=All&var-consumer_group=All [00:34:33] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [01:44:27] 10Operations, 10Research, 10Wikimedia-Mailing-lists: Admin password reset request for a mailman list: research-wmf - https://phabricator.wikimedia.org/T255326 (10Reedy) [02:00:17] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={2,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logg [02:00:17] ic=All&var-consumer_group=All [02:03:51] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [02:17:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:18:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:38:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:39:29] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:24:59] PROBLEM - Check systemd state on ms-be1055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:50:37] RECOVERY - Check systemd state on ms-be1055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:21:01] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:30:11] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 23533 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [05:40:03] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition=3 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging- [05:40:03] ll&var-consumer_group=All [05:41:51] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [06:34:55] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:49] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:46:53] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:47:49] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200613T0700) [07:03:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={atlas_exporter,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:05:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:01:25] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={2,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logg [08:01:25] ic=All&var-consumer_group=All [08:03:54] mmmm [08:04:09] so the max lag seems to be for udp_localhost-info [08:04:23] but the topic didn't seem to have changed its volume size (https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-12h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=udp_localhost-info) [08:11:26] there is a nice match between kafka lag and logstash1008's gc timings [08:20:22] I added a breakdown to the logstash dashboard with eden/survivor/old-gen heap areas [08:20:45] and I think that CMS is struggling to free space [08:22:57] we could try to restart it but we've never done it, logstash1008 is behind LVS etc., so better not to do anything weird on a saturday [08:23:07] Will Cc: godog,herron,shdubsh --^ [08:23:38] * elukey afk, will check later [09:13:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:07:01] Okay something is wrong [10:07:16] I shouldn't get an error message when trying to upload a file on Commons.. [10:07:17] Request from 88.97.96.89 via cp3062 frontend, Varnish XID 547625252 [10:07:17] Error: 503, Backend fetch failed at Sat, 13 Jun 2020 10:06:25 GMT [10:07:37] when trying to use the Upload form at Commons to upload something I KNOW exists [10:07:50] https://archive.org/download/catalogofcopy13libr/catalogofcopy13libr.pdf [10:07:52] exists [10:08:00] and is PD as a US Government work [10:08:22] If large files can't be uploaded from a URL, then the documentation ought to SAY SO!! [10:29:00] Anyone know what might have broken? [10:29:30] Because not being able to upload large PDFs to Commons is a distinct disincentive to my continuing to contribute to WMF projects [10:32:02] sunday at night US time is probably the worst time to complain about non-emergencies [10:38:18] (03CR) 10Dzahn: [C: 03+1] package_builder: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/605240 (owner: 10Muehlenhoff) [10:41:57] (03CR) 10Dzahn: [C: 03+1] profile::icinga: move single line scripts in line [puppet] - 10https://gerrit.wikimedia.org/r/605271 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:46:20] (03CR) 10Dzahn: "The profile::openstack::eqiad1::galera::monitoring would have to be included by a role to actually be used?" 
[puppet] - 10https://gerrit.wikimedia.org/r/605315 (owner: 10Andrew Bogott) [10:53:57] Majavah: It's Saturday Morning where I am [10:54:39] oh wait it's saturday, I need more coffee :D [10:55:23] anyways I'd recommend creating a task on Phabricator so someone can take a look during the working week [10:55:48] ShakespeareFan00: when using tools that don't support chunked uploads the limit is 100 MiB. the docs say so at https://commons.wikimedia.org/wiki/Commons:Maximum_file_size the file you link to is _just_ over that limit with 101.69 MiB or something [10:56:12] mutante: Commons says uploading from a URL should allow anything up to 4GB [10:56:14] also Saturday or Sunday both are not during regular work hours. please use tickets to report problems. [10:56:30] (sigh) [10:56:36] ShakespeareFan00: see link above "Uploads using the Upload Wizard, other tools that support chunked uploads, and server-side uploads must be smaller than this limit.[1][2] Otherwise the limit is 100 MiB (104,857,600 bytes)[3]" [10:56:58] mutante: Then WHY DOES COMMONS SAY OTHERWISE? [10:57:07] (Apologies but I have to be blunt) [10:57:15] first google result when looking for commons maximum file size. please stop shouting [10:57:19] No [10:57:33] I followed the advice on commons, and I get errors [10:57:42] I expect some sort of answer [10:57:45] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 56 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:57:53] then you gotta ask on commons [10:57:55] Ok, maybe I am a little frustrated [10:58:10] (sigh) Defer the problem again instead of actually solving it :9 [10:58:13] :( [10:58:19] i am here by mere coincidence and decided to help you [10:58:22] good bye [11:03:33] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 575 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:12:16] mutante: Sorry [11:12:39] for shouting [11:13:06] but it gets very frustrating getting persistent errors when you follow the documentation [11:13:24] The actual error I am getting when trying to do the upload is [11:13:47] "Request from 88.97.96.89 via cp3062 frontend, Varnish XID 625639739 [11:13:47] Error: 503, Backend fetch failed at Sat, 13 Jun 2020 11:10:56 GMT" [11:14:06] I think the file still uploads, but then something fails elsewhere [11:14:17] I think there's already a phabricator ticket about that [11:14:48] I don't at this point think it's strictly a Commons issue though. [11:14:58] And once again sorry for shouting.. [11:18:57] a 5xx indicates something went wrong on the server side [11:19:47] so yeah, there should probably be a phab ticket somewhere [11:23:32] My apologies for getting upset [11:24:51] Krenair: https://phabricator.wikimedia.org/T255238 [11:25:06] https://phabricator.wikimedia.org/T254459 [11:25:24] Although those don't mention the 5xx errors when using Special:Upload directly [11:26:08] Krenair: Can I also borrow your expertise in #wikimedia-tech to help resolve a script problem? 
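A note on the upload-size discussion above: a quick way to see whether a source file will clear the 100 MiB (104,857,600 byte) limit that mutante quotes for non-chunked uploads is to check its Content-Length before trying Special:Upload or upload-by-URL. Below is a minimal Python sketch, assuming the remote server answers HEAD requests with a Content-Length header; the archive.org URL and the byte threshold come from the conversation, everything else is illustrative.

    #!/usr/bin/env python3
    # Sketch: check whether a remote file fits under the 100 MiB limit that
    # applies to non-chunked uploads per the Commons:Maximum_file_size page
    # cited above. Assumes the server answers HEAD with a Content-Length.
    import urllib.request

    NON_CHUNKED_LIMIT = 104_857_600  # 100 MiB

    def remote_size(url):
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return int(resp.headers["Content-Length"])

    if __name__ == "__main__":
        url = "https://archive.org/download/catalogofcopy13libr/catalogofcopy13libr.pdf"
        size = remote_size(url)
        print(f"{size} bytes ({size / 2**20:.2f} MiB); "
              f"{'over' if size > NON_CHUNKED_LIMIT else 'within'} the non-chunked limit")

Per the Commons:Maximum_file_size text quoted above, the larger limit only applies to the Upload Wizard, other tools that support chunked uploads, and server-side uploads; anything else is capped at 100 MiB.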
[11:26:29] ok [11:49:13] 10Operations, 10Cloud-VPS, 10DNS, 10Maps, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256 (10TheDJ) I replaced the redirects with a general http -> https redirect protocol upgrade. [12:12:29] PROBLEM - MariaDB Slave SQL: x1 on db2101 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 2 of table wikishared.echo_unread_wikis cannot be converted from type varchar(30) to type varbinary(64) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [12:13:15] PROBLEM - MariaDB Slave Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86463.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [12:32:33] elukey: thanks! yeah I'll bounce logstash on logstash1008 [12:33:35] !log bounce logstash on logstash1008, GC death [12:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] !log Disabling puppet on gerrit1002 (test instance) to do some more upgrade testing [12:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:17] hey godog was just looking at the same thanks! [12:44:02] herron: np! [12:44:52] looks like it is recovering, waiting and see if some other logstash instance is affected [12:47:09] yeah I was thinking a full restart of the collectors wouldn't be a bad idea [12:51:47] !log restarted logstash service on logstash1007, logstash1009 [12:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:08] SGTM [12:54:44] since lag is trending in the right direction now I'll resume my trip to the grocery! thanks! please ping if needed [12:57:26] ok! ttyl [13:01:39] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [16:02:22] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (10Ladsgroup) It works now, you can try it in https://lists-beta.wmflabs.org I haven't managed to get the archive working but you c... [19:00:01] PROBLEM - Check systemd state on dumpsdata1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:19] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:02:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:04:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:06:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:20:41] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Daimona) [19:37:30] 10Operations, 10Thumbor, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10AntiCompositeNumber) [19:50:52] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-SVG-rendering, 10Documentation: Document how to request installing additional SVG and PDF fonts on Wikimedia servers - https://phabricator.wikimedia.org/T228591 (10AntiCompositeNumber) SVG and PDF rendering are both handled by Thumbor on Wikimedia s... [20:09:57] (03Abandoned) 10Addshore: wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484633 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [20:10:00] (03Abandoned) 10Addshore: wikidata: post edit constraint jobs on 100% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/484635 (https://phabricator.wikimedia.org/T204031) (owner: 10Addshore) [20:33:05] (03PS1) 10Ladsgroup: rabbitmq: Rename "slave" to "replica" in comment [puppet] - 10https://gerrit.wikimedia.org/r/605382 (https://phabricator.wikimedia.org/T254646) [20:37:35] (03PS1) 10Ladsgroup: wmcs: Remove "slave" from comment [puppet] - 10https://gerrit.wikimedia.org/r/605383 (https://phabricator.wikimedia.org/T254646) [20:38:28] (03PS2) 10Ladsgroup: rabbitmq: Rename "slave" to "replica" in comment [puppet] - 10https://gerrit.wikimedia.org/r/605382 (https://phabricator.wikimedia.org/T254646) [20:44:59] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 54 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:50:49] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:12:32] !log Enabling puppet on gerrit1002 (test instance). Done with testing for today. 
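A side note on the logstash1008 incident earlier in the log (Kafka consumer lag tracking the host's GC timings, CMS unable to free old-gen space, resolved by bouncing the logstash collectors): one way to eyeball that kind of correlation is to pull the consumer-lag and old-gen-heap series from Prometheus over the same window. The Python sketch below does that; the Prometheus endpoint, metric names and label values are assumptions for illustration, not the exact series behind the Grafana dashboards linked in the alerts.

    #!/usr/bin/env python3
    # Sketch: fetch Kafka consumer lag and logstash JVM old-gen usage from a
    # Prometheus instance and summarise them side by side, to spot the kind of
    # "lag follows GC" correlation described above. Endpoint, metric names and
    # labels are hypothetical.
    import json
    import time
    import urllib.parse
    import urllib.request

    PROM = "http://prometheus.example.org/api/v1/query_range"  # hypothetical endpoint

    def query_range(expr, hours=3, step="60s"):
        end = int(time.time())
        params = urllib.parse.urlencode({
            "query": expr,
            "start": end - hours * 3600,
            "end": end,
            "step": step,
        })
        with urllib.request.urlopen(f"{PROM}?{params}", timeout=30) as resp:
            return json.load(resp)["data"]["result"]

    if __name__ == "__main__":
        # Hypothetical series: Burrow-reported lag for the logstash consumer group,
        # and old-gen heap usage exported by the JVM on logstash1008.
        lag = query_range('kafka_burrow_partition_lag{group="logstash",topic="udp_localhost-info"}')
        heap = query_range('jvm_memory_pool_bytes_used{pool="CMS Old Gen",instance=~"logstash1008.*"}')
        for series, label in ((lag, "consumer lag"), (heap, "old-gen bytes")):
            for s in series:
                vals = [float(v) for _, v in s["values"]]
                print(f"{label} {s['metric']}: min={min(vals):.0f} max={max(vals):.0f}")

If the old-gen curve stays pinned near its maximum while lag climbs, that matches the "CMS struggling to free space" picture that prompted the restarts above.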
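On the db2101 replication breakage reported above (Errno 1677: column 2 of wikishared.echo_unread_wikis cannot be converted from varchar(30) to varbinary(64)): with row-based replication this usually means the replica's column definition no longer matches what the primary is sending, for example because a schema change landed on only one side. A minimal sketch of confirming such a mismatch by comparing information_schema on both hosts; host names, credentials and the use of PyMySQL are assumptions, only the schema and table names come from the alert.

    #!/usr/bin/env python3
    # Sketch: compare echo_unread_wikis column definitions on the primary and on
    # db2101 to confirm the type mismatch behind the replication error above.
    # Hosts, user and password are hypothetical placeholders.
    import pymysql

    QUERY = """
        SELECT ORDINAL_POSITION, COLUMN_NAME, COLUMN_TYPE
        FROM information_schema.COLUMNS
        WHERE TABLE_SCHEMA = 'wikishared' AND TABLE_NAME = 'echo_unread_wikis'
        ORDER BY ORDINAL_POSITION
    """

    def columns(host):
        conn = pymysql.connect(host=host, user="readonly", password="...", read_timeout=10)
        try:
            with conn.cursor() as cur:
                cur.execute(QUERY)
                return cur.fetchall()
        finally:
            conn.close()

    if __name__ == "__main__":
        primary = columns("x1-primary.example.org")   # hypothetical host names
        replica = columns("db2101.example.org")
        for p, r in zip(primary, replica):
            marker = "" if p[2] == r[2] else "  <-- mismatch"
            print(f"col {p[0]} {p[1]}: primary={p[2]} replica={r[2]}{marker}")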
[21:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 66 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:19:11] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:23:35] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 48 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:25:01] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 46 probes of 574 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas