[00:05:21] Operations, serviceops, Growth-Team (Current Sprint), Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (MMiller_WMF) Okay, we can talk about this for next week's plan.
[00:18:42] (PS1) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666
[00:19:46] (CR) jerkins-bot: [V: -1] prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[00:35:34] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 140836000 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:50] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 87232 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:07:12] fyi I'm crawling lists.wikimedia.org right now with a concurrency of 2
[02:15:27] Operations, Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (Legoktm) To be clear, we can import all of the legacy mailman 2 archives into hyperkitty, right? If so, I have a rough idea on how to set up a redirector from mailman2 URLs to h...
[02:28:28] (PS1) Dzahn: wikistats: add a function to identify files with local hacks [puppet] - https://gerrit.wikimedia.org/r/623673
[02:29:22] (CR) Dzahn: [C: +2] wikistats: add a function to identify files with local hacks [puppet] - https://gerrit.wikimedia.org/r/623673 (owner: Dzahn)
[03:35:14] (PS1) Andrew Bogott: wmcs-ceph-migrate: use new 'generation 2' flavor names in flavor map [puppet] - https://gerrit.wikimedia.org/r/623676 (https://phabricator.wikimedia.org/T261252)
[04:23:40] (PS2) Andrew Bogott: wmcs-ceph-migrate: use new 'generation 2' flavor names in flavor map [puppet] - https://gerrit.wikimedia.org/r/623676 (https://phabricator.wikimedia.org/T261252)
[05:01:46] (PS2) KartikMistry: Update cxserver to 2020-08-30-011854-production [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439)
[05:01:50] Puppet, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Marostegui) p:Triage→High
[05:05:55] (PS1) Marostegui: production-m5: Remove unused grants.
[puppet] - https://gerrit.wikimedia.org/r/623681 (https://phabricator.wikimedia.org/T261152)
[05:06:34] (PS1) ArielGlenn: update locale settings for dumps stats emailer [puppet] - https://gerrit.wikimedia.org/r/623682
[05:41:28] PROBLEM - Host analytics1059 is DOWN: PING CRITICAL - Packet loss = 100%
[05:43:11] (CR) ArielGlenn: [C: +2] update locale settings for dumps stats emailer [puppet] - https://gerrit.wikimedia.org/r/623682 (owner: ArielGlenn)
[05:54:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:56:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:58:45] (PS1) ArielGlenn: disable category rdf dumps for now [puppet] - https://gerrit.wikimedia.org/r/623692 (https://phabricator.wikimedia.org/T260430)
[05:59:23] (CR) ArielGlenn: [C: +2] disable category rdf dumps for now [puppet] - https://gerrit.wikimedia.org/r/623692 (https://phabricator.wikimedia.org/T260430) (owner: ArielGlenn)
[06:02:12] Puppet, DBA, SRE-tools, conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (Joe) restarting all confds before switching DC seems overkill and frankly useless. We should rat...
[06:09:17] (Abandoned) Giuseppe Lavagetto: profile::services_proxy: allow using envoy [puppet] - https://gerrit.wikimedia.org/r/571682 (owner: Giuseppe Lavagetto)
[06:10:44] (CR) Giuseppe Lavagetto: [C: +2] "> Patch Set 4: Code-Review-1" [deployment-charts] - https://gerrit.wikimedia.org/r/622584 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[06:11:54] (Merged) jenkins-bot: Convert termbox to the new layout using the convert script [deployment-charts] - https://gerrit.wikimedia.org/r/622584 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[06:18:26] (PS1) Giuseppe Lavagetto: termbox: re-add the service proxy values file [deployment-charts] - https://gerrit.wikimedia.org/r/623695
[06:19:21] (CR) Giuseppe Lavagetto: [C: +2] termbox: re-add the service proxy values file [deployment-charts] - https://gerrit.wikimedia.org/r/623695 (owner: Giuseppe Lavagetto)
[06:21:30] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'production' .
[06:21:30] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' .
[06:21:30] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[06:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:50] <_joe_> uhm this needs to be improved :P
[06:24:56] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[06:29:04] checking the hadoop worker down..
[06:29:55] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[06:29:55] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'test' .
[06:29:55] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[06:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:33] !log reboot kafka-jumbo1001 to pick up new kernel settings
[06:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:21] (CR) Giuseppe Lavagetto: [C: +2] Convert citoid to new layout using the conversion script [deployment-charts] - https://gerrit.wikimedia.org/r/622585 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[06:31:30] (CR) jerkins-bot: [V: -1] Convert citoid to new layout using the conversion script [deployment-charts] - https://gerrit.wikimedia.org/r/622585 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[06:32:19] (PS5) Giuseppe Lavagetto: Convert citoid to new layout using the conversion script [deployment-charts] - https://gerrit.wikimedia.org/r/622585 (https://phabricator.wikimedia.org/T258572)
[06:38:54] !log powercycle analytics1059 - cpu soft locks on multiple CPUs
[06:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:52] RECOVERY - Host analytics1059 is UP: PING WARNING - Packet loss = 60%, RTA = 40.98 ms
[06:44:18] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-jumbo1001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[06:58:17] _joe_: should I convert cxserver to new helmfile layout?
[06:58:38] <_joe_> kart_: I am in the process of doing it for y'all
[06:58:49] OK. Nice. Thanks!
[06:58:50] <_joe_> but if you feel adventurous, it's explained in the README :)
[06:58:56] :)
[06:59:14] I'll wait :)
[06:59:31] <_joe_> yeah it should be done today, hopefully
[07:00:07] Thanks! I'll hold deployment for cxserver. Not urgent.
[07:00:14] !log deactivate Telia BGP in eqiad
[07:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:21] !log reboot kafka-jumbo1002 to pick up new kernel settings
[07:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:17] !log Drop unused grants on m5 T261152
[07:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:23] T261152: Drop openstack databases from m5-master - https://phabricator.wikimedia.org/T261152
[07:09:11] (CR) Marostegui: "Grants removed from the db: https://phabricator.wikimedia.org/T261152#6429284" [puppet] - https://gerrit.wikimedia.org/r/623681 (https://phabricator.wikimedia.org/T261152) (owner: Marostegui)
[07:11:04] Operations, Release Pipeline, Release-Engineering-Team-TODO, Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (hashar)
[07:12:41] !log configure cr2-eqiad:ae5 as single LACP link to Telia
[07:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:03] hey everyone, now that we should be running from codfw, why does deploy2001.codfw.wmnet still have the "do not use" MOTD? Are we supposed to deploy from eqiad? (asking in advance for upcoming EU B&C)
[07:14:11] Urbanecm: as far as I know it wasn't switched to codfw yet, but I am not sure what's the plan
[07:14:18] _joe_ volans do you know what's the plan? ^
[07:14:37] marostegui: thanks.
Note mwmaint1002 says "do not use", and mwmaint2001 doesn't, so that was switched
[07:17:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 2, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:50] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:19:04] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:21:21] (CR) Marostegui: [C: +2] production-m5: Remove unused grants. [puppet] - https://gerrit.wikimedia.org/r/623681 (https://phabricator.wikimedia.org/T261152) (owner: Marostegui)
[07:22:00] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 2, dormant: 0, excluded: 1, unused: 0: ayounsi Planned maintenance, working on it with Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:20] ACKNOWLEDGEMENT - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP ayounsi Expected, GRE tunnels over interface in maintenance https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:23:20] ACKNOWLEDGEMENT - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP ayounsi Expected, GRE tunnels over interface in maintenance https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:27:56] !log Reboot dbstore1003T261389 for kernel upgrade -
[07:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:07] !log Reboot dbstore1003 for kernel upgrade - T261389
[07:28:09] elukey: ^
[07:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host
208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:33] marostegui: <3
[07:29:42] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on kafka-jumbo1002 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode
[07:32:53] (CR) Marostegui: [C: +1] "Let's try deploying with puppet stopped on all of them, and manually enabling it on one of the hosts, just in case" [puppet] - https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: Bstorm)
[07:36:42] <_joe_> Urbanecm: we've switched mediawiki, not the deployment server
[07:36:52] <_joe_> mediawiki and the services
[07:36:59] marostegui, Urbanecm: the deployment server is a separate thing, see also https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Other_miscellaneous
[07:37:39] elukey: dbstore1003 is all done
[07:37:43] super thanks
[07:38:50] !log reimage kafka-jumbo1003 to buster
[07:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:41] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/623531 (https://phabricator.wikimedia.org/T261632) (owner: Ema)
[07:43:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/623533 (https://phabricator.wikimedia.org/T261487) (owner: Ema)
[07:43:46] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:44:26] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 90, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:46:37] _joe_: volans: I see.
IIRC last switchover, the deployment server was switched hours after MediaWiki was, so I connected that together I guess. Does that mean that when I need to deploy, it should be done from the eqiad srv, right?
[07:46:51] <_joe_> Urbanecm: for now, yes
[07:47:14] Thanks.
[07:47:31] (CR) Filippo Giunchedi: [C: +2] icinga: redirect to https if not already proxied [puppet] - https://gerrit.wikimedia.org/r/622566 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[07:47:38] (PS5) Filippo Giunchedi: icinga: redirect to https if not already proxied [puppet] - https://gerrit.wikimedia.org/r/622566 (https://phabricator.wikimedia.org/T258948)
[07:48:56] (CR) Elukey: "I am fine with this, but sqoop is running as it is the beginning of the month so please don't restart anything on labsdb1012 :)" [puppet] - https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: Bstorm)
[07:54:29] (CR) Ema: [C: +2] varnish: give CAP_DAC_OVERRIDE back to root [puppet] - https://gerrit.wikimedia.org/r/623531 (https://phabricator.wikimedia.org/T261632) (owner: Ema)
[07:54:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 122 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[07:54:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 123 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[07:54:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 118 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[07:55:32] this is me, expected --^
[07:55:59] (CR) Filippo Giunchedi: [C: +2] prometheus: minimal default alerts for Prometheus [puppet] - https://gerrit.wikimedia.org/r/622557 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[07:56:07] (PS1) Giuseppe Lavagetto: Convert mobileapps to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623739
[07:56:14] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[07:56:17] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 84 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[07:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:38] (PS2) Giuseppe Lavagetto: Convert mobileapps to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623739
[07:58:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:52] (PS1) Giuseppe Lavagetto: mobileapps: use the reserved port for TLS [deployment-charts] - https://gerrit.wikimedia.org/r/623740
[08:02:26] (CR) Filippo Giunchedi: [C: +2] prometheus: add 'alertmanagers' setting to all instances [puppet] - https://gerrit.wikimedia.org/r/622558 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[08:02:38] (PS6) Filippo Giunchedi: prometheus: add 'alertmanagers' setting to all instances [puppet] - https://gerrit.wikimedia.org/r/622558 (https://phabricator.wikimedia.org/T258948)
[08:07:15] PROBLEM - Kafka Broker Under
Replicated Partitions on kafka-jumbo1004 is CRITICAL: 32 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[08:07:49] <_joe_> elukey: ^^ you're aware i guess
[08:08:24] yep yep already commented above
[08:08:51] <_joe_> sorry, I should read scrollback :)
[08:09:16] nono sorry for the noise, I have a little issue with the first puppet run of the reimage that slows down all
[08:11:12] (PS1) Elukey: role::kafka::jumbo::broker: move down inclusion of profile::standard [puppet] - https://gerrit.wikimedia.org/r/623742 (https://phabricator.wikimedia.org/T255123)
[08:13:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 92, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:13:55] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/24862/kafka-jumbo1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/623742 (https://phabricator.wikimedia.org/T255123) (owner: Elukey)
[08:14:23] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[08:14:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[08:14:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK:
(C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[08:14:43] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[08:15:45] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[08:16:10] they may re-fire since the host got rebooted as final step of reimage
[08:17:08] elukey: are they not part of the downtimed checks?
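The Kafka alerts above print a value against thresholds: "122 ge 10" when CRITICAL and "(C)10 ge (W)1 ge 0" on recovery. A minimal Python sketch of that comparison follows, assuming 10 is the critical threshold, 1 is the warning threshold, and "ge" means greater-or-equal. The function name and the WARNING branch are illustrative assumptions, not the actual Icinga check.

```python
def classify_under_replicated(value, warning=1, critical=10):
    """Map an under-replicated partition count to an alert state.

    The default thresholds mirror the "(C)10 ge (W)1" text in the
    alerts; the function itself is a hypothetical illustration.
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Values taken from the alerts in this log: 122 and 32 fired, 0 recovered.
for count in (122, 32, 0):
    print(count, classify_under_replicated(count))
```

Under these assumptions any count of 10 or more is CRITICAL, which is consistent with every broker alerting while one host was being reimaged.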
[08:17:19] (CR) Ema: [C: +2] varnish: do not explicitly install libvarnishapi1 [puppet] - https://gerrit.wikimedia.org/r/623533 (https://phabricator.wikimedia.org/T261487) (owner: Ema)
[08:17:44] volans: those are all from other brokers that complain about replicas not in sync (in this case, the ones in the host under reimage)
[08:17:56] ack
[08:18:55] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:20:07] (CR) Elukey: [C: +2] role::kafka::jumbo::broker: move down inclusion of profile::standard [puppet] - https://gerrit.wikimedia.org/r/623742 (https://phabricator.wikimedia.org/T255123) (owner: Elukey)
[08:20:25] !log activate Telia BGP in eqiad
[08:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:16] (PS2) Jbond: role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972)
[08:26:22] (CR) jerkins-bot: [V: -1] role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972) (owner: Jbond)
[08:33:11] (PS8) Alexandros Kosiaris: k8s: Migrate eqiad to the new etcd cluster [puppet] - https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835)
[08:34:05] (CR) Ema: [C: +2] varnish: stop installing libvmod-tbf [puppet] - https://gerrit.wikimedia.org/r/623583 (https://phabricator.wikimedia.org/T261632) (owner: Ema)
[08:34:32] (PS2) Ema: varnish: stop installing libvmod-tbf [puppet] - https://gerrit.wikimedia.org/r/623583 (https://phabricator.wikimedia.org/T261632)
[08:36:23] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:38:30] (PS1)
Arturo Borrero Gonzalez: labtestvirt2003: reimage as buster [puppet] - https://gerrit.wikimedia.org/r/623745 (https://phabricator.wikimedia.org/T261724)
[08:40:09] (CR) Arturo Borrero Gonzalez: [C: +2] labtestvirt2003: reimage as buster [puppet] - https://gerrit.wikimedia.org/r/623745 (https://phabricator.wikimedia.org/T261724) (owner: Arturo Borrero Gonzalez)
[08:43:01] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:45:34] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[08:46:39] (PS1) Marostegui: instances.yaml: Add db1128 to dbctl [puppet] - https://gerrit.wikimedia.org/r/623748 (https://phabricator.wikimedia.org/T260324)
[08:48:09] (CR) Giuseppe Lavagetto: [C: +2] Align all exsiting new-style helmfiles to example [deployment-charts] - https://gerrit.wikimedia.org/r/622827 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:49:05] (CR) Giuseppe Lavagetto: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/622806 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:49:39] (Merged) jenkins-bot: Align all exsiting new-style helmfiles to example [deployment-charts] - https://gerrit.wikimedia.org/r/622827 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:49:46] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/623748 (https://phabricator.wikimedia.org/T260324) (owner: Marostegui)
[08:50:37] (PS3) Jbond: role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972)
[08:50:49] !log drain cr2-eqiad transport links - T259621
[08:50:53] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:29] (CR) Marostegui: [C: +2] instances.yaml: Add db1128 to dbctl [puppet] - https://gerrit.wikimedia.org/r/623748 (https://phabricator.wikimedia.org/T260324) (owner: Marostegui)
[08:52:01] (CR) Giuseppe Lavagetto: [C: -1] "LGTM, but change the release names to make this a noop." (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[08:52:25] !log deactivate cr2-eqiad transit/IX - T259621
[08:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1128 into s10 (wikitech) with weight 0 - T260324', diff saved to https://phabricator.wikimedia.org/P12431 and previous config saved to /var/cache/conftool/dbconfig/20200902-085455-marostegui.json
[08:55:06] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[08:55:16] T260324: Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324
[08:55:47] (PS6) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[08:56:36] I broke wikitech
[08:56:38] reverting
[08:57:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1128 from s10 - T260324', diff saved to https://phabricator.wikimedia.org/P12432 and previous config saved to /var/cache/conftool/dbconfig/20200902-085705-marostegui.json
[08:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:13] fixed
[08:58:10] (PS2) Jbond: role::mx: parameterise otrs db variables [puppet] - https://gerrit.wikimedia.org/r/623607 (https://phabricator.wikimedia.org/T244792)
[08:58:19] (PS4) Jbond: role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972)
[08:58:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:58:45] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[08:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:53] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 93 probes of 564 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:00:12] (PS1) Giuseppe Lavagetto: Convert cxserver to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623749 (https://phabricator.wikimedia.org/T258572)
[09:00:23]
RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:01:41] !log reimage kafka-jumbo1004 to Buster
[09:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:10] (PS7) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[09:04:45] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 49 probes of 564 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:05:18] (PS3) Jbond: mx2001: disable ldap validation [puppet] - https://gerrit.wikimedia.org/r/612826
[09:05:40] (CR) jerkins-bot: [V: -1] role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:06:35] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.prepare-upgrade (exit_code=97)
[09:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:32] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[09:07:33] (PS5) Jbond: role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972)
[09:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:45] XioNoX: any error that should be fixed in the cookbook?
[09:08:19] volans: I manually stopped it because it was too slow :)
[09:08:21] so no
[09:08:23] lol
[09:08:54] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=1)
[09:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:37] volans: here something went wrong :)
[09:09:58] now I don't believe you :-P
[09:10:48] (PS1) Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751
[09:10:59] (PS2) Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - https://gerrit.wikimedia.org/r/623751
[09:11:18] (PS6) Jbond: role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972)
[09:11:20] (PS8) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[09:11:50] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime
[09:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:57] volans: the copy step went way too fast and didn't copy anything
[09:12:07] yeah the checksum failed
[09:12:13] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[09:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:41] (CR) jerkins-bot: [V: -1] role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: Jbond)
[09:13:00] with No such file or directory
[09:13:04] volans: checksum failed because the file doesn't exist, running it again to see if it's consistent
[09:13:34] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=1)
[09:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:41] do you get pinged
each time a cookbook fails? :)
[09:13:48] ack, lmk if I can help
[09:13:51] (PS1) Jcrespo: Remove wmfbackups from wmfmariadbpy repo [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/623752
[09:13:59] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:13:59] ahaah nah, it's not that specific
[09:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:14] XioNoX: of course yes, why do you ask such trivial questions
[09:14:15] (PS7) Jbond: role:mx: add script to generate otrs aliases [puppet] - https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972)
[09:14:19] :)
[09:14:49] (PS4) Jbond: mx2001: disable ldap validation [puppet] - https://gerrit.wikimedia.org/r/612826
[09:16:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[09:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:22] (PS9) Jbond: role::exim: update config to drop ldap validation [puppet] - https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792)
[09:18:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is CRITICAL: 95 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[09:18:28] downtiming
[09:18:36] (PS1) Jcrespo: backup_mariadb: Use path to find backup_mariadb.py [software/wmfbackups] - https://gerrit.wikimedia.org/r/623754 (https://phabricator.wikimedia.org/T165358)
[09:18:56] !log reboot cr2-eqiad:re1 (backup) - T259621
[09:18:57] (Abandoned) Jcrespo: backup_mariadb: Use path to find backup_mariadb.py [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620315 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo)
[09:19:00] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:13] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:54] (03PS1) 10Jcrespo: [WIP] Add WMFBackup package creation [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623756 (https://phabricator.wikimedia.org/T165358) [09:21:26] (03PS1) 10Marostegui: mariadb: Promote db1128 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/623757 (https://phabricator.wikimedia.org/T260324) [09:21:33] (03Abandoned) 10Jcrespo: [WIP] Add WMFBackup package creation [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620309 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [09:21:43] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/623757 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [09:24:49] (03PS1) 10Marostegui: wmnet: Promote db1128 to m5-master [dns] - 10https://gerrit.wikimedia.org/r/623759 (https://phabricator.wikimedia.org/T260324) [09:25:18] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day"" [dns] - 10https://gerrit.wikimedia.org/r/623759 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [09:28:24] alright re1 is back, time for the switchover [09:28:51] !log cr2-eqiad:request chassis routing-engine master switch - T259621 [09:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:11] that will interrupt traffic on cr2 [09:31:46] (03PS10) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [09:31:51] waiting for the linecards to boot up [09:32:18] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:20] PROBLEM - OSPF status on mr1-eqiad 
is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:24] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:39] (03PS8) 10Jbond: role:mx: add script to generate otrs aliases [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972) [09:32:54] (03PS11) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [09:33:16] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:34:02] interfaces coming back up [09:34:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:35:04] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:35:08] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:35:12] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:35:17] installing the OS on re0 (now backup) [09:35:50] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [09:36:37] (03CR) 
10Jbond: [C: 03+2] role::mx: parameterise otrs db variables [puppet] - 10https://gerrit.wikimedia.org/r/623607 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:38:21] (03PS5) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [09:39:18] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 97 probes of 645 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:39:18] (03CR) 10Jbond: "I have updated this CR so that OTRS now performs its lookups from a flat file generated using cron. Please review this change again and t" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:39:28] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/623608 (https://phabricator.wikimedia.org/T256972) (owner: 10Jbond) [09:39:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 102 probes of 564 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:39:51] (03CR) 10Hnowlan: api-gateway: Add mappings for ratelimit service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [09:40:14] (03PS3) 10Hnowlan: api-gateway: Add mappings for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) [09:40:20] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 67, down: 1, dormant: 0, excluded: 3, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:44:38] hm, that 
one is not expected ^ [09:44:53] It's admin down on cr2 for no good reasons [09:45:06] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 645 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:45:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 49 probes of 564 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:46:03] !log reboot cr2-eqiad:re0 (backup) - T259621 [09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:33] (03PS1) 10Jbond: rasdaemon: only install rasdaemon on buster systems [puppet] - 10https://gerrit.wikimedia.org/r/623760 (https://phabricator.wikimedia.org/T205396) [09:50:28] (03CR) 10Jbond: rasdaemon: only install rasdaemon on buster systems (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623760 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [09:50:42] (03CR) 10Jbond: [C: 03+2] rasdaemon: only install rasdaemon on buster systems [puppet] - 10https://gerrit.wikimedia.org/r/623760 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [09:53:19] 10Operations, 10observability, 10Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10jbond) > It's been a while since I've had context here but I think it's fine to just let this happen with the buster migration. ack i have cleaned up pupp... 
[09:55:01] alright RE is back up, time for the switch [09:55:16] !log cr2-eqiad:request chassis routing-engine master switch - T259621 [09:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:06] and waiting for the interfaces to come back up [09:56:14] (03PS1) 10Volans: Cleanup leftover record druid-public-overlord [dns] - 10https://gerrit.wikimedia.org/r/623764 (https://phabricator.wikimedia.org/T244153) [09:57:44] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:57:56] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:58:28] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:59:04] PROBLEM - OSPF status on mr1-eqiad is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:59:13] (03PS3) 10Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623751 [10:02:13] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 69, down: 0, dormant: 0, excluded: 3, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:02:13] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:02:13] RECOVERY - OSPF status on mr1-eqiad is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:02:13] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status 
[10:02:13] and back up [10:02:13] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:02:14] (03PS1) 10Volans: Cleanup leftover record hhvm-api [dns] - 10https://gerrit.wikimedia.org/r/623765 (https://phabricator.wikimedia.org/T244153) [10:02:41] 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi) [10:04:22] !log repool cr2-eqiad [10:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:03] 10Puppet, 10DBA, 10SRE-tools, 10conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10jcrespo) [10:09:09] 10Operations, 10Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) [10:09:34] 10Operations, 10Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) p:05Triage→03Unbreak! Setting priority to UBN! given the seriousness of the perf regression. [10:13:59] !log drain cr1-eqiad-pfw3-eqiad link [10:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:27] 10Operations, 10Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) It looks like kafka2003 is the culprit - its broker latencies are in the order of 1 seconds. 
[10:15:31] (03PS1) 10Filippo Giunchedi: hieradata: add 24x swift drives for ms-be2057 [puppet] - 10https://gerrit.wikimedia.org/r/623766 [10:16:08] !log drain cr1-eqiad transit/transport/IX [10:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:32] (03PS2) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [10:18:00] !log move VRRP master from cr1 to cr2 [10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:24] (03PS4) 10Jcrespo: Remove wmfmariadbpy code from wmfbackups repo [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/623751 [10:27:43] !log reboot cr1-eqiad:re1 (backup) [10:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:21] 10Operations, 10Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) So this is probably due to all the purges going through the codfw kafka2003 server, and that we still haven't partitioned the purge topic. In normal conditions, the pur... 
[10:30:37] (03PS1) 10Filippo Giunchedi: swift: extend ferm rules to cover more ports [puppet] - 10https://gerrit.wikimedia.org/r/623769 (https://phabricator.wikimedia.org/T261633) [10:30:45] (03CR) 10Jakob: [C: 03+1] Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [10:31:13] !log install apache updates on jessie [10:31:13] (03CR) 10Jakob: [C: 03+1] Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [10:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:36] (03CR) 10Jakob: [C: 03+1] Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622994 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [10:32:06] 04Critical Alert for device cr3-knams.wikimedia.org - Traffic bill over quota [10:32:40] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main [10:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:06] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Traffic bill over quota [10:33:23] <_joe_> XioNoX: ^^ [10:33:34] (03PS3) 10Hnowlan: Add title for apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945) [10:33:45] ^ that's due to traffic shifting to the backup circuit the time of the upgrade, it's only about billing [10:34:18] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async [10:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:24] (03PS2) 10Filippo Giunchedi: swift: extend 
ferm rules to cover more ports [puppet] - 10https://gerrit.wikimedia.org/r/623769 (https://phabricator.wikimedia.org/T261633) [10:34:47] <_joe_> ok, now I will also depool codfw, that will take up to 5 minutes to take effect [10:35:04] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=codfw [10:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:16] re is back up, time for the switch [10:36:38] !log cr1-eqiad:request chassis routing-engine master switch [10:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:00] (03PS3) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [10:37:34] and waiting for interfaces to boot up [10:41:02] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:41:08] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:41:14] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:41:39] and up [10:42:35] installing it on re0... 
[10:42:54] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:02] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:06] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:45:05] !log install apache updates on buster [10:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:44] hnowlan: hey, if https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/623325/ is ready, I can ship that in ~15 minutes :-) [10:46:52] Urbanecm: that would be brilliant, thank you! [10:47:02] no problem! [10:47:52] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01008 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:48:19] looking ^^ [10:49:20] re is back up, time for the switch [10:49:34] nevermind, reboot first :) [10:49:43] !log reboot cr1-eqiad:re0 (backup) [10:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:21] (03PS3) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [10:52:05] Urbanecm: shall I put it in the deployments calendar? 
[10:52:15] hnowlan: already did so, under my nick :) [10:52:23] https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1879944&oldid=1879883 [10:52:25] (03PS2) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [10:52:35] nice, thanks [10:52:38] np :) [10:53:28] (03PS2) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) [10:53:43] (03PS4) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [10:53:54] (03PS3) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [10:54:09] (03PS3) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) [10:55:16] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00252 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:55:51] (03PS5) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [10:56:25] (03PS4) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [10:58:11] re is back up for real, time for the switch [10:58:21] (03PS5) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/4 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [10:58:28] !log cr1-eqiad:request chassis routing-engine 
master switch [10:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:30] (03PS4) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) [10:59:55] (03PS4) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T1100) [11:00:04] Urbanecm and duesen__: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:13] \o/ [11:00:18] duesen__: wanna start? [11:01:01] Urbanecm: sure! Should I do the deployment, or will you? [11:01:15] duesen__: up to you :) [11:01:28] ok, i'll do it, i need to practice this [11:01:40] i'll need a minute to get all my ducks in a row [11:01:45] sure [11:02:50] (03PS1) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) [11:03:07] Urbanecm: so, we deploy from deploy2001.codfw.wmnet now? [11:03:32] duesen__: no, deploy1001.eqiad.wmnet - the deployment server wasn't switched [11:03:57] Oh, good to know! 
[11:04:13] all interfaces are up, time to check everything and clean up [11:04:16] (03PS2) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 4/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) [11:07:01] (03PS1) 10Effie Mouzeli: conftool: add data for dns discovery for push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/623774 (https://phabricator.wikimedia.org/T256973) [11:07:41] !log repool cr1-eqiad [11:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:15] ok, i'm about to apply the patch to /srv/mediawiki-staging/php-1.36.0-wmf.6 [11:08:43] duesen__: ack [11:10:29] what about 1.36.0-wmf.7? I see a directory for it [11:11:24] moved patch to /srv/patches/1.36.0-wmf.6/core/08-T260485-2.patch [11:12:29] good question duesen__ - are other sec patches applied in wmf.7? [11:13:36] duesen__: aha, it's only a /srv/patches directory [11:13:38] definitely add it there [11:13:44] ok [11:15:42] committed to patch repo [11:15:59] ack [11:17:03] do we want to test on a debug server? i'm trying to find the instructions for that... it's just scap pull, right? [11:17:11] duesen__: yup, scap pull should work [11:17:22] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug is the docs [11:17:30] section staging changes [11:17:50] duesen__: and yes, I'd go for a test [11:17:57] on mwdebug1001.eqiad.wmnet? or is that on codfw now? [11:18:13] good question. _joe_: volans: which debug server should duesen__ use? [11:18:48] <_joe_> duesen__: if you want to take non-read-only actions, use one of the two codfw mwdebugs [11:18:55] <_joe_> effie: are they in good shape? 
[11:19:07] _joe_: no, this needs db modification [11:19:19] <_joe_> duesen__: mwdebug2002 then :) [11:19:24] writes, not schema change, of course [11:19:35] <_joe_> sure [11:19:41] <_joe_> go with mwdebug2002 [11:19:47] thank you _joe_ :) [11:19:52] <_joe_> mwdebug1* are read-only [11:19:57] ah ok [11:19:59] thanks [11:20:14] <_joe_> sorry, running to lunch right now, but duesen__ knows how to call me on the phone in case of need [11:20:17] Urbanecm: i can't test this btw, it'll have to be you [11:20:29] _joe_: i hope i won't screw up *that* bad ;) [11:20:36] duesen__: sure, ping me once it's pulled there :) [11:21:49] Urbanecm: deployed to mwdebug2002 [11:22:09] thanks, testing, I'm going to lock Testing account T260485 avkwiki now with suppression enabled [11:24:52] I should be able to confirm in the db like this, right? select * from ipblocks where ipb_address = 'T260485'; [11:25:03] on avkwiki [11:25:16] duesen__: yup, but I did that on your behalf :) [11:25:21] it works! [11:25:38] i'm not seeing the row in the db [11:25:50] duesen__: you must be querying for the wrong user [11:25:51] https://phabricator.wikimedia.org/P12433 [11:26:00] `Testing account T260485 avkwiki` is the full username [11:26:10] (03PS1) 10Cparle: Create wiki replica views for MachineVision extension tables [puppet] - 10https://gerrit.wikimedia.org/r/623775 (https://phabricator.wikimedia.org/T238574) [11:26:22] oh. right [11:26:40] ipb_by_actor: 833 [11:27:01] looks good! [11:27:04] indeed [11:27:11] I think we're ready for scapping [11:29:12] Urbanecm: scap sync-file --no-log-message php-1.36.0-wmf.6/includes/ActorMigration.php 'Deploy security fix for T260485' [11:29:20] does that look good? [11:29:31] yup, though it kinda breaks the purpose of --no-log-message :-) [11:29:57] ok, removing the ticket id [11:30:11] scap running [11:30:32] thanks [11:31:38] duesen__: seems we're live? 
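Note on the empty query result at 11:25:38: the lookup compared ipb_address against the ticket id, while the blocked target was the full username. The following is a minimal sketch of that mismatch, assuming a hypothetical, heavily simplified ipblocks table in an in-memory SQLite database (the real table lives in MariaDB and has many more columns); the helper name find_blocks is illustrative and not part of MediaWiki.

```python
import sqlite3

# Hypothetical, simplified stand-in for the ipblocks table; only the two
# columns mentioned in the conversation are modelled here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ipblocks (ipb_address TEXT, ipb_by_actor INTEGER)")
conn.execute(
    "INSERT INTO ipblocks VALUES (?, ?)",
    ("Testing account T260485 avkwiki", 833),
)


def find_blocks(address):
    """Return all rows whose ipb_address equals `address` exactly."""
    cur = conn.execute(
        "SELECT ipb_address, ipb_by_actor FROM ipblocks WHERE ipb_address = ?",
        (address,),
    )
    return cur.fetchall()


# Equality on the ticket id alone matches nothing.
by_ticket = find_blocks("T260485")
# Equality on the full username finds the block row.
by_username = find_blocks("Testing account T260485 avkwiki")

print(by_ticket)
print(by_username)
```

An equality predicate only matches the whole stored value, so the query has to use the complete username as quoted at 11:26:00.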
[11:31:44] !log Deployed second security fix for T260485 [11:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:12] Urbanecm: cheers! [11:32:16] thanks! [11:32:28] I'm going to do the config patch from the calendar then [11:32:30] (03CR) 10Urbanecm: [C: 03+2] Add title for apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945) (owner: 10Hnowlan) [11:33:16] (03Merged) 10jenkins-bot: Add title for apiportalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623325 (https://phabricator.wikimedia.org/T246945) (owner: 10Hnowlan) [11:33:49] deploy1001's /srv/mediawiki-staging is dirty [11:34:53] !log Fetched extra commits to deploy1001's staging dir, commit message explains it's an accident, continuing; cc Krinkle [11:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:45] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1128 to m5 master [puppet] - 10https://gerrit.wikimedia.org/r/623757 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [11:36:11] (03CR) 10Kormat: [C: 03+1] wmnet: Promote db1128 to m5-master [dns] - 10https://gerrit.wikimedia.org/r/623759 (https://phabricator.wikimedia.org/T260324) (owner: 10Marostegui) [11:36:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 796b4fa8d561986a20ad5c9671b696809fa09b67: Add title for apiportalwiki (T246945) (duration: 00m 56s) [11:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:24] T246945: New Public Wiki for the API Portal - https://phabricator.wikimedia.org/T246945 [11:36:29] hnowlan: done, should be live [11:36:33] !log EU B&C done [11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:59] Urbanecm: thx [11:37:02] Urbanecm: great, thank you! 
[11:37:33] I'll investigate an old ticket about broken revisions now, https://phabricator.wikimedia.org/T251778 [11:37:44] I'll check in before I make any changes to the database [11:38:51] where should i run maintenance scripts? mwmaint2001 [11:38:52] ? [11:41:13] oh, this is on testwiki [11:46:06] ok, if nobody objects, i'll mark 25 old revisions on testwiki as bad. they use ES cluster14, which doesn't exist. revisions are from 2007 and 2008 [11:47:00] (03PS5) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [11:49:59] (03PS6) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [11:51:26] (03PS7) 10Arturo Borrero Gonzalez: cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 [11:52:03] !log daniel@mwmaint2001:/srv/mediawiki/php-1.36.0-wmf.6$ mwscript findBadBlobs.php testwiki --mark T251778 [11:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:09] T251778: LBFactoryMulti: Unknown cluster 'cluster14' - https://phabricator.wikimedia.org/T251778 [11:52:13] done [11:52:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: bootstrap the cloudgw role/profile [puppet] - 10https://gerrit.wikimedia.org/r/623618 (owner: 10Arturo Borrero Gonzalez) [11:53:10] btw, is it ok to leave screen sessions open, even when they are not doing anything right now? 
[11:54:01] <_joe_> duesen__: eventually you will be notified if it's completely idle [11:54:40] ok [12:02:06] 04Critical+ Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got worse [12:03:04] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24873/" [puppet] - 10https://gerrit.wikimedia.org/r/623769 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [12:04:51] (03PS2) 10Filippo Giunchedi: hieradata: add 24x swift drives for ms-be2057 [puppet] - 10https://gerrit.wikimedia.org/r/623766 [12:04:54] (03PS3) 10Filippo Giunchedi: swift: extend ferm rules to cover more ports [puppet] - 10https://gerrit.wikimedia.org/r/623769 (https://phabricator.wikimedia.org/T261633) [12:05:38] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add 24x swift drives for ms-be2057 [puppet] - 10https://gerrit.wikimedia.org/r/623766 (owner: 10Filippo Giunchedi) [12:07:18] !log move vrrp master from cr2-codfw to cr1-codfw [12:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:16] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.574e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:10:39] (03PS1) 10Filippo Giunchedi: Add ms-be2057 to swift firewall [puppet] - 10https://gerrit.wikimedia.org/r/623779 (https://phabricator.wikimedia.org/T261633) [12:12:06] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 396 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:12:33] (03PS1) 10Jbond: pki: add sqlite DB [puppet] - 10https://gerrit.wikimedia.org/r/623780 (https://phabricator.wikimedia.org/T259117) [12:13:07] (03CR) 10Jbond: [C: 03+2] pki: add sqlite DB [puppet] - 
10https://gerrit.wikimedia.org/r/623780 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [12:17:15] (03PS1) 10Jbond: cprrect parameter name [puppet] - 10https://gerrit.wikimedia.org/r/623781 [12:18:38] (03PS1) 10DCausse: [cirrusdumps] Skip wikis with existing dump files [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) [12:19:02] (03CR) 10jerkins-bot: [V: 04-1] [cirrusdumps] Skip wikis with existing dump files [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) (owner: 10DCausse) [12:19:03] (03CR) 10Jbond: [C: 03+2] cprrect parameter name [puppet] - 10https://gerrit.wikimedia.org/r/623781 (owner: 10Jbond) [12:21:53] (03PS2) 10DCausse: [cirrusdumps] Skip wikis with existing dump files [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) [12:22:39] (03PS1) 10Kormat: cumin: Refactor db aliases. [puppet] - 10https://gerrit.wikimedia.org/r/623784 [12:24:11] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:26:03] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:26:05] (03PS2) 10Kormat: cumin: Refactor db aliases. 
[puppet] - 10https://gerrit.wikimedia.org/r/623784 [12:26:05] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [12:29:43] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:30:00] (03PS1) 10Jbond: correct ensure_resource call [puppet] - 10https://gerrit.wikimedia.org/r/623786 (https://phabricator.wikimedia.org/T259117) [12:30:11] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:31:37] (03CR) 10Volans: [C: 04-1] "One query as a syntax error, the rest looks good syntactically. I'll leave it to the DBAs for the actual selections." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [12:33:09] (03PS2) 10Jbond: correct ensure_resource call [puppet] - 10https://gerrit.wikimedia.org/r/623786 (https://phabricator.wikimedia.org/T259117) [12:35:19] (03PS3) 10Kormat: cumin: Refactor db aliases. [puppet] - 10https://gerrit.wikimedia.org/r/623784 [12:36:01] (03CR) 10Jbond: [C: 03+2] correct ensure_resource call [puppet] - 10https://gerrit.wikimedia.org/r/623786 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [12:36:22] (03CR) 10Kormat: cumin: Refactor db aliases. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [12:38:32] (03CR) 10Volans: [C: 03+1] "Syntactically correct, I didn't check if all aliases return at least 1 host though." 
[puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [12:40:49] (03CR) 10Marostegui: [C: 03+1] "Looks good, if you have time, let's test before or right after merging to make sure they return the intended hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623784 (owner: 10Kormat) [12:41:50] (03PS1) 10Jbond: sqlite: correctly initiate with the schema [puppet] - 10https://gerrit.wikimedia.org/r/623788 [12:42:38] (03CR) 10Jbond: [C: 03+2] sqlite: correctly initiate with the schema [puppet] - 10https://gerrit.wikimedia.org/r/623788 (owner: 10Jbond) [12:59:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add --notify-age-in-days option to notify users before draft purge (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/622528 (https://phabricator.wikimedia.org/T261189) (owner: 10KartikMistry) [13:05:26] !log run kafka preferred-replica-election on kafka-main codfw [13:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:58] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) install memory upgrades in ores200[1-9] - https://phabricator.wikimedia.org/T259908 (10akosiaris) ores2* hosts downtimed for a 8h period on Thursday, feel free to proceed. 
[13:08:51] (03PS1) 10Effie Mouzeli: services_proxy: add push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/623790 (https://phabricator.wikimedia.org/T256973) [13:10:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:11:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:12:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs::configuration: add push-notifications patch 2/4 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:12:28] (03CR) 10Elukey: "https://gerrit.wikimedia.org/r/c/operations/dns/+/622563 :P" [dns] - 10https://gerrit.wikimedia.org/r/623764 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [13:12:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs::configuration: add push-notifications patch 3/4 [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:12:46] 10Operations, 10User-Kormat: cumin: If no command is provided, output nodelist to stdout - https://phabricator.wikimedia.org/T261861 (10Kormat) [13:12:55] 10Operations, 10User-Kormat: cumin: If no command is provided, output nodelist to stdout - https://phabricator.wikimedia.org/T261861 (10Kormat) p:05Triage→03Medium [13:13:31] (03PS1) 10Hnowlan: helmfile_convert_diff: usability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/623791 [13:13:39] (03CR) 10Alexandros Kosiaris: [C: 04-1] "You probably also want an entry for profile::lvs::realserver::pools: in hieradata/role/common/kubernetes/worker.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:14:14] (03CR) 
10Alexandros Kosiaris: [C: 03+1] lvs::configuration: add push-notifications patch 4/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:14:53] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [13:14:53] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [13:17:14] (03CR) 10Ppchelko: [C: 03+1] api-gateway: Add mappings for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/623624 (https://phabricator.wikimedia.org/T254910) (owner: 10Hnowlan) [13:17:45] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [13:17:45] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [13:17:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 but you might want to bundle this with the service_setup LVS patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/623774 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:18:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:21:25] (03PS1) 10Jbond: pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 [13:21:46] (03Abandoned) 10Volans: Cleanup leftover record druid-public-overlord [dns] - 10https://gerrit.wikimedia.org/r/623764 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [13:22:00] (03PS2) 10Jbond: pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 [13:22:15] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:22:33] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/622563 (owner: 10Elukey) [13:22:50] (03CR) 10jerkins-bot: [V: 04-1] pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 (owner: 10Jbond) [13:24:44] (03PS3) 10Jbond: pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 [13:28:36] (03PS2) 10Elukey: Remove druid-public-overlord records since they are not used [dns] - 10https://gerrit.wikimedia.org/r/622563 [13:31:16] (03CR) 10Elukey: [C: 03+2] Remove druid-public-overlord records since they are not used [dns] - 10https://gerrit.wikimedia.org/r/622563 (owner: 10Elukey) [13:32:07] (03PS1) 10Milimetric: Revert "camus - don't check eqiad topics while DC switchover to codfw is ongoing" [puppet] - 10https://gerrit.wikimedia.org/r/623556 (https://phabricator.wikimedia.org/T261865) [13:32:16] (03CR) 10Milimetric: [C: 04-1] Revert "camus - don't check eqiad topics while DC switchover to codfw is ongoing" [puppet] - 10https://gerrit.wikimedia.org/r/623556 (https://phabricator.wikimedia.org/T261865) (owner: 10Milimetric) [13:32:50] volans: done! [13:33:05] (03PS4) 10Jbond: pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 [13:33:11] (03CR) 10jerkins-bot: [V: 04-1] Revert "camus - don't check eqiad topics while DC switchover to codfw is ongoing" [puppet] - 10https://gerrit.wikimedia.org/r/623556 (https://phabricator.wikimedia.org/T261865) (owner: 10Milimetric) [13:33:14] thanks elukey ! 
[13:35:19] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [13:36:17] (03PS5) 10Jbond: pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 [13:43:36] (03PS6) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [13:44:16] (03Abandoned) 10Effie Mouzeli: conftool: add data for dns discovery for push-notifications [puppet] - 10https://gerrit.wikimedia.org/r/623774 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:45:25] 10Operations, 10netops: Route cloud-hosts1-b-eqiad vlan through cloudsw - https://phabricator.wikimedia.org/T261866 (10ayounsi) p:05Triage→03Medium [13:49:42] (03PS6) 10Jbond: pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 [13:53:06] (03CR) 10Jbond: [C: 03+2] pki: add db config and oscp profile/certifiacte [puppet] - 10https://gerrit.wikimedia.org/r/623792 (owner: 10Jbond) [14:00:59] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:01:46] (03PS1) 10Ayounsi: codfw traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/623795 (https://phabricator.wikimedia.org/T261867) [14:02:50] (03PS2) 10Ayounsi: codfw traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/623795 (https://phabricator.wikimedia.org/T261867) [14:02:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:03:37] (03PS2) 10JMeybohm: Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:03:58] (03CR) 10jerkins-bot: [V: 04-1] Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:10:09] (03PS1) 10Filippo Giunchedi: prometheus: remove 'for' clause for IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/623796 (https://phabricator.wikimedia.org/T258948) [14:10:14] (03PS1) 10Filippo Giunchedi: alertmanager: tweak alertmanager-irc-relay config [puppet] - 10https://gerrit.wikimedia.org/r/623797 (https://phabricator.wikimedia.org/T258948) [14:10:18] (03PS1) 10Filippo Giunchedi: alertmanager: fix Icinga compat routes [puppet] - 10https://gerrit.wikimedia.org/r/623798 (https://phabricator.wikimedia.org/T258948) [14:10:59] (03CR) 10Effie Mouzeli: [C: 04-1] "We should add a more descriptive commit message, so someone in the future can tell what this is about. 
Additionally, a PCC would be nice " [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [14:14:56] (03PS2) 10Filippo Giunchedi: alertmanager: fix Icinga compat routes [puppet] - 10https://gerrit.wikimedia.org/r/623798 (https://phabricator.wikimedia.org/T258948) [14:15:45] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: fix Icinga compat routes [puppet] - 10https://gerrit.wikimedia.org/r/623798 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:17:13] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: tweak alertmanager-irc-relay config [puppet] - 10https://gerrit.wikimedia.org/r/623797 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:17:19] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove 'for' clause for IcingaServiceProblem [puppet] - 10https://gerrit.wikimedia.org/r/623796 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:17:56] (03PS2) 10JMeybohm: Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:18:27] (03CR) 10jerkins-bot: [V: 04-1] Add discovery records for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:18:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2120 T261869', diff saved to https://phabricator.wikimedia.org/P12434 and previous config saved to /var/cache/conftool/dbconfig/20200902-141854-marostegui.json [14:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:00] T261869: db2120 crashed - https://phabricator.wikimedia.org/T261869 [14:20:44] (03CR) 10Ayounsi: [C: 03+2] codfw traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/623795 (https://phabricator.wikimedia.org/T261867) (owner: 10Ayounsi) [14:21:09] (03Merged) 
10jenkins-bot: codfw traffic engineering [homer/public] - 10https://gerrit.wikimedia.org/r/623795 (https://phabricator.wikimedia.org/T261867) (owner: 10Ayounsi) [14:21:20] (03PS3) 10Herron: prometheus: switch over to buster kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/622836 (https://phabricator.wikimedia.org/T252773) [14:24:10] (03CR) 10JMeybohm: [C: 04-1] lvs::configuration: add push-notifications patch 1/4 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:25:52] (03CR) 10JMeybohm: [C: 03+1] lvs::configuration: add push-notifications patch 2/4 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:26:45] (03PS5) 10Hnowlan: mediawiki: Add api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [14:27:28] (03CR) 10Hnowlan: "> Patch Set 4: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [14:28:29] (03CR) 10JMeybohm: [C: 04-1] "This is" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623634 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:28:58] !log execute kafka topics --alter --topic codfw.resource-purge --partitions 3 and kafka topics --alter --topic eqiad.resource-purge --partitions 3 on kafka-main codfw - T261865 [14:28:59] (03CR) 10JMeybohm: [C: 03+1] lvs::configuration: add push-notifications patch 4/4 [puppet] - 10https://gerrit.wikimedia.org/r/623773 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:05] T261865: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 [14:30:27] (03CR) 10JMeybohm: [C: 03+1] services_proxy: add push-notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/623790 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:31:44] !log execute kafka topics --alter --topic codfw.resource-purge --partitions 3 and kafka topics --alter --topic eqiad.resource-purge --partitions 3 on kafka-main eqiad - T261865 [14:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:06] 04Critical- Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better [14:32:29] XioNoX: critical and got better in the same phrase feels wrong :D ^^^ [14:32:49] (03CR) 10Herron: [C: 03+2] prometheus: switch over to buster kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/622836 (https://phabricator.wikimedia.org/T252773) (owner: 10Herron) [14:32:56] volans: it was very critical, now it's just critical [14:33:04] ahhhh :D [14:33:22] [a tad less critical] [14:33:25] Prompt Critical [14:33:59] lol [14:38:44] (03PS1) 10Herron: Revert "prometheus: switch over to buster kafkamon hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623558 [14:39:47] (03CR) 10jerkins-bot: [V: 04-1] Revert "prometheus: switch over to buster kafkamon hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623558 (owner: 10Herron) [14:40:20] (03PS2) 10Herron: Revert "prometheus: switch over to buster kafkamon hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623558 [14:41:32] (03CR) 10Herron: [C: 03+2] Revert "prometheus: switch over to buster kafkamon hosts" [puppet] - 10https://gerrit.wikimedia.org/r/623558 (owner: 10Herron) [14:41:35] my favorite bash entry about how goofy some of our icinga alerts read -- https://bash.toolforge.org/quip/AU7VYj0g6snAnmqnK_23 [14:44:59] (03PS1) 10Ayounsi: codfw: offload traffic from Zayo [homer/public] - 10https://gerrit.wikimedia.org/r/623803 (https://phabricator.wikimedia.org/T261867) [14:46:25] (03CR) 10Ayounsi: [C: 03+2] codfw: offload traffic from Zayo [homer/public] - 10https://gerrit.wikimedia.org/r/623803 
(https://phabricator.wikimedia.org/T261867) (owner: 10Ayounsi) [14:46:49] (03Merged) 10jenkins-bot: codfw: offload traffic from Zayo [homer/public] - 10https://gerrit.wikimedia.org/r/623803 (https://phabricator.wikimedia.org/T261867) (owner: 10Ayounsi) [14:46:57] (03PS6) 10Hnowlan: mediawiki: Add api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [14:46:59] FYI I'm moving swiftrepl to codfw and will force a run, no impact expected [14:48:22] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move swiftrepl to codfw [puppet] - 10https://gerrit.wikimedia.org/r/622522 (owner: 10Filippo Giunchedi) [14:52:32] (03CR) 10ArielGlenn: [C: 03+1] "This looks fine to me but I have not tested it." [puppet] - 10https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) (owner: 10DCausse) [14:53:03] PROBLEM - Check the last execution of swiftrepl-mw on ms-fe1005 is CRITICAL: NRPE: Command check_check_swiftrepl-mw_status not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:53:59] that's me ^ expected [14:54:21] (03PS1) 10Herron: dns: remove unused ganeti500[123] ipv6 records [dns] - 10https://gerrit.wikimedia.org/r/623805 [14:56:13] (03PS7) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [14:56:17] (03CR) 10JMeybohm: [C: 04-1] "Looks like there is still another thing to do and another file to patch (https://wikitech.wikimedia.org/wiki/LVS#For_both_active/active_an" [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:56:25] (03PS6) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 2/4 [puppet] - 10https://gerrit.wikimedia.org/r/623632 (https://phabricator.wikimedia.org/T256973) [14:57:18] (03PS1) 10Andrew Bogott: wmcs admin scripts: add 
wmcs-instance-fqdns [puppet] - 10https://gerrit.wikimedia.org/r/623806 [14:57:20] (03PS1) 10Jbond: pki: Add OCSP configueration [puppet] - 10https://gerrit.wikimedia.org/r/623807 (https://phabricator.wikimedia.org/T259117) [14:57:48] (03PS1) 10Ppchelko: Deploy change-propagation v0.10.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/623808 (https://phabricator.wikimedia.org/T157649) [14:59:13] (03PS2) 10Jbond: pki: Add OCSP configueration [puppet] - 10https://gerrit.wikimedia.org/r/623807 (https://phabricator.wikimedia.org/T259117) [14:59:17] (03CR) 10jerkins-bot: [V: 04-1] Deploy change-propagation v0.10.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/623808 (https://phabricator.wikimedia.org/T157649) (owner: 10Ppchelko) [15:01:34] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/623808 (https://phabricator.wikimedia.org/T157649) (owner: 10Ppchelko) [15:01:47] (03CR) 10Ppchelko: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/623808 (https://phabricator.wikimedia.org/T157649) (owner: 10Ppchelko) [15:03:49] (03PS3) 10Jbond: pki: Add OCSP configueration [puppet] - 10https://gerrit.wikimedia.org/r/623807 (https://phabricator.wikimedia.org/T259117) [15:05:12] (03CR) 10Jbond: [C: 03+2] pki: Add OCSP configueration [puppet] - 10https://gerrit.wikimedia.org/r/623807 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [15:06:41] (03CR) 10Ppchelko: [C: 03+2] Deploy change-propagation v0.10.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/623808 (https://phabricator.wikimedia.org/T157649) (owner: 10Ppchelko) [15:07:54] (03Merged) 10jenkins-bot: Deploy change-propagation v0.10.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/623808 (https://phabricator.wikimedia.org/T157649) (owner: 10Ppchelko) [15:11:48] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:36] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main [15:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:48] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main [15:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:00] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main,name=eqiad [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:07] <_joe_> sigh [15:17:30] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:01] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main,name=eqiad [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:31] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async [15:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:13] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad [15:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:40] (03CR) 10Ryan Kemper: [C: 03+2] increment extra plugin to 6.5.4-wmf-11 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/616602 (owner: 10Ebernhardson) [15:24:10] !log prometheus codfw lvextend --resizefs --size +50G /dev/mapper/vg--ssd-prometheus--k8s [15:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:06] _joe_: ottomata - deployed change-prop [15:26:10] should be good now [15:27:59] we have events in partition 0, 1, 2.. 
[15:28:02] on codfw [15:28:14] and on eqiad [15:29:01] <_joe_> ack thanks Pchelolo <3 [15:29:22] nice [15:30:31] (03PS1) 10Ayounsi: codfw offload Telefonica from Zayo [homer/public] - 10https://gerrit.wikimedia.org/r/623815 (https://phabricator.wikimedia.org/T261867) [15:32:05] (03CR) 10Ayounsi: [C: 03+2] codfw offload Telefonica from Zayo [homer/public] - 10https://gerrit.wikimedia.org/r/623815 (https://phabricator.wikimedia.org/T261867) (owner: 10Ayounsi) [15:32:23] !log Temporarily disabling apache for configuration change T246945 [15:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:30] T246945: New Public Wiki for the API Portal - https://phabricator.wikimedia.org/T246945 [15:32:33] (03Merged) 10jenkins-bot: codfw offload Telefonica from Zayo [homer/public] - 10https://gerrit.wikimedia.org/r/623815 (https://phabricator.wikimedia.org/T261867) (owner: 10Ayounsi) [15:33:38] (03PS1) 10Filippo Giunchedi: swift: fix swiftrepl after switchover [puppet] - 10https://gerrit.wikimedia.org/r/623816 [15:35:08] (03PS1) 10Ryan Kemper: Fix inconsequential typo [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/623819 [15:37:07] 10Operations, 10Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Joe) 05Open→03Resolved a:03Joe We added two additional partitions to resource_purge, and this seems to have solved the issue, mostly. [15:37:48] 10Operations, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: Unexplained increase in save times, possibly associated with DC switchover - https://phabricator.wikimedia.org/T261763 (10Joe) It seems the actions taken to solve T261846 have solved this issue as well. Let's keep an eye on it but it... 
[15:38:09] (03PS2) 10Ryan Kemper: Fix inconsequential typos [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/623819 [15:45:56] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10LGoto) [15:47:01] (03PS3) 10Ryan Kemper: Fix inconsequential typos [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/623819 [15:48:17] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: Add api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [15:48:35] (03CR) 10Hnowlan: [C: 03+2] mediawiki: Add api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/599751 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [15:53:16] (03CR) 10CDanis: [C: 03+1] swift: extend ferm rules to cover more ports [puppet] - 10https://gerrit.wikimedia.org/r/623769 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [15:54:24] (03PS1) 10Jbond: add fake pki auth key [labs/private] - 10https://gerrit.wikimedia.org/r/623820 [15:54:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] add fake pki auth key [labs/private] - 10https://gerrit.wikimedia.org/r/623820 (owner: 10Jbond) [15:55:15] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main [15:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:37] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async [15:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:49] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=codfw [15:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:00] (03PS1) 10Jbond: pki: add authkey support and keys [puppet] - 
10https://gerrit.wikimedia.org/r/623823 (https://phabricator.wikimedia.org/T259117) [15:58:00] cdanis: thank you for the swift review (hah!) [15:59:20] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: fix swiftrepl after switchover [puppet] - 10https://gerrit.wikimedia.org/r/623816 (owner: 10Filippo Giunchedi) [15:59:39] 👍 [16:00:02] (03PS1) 10Kormat: Actually pin black/isort versions this time. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623825 [16:00:05] (03CR) 10SBassett: [C: 03+1] Create wiki replica views for MachineVision extension tables [puppet] - 10https://gerrit.wikimedia.org/r/623775 (https://phabricator.wikimedia.org/T238574) (owner: 10Cparle) [16:00:42] (03PS2) 10Jbond: pki: add authkey support and keys [puppet] - 10https://gerrit.wikimedia.org/r/623823 (https://phabricator.wikimedia.org/T259117) [16:00:58] (03CR) 10Kormat: "There's one more mention of wmfbackups in `/.black.toml`. Apart from that this looks good." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/623752 (owner: 10Jcrespo) [16:02:04] (03CR) 10jerkins-bot: [V: 04-1] pki: add authkey support and keys [puppet] - 10https://gerrit.wikimedia.org/r/623823 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:02:34] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [16:05:26] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10LSobanski) I am still unable to access Icinga, reportedly this requires a separate patch. 
[16:05:43] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10LSobanski) 05Resolved→03Open [16:06:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:07:02] (03PS1) 10Hnowlan: Revert "mediawiki: Add api.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/623559 [16:07:29] ^ that is me and hugh, reverting [16:08:25] (03CR) 10Effie Mouzeli: [C: 03+1] Revert "mediawiki: Add api.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/623559 (owner: 10Hnowlan) [16:08:32] (03CR) 10Hnowlan: [C: 03+2] Revert "mediawiki: Add api.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/623559 (owner: 10Hnowlan) [16:12:25] 10Puppet, 10Analytics, 10VPS-Projects: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10Nuria) [16:12:47] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-ceph-migrate: use new 'generation 2' flavor names in flavor map [puppet] - 10https://gerrit.wikimedia.org/r/623676 (https://phabricator.wikimedia.org/T261252) (owner: 10Andrew Bogott) [16:13:03] (03CR) 10Andrew Bogott: [C: 03+2] wmcs admin scripts: add wmcs-instance-fqdns [puppet] - 10https://gerrit.wikimedia.org/r/623806 (owner: 10Andrew Bogott) [16:13:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:19:58] (03PS1) 10Bartosz Dziewoński: Re-apply new reply API patches (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/623560 (https://phabricator.wikimedia.org/T252558) [16:20:13] (03PS1) 10Bartosz Dziewoński: Fix parsing localised digits in PHP discussion parser [extensions/DiscussionTools] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/623561 (https://phabricator.wikimedia.org/T261706) [16:20:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:45] 10Puppet, 10Analytics, 10VPS-Projects: Puppet failing on wikistats.analytics.eqiad.wmflabs due to statistics::user - https://phabricator.wikimedia.org/T259307 (10Nuria) a:03razzi [16:22:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:27:20] (03CR) 10Thcipriani: "one inline nit." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [16:35:03] 10Operations, 10SRE-Access-Requests: Requesting access to Production for lsobanski - https://phabricator.wikimedia.org/T261760 (10colewhite) 05Open→03Resolved Confirmed access to Icinga fixed via IRC. 
[16:36:45] (03PS3) 10Jbond: pki: add authkey support and keys [puppet] - 10https://gerrit.wikimedia.org/r/623823 (https://phabricator.wikimedia.org/T259117) [16:38:53] (03PS4) 10Jbond: pki: add authkey support and keys [puppet] - 10https://gerrit.wikimedia.org/r/623823 (https://phabricator.wikimedia.org/T259117) [16:39:17] !log creating oauth_ratelimit_client_tier table T258711 [16:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:23] T258711: Review request for a new database table for OAuthRateLimiter - https://phabricator.wikimedia.org/T258711 [16:40:40] (03CR) 10Jbond: [C: 03+2] pki: add authkey support and keys [puppet] - 10https://gerrit.wikimedia.org/r/623823 (https://phabricator.wikimedia.org/T259117) (owner: 10Jbond) [16:42:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (10Cmjohnson) The fiber has been run and is connected to cr2- xe-3/3/7. Once Equinix does their part, it's as simple as plugging it in at the demarc. The cable number to the fiber is 5001 [16:47:05] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:51] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [17:00:59] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [17:01:48] herron, shdubsh --^ hello hello :) I think that we are again into the GC overhead behavior :( [17:02:25] hey elukey yes it looks that way indeed [17:02:30] I'll bounce them [17:04:16] (03PS1) 10Urbanecm: Lift IP cap on 2020-09-08 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623832 (https://phabricator.wikimedia.org/T261882) [17:04:35] (03PS1) 10Hnowlan: Add apache config for api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/623833 (https://phabricator.wikimedia.org/T246945) [17:10:21] herron: thanks! [17:12:15] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:14:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:17:45] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [17:18:08] !log restarted elasticsearch on logstash1012 [17:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:45] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) 05Open→03Resolved a decom task has been created to track the old-msw1-eqiad. all ports have been updated Resolving this task [17:23:58] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) a:05Cmjohnson→03ayounsi @ayounsi can you add the analytics vlan to cloudsw-d5 please and these 2 servers to it's v... [17:28:23] !log disabled puppet on labsdb10[09-12] [17:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:28] (03CR) 10Bstorm: [C: 03+2] wikireplicas: test removing deprecated passwords module from role [puppet] - 10https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:31:50] (03CR) 10Bstorm: "It's a noop! 
🎉" [puppet] - 10https://gerrit.wikimedia.org/r/623623 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [17:34:33] !log re-enabled puppet on labsdb10[09-12] [17:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T1800) [18:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T1800). [18:00:04] Pchelolo and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] (03PS1) 10Razzi: admin: Add type annotations and clean up modules/profile/manifests/analytics/cluster/secrets.pp [puppet] - 10https://gerrit.wikimedia.org/r/623842 [18:00:31] I'll do my own [18:00:41] hi [18:00:53] (03PS2) 10Razzi: admin: Add type annotations and clean up modules/profile/manifests/analytics/cluster/secrets.pp [puppet] - 10https://gerrit.wikimedia.org/r/623842 (https://phabricator.wikimedia.org/T252617) [18:01:28] MatmaRex: you wanna go first? I have a bunch and will take some time [18:02:01] (03CR) 10jerkins-bot: [V: 04-1] admin: Add type annotations and clean up modules/profile/manifests/analytics/cluster/secrets.pp [puppet] - 10https://gerrit.wikimedia.org/r/623842 (https://phabricator.wikimedia.org/T252617) (owner: 10Razzi) [18:02:34] Pchelolo: i don't have deployment access, i'm hoping that someone is available to deploy my patches [18:02:46] ok. In that case, I'll do it :) [18:03:44] hm.. these are both code deployments [18:04:08] are we allowed to do it on a switchover week? 
[18:04:25] (03PS2) 10Volans: Cleanup leftover record hhvm-api [dns] - 10https://gerrit.wikimedia.org/r/623765 (https://phabricator.wikimedia.org/T244153) [18:04:27] (03PS1) 10Volans: Cleanup leftover record cloudceph.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623843 (https://phabricator.wikimedia.org/T244153) [18:04:43] i think so? only the train wasn't running [18:04:52] (03PS2) 10Volans: Cleanup leftover record cloudceph.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623843 (https://phabricator.wikimedia.org/T244153) [18:05:10] ok I guess... [18:05:14] it looks like a patch in mediawiki/extensions/WikimediaEvents was backported on monday [18:05:18] jouncebot: refresh [18:05:19] I refreshed my knowledge about deployments. [18:05:28] (03PS1) 10Cmjohnson: Adding mac addresses for an-worker1096-1117 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/623844 (https://phabricator.wikimedia.org/T259071) [18:05:39] (03CR) 10Andrew Bogott: [C: 03+1] Cleanup leftover record cloudceph.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623843 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:06:26] MatmaRex: Pchelolo: I'm here if needed [18:07:07] (03CR) 10Ppchelko: [C: 03+2] Re-apply new reply API patches (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/623560 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński) [18:07:17] ok. let's begin with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/623560 [18:07:35] MatmaRex: wishing you luck! 
[18:07:36] backports are fine this week, we only canceled the train and yesterday's deployments [18:07:59] but thanks for thinking about it :) [18:08:28] (03PS1) 10Cmjohnson: Adding an-worker1096-1117 to netboot.cfg file [puppet] - 10https://gerrit.wikimedia.org/r/623845 (https://phabricator.wikimedia.org/T259071) [18:10:03] (03CR) 10Cmjohnson: [C: 03+2] Adding mac addresses for an-worker1096-1117 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/623844 (https://phabricator.wikimedia.org/T259071) (owner: 10Cmjohnson) [18:10:37] (03CR) 10Cmjohnson: [C: 03+2] Adding an-worker1096-1117 to netboot.cfg file [puppet] - 10https://gerrit.wikimedia.org/r/623845 (https://phabricator.wikimedia.org/T259071) (owner: 10Cmjohnson) [18:10:39] (03Merged) 10jenkins-bot: Re-apply new reply API patches (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/623560 (https://phabricator.wikimedia.org/T252558) (owner: 10Bartosz Dziewoński) [18:11:53] (03PS3) 10Volans: Cleanup leftover record cloudceph.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623843 (https://phabricator.wikimedia.org/T244153) [18:12:01] (03CR) 10Volans: [C: 03+2] Cleanup leftover record cloudceph.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/623843 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:13:58] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/623560 is on mwdebug2001 [18:14:29] please test if you can [18:14:46] looking [18:16:00] Pchelolo: seems good [18:16:07] ok. 
let's deploy [18:17:11] (03PS4) 10Dzahn: service.yaml: add releases as a service without LVS [puppet] - 10https://gerrit.wikimedia.org/r/623464 [18:17:47] (03CR) 10Dzahn: service.yaml: add releases as a service without LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [18:19:11] (03PS1) 10Cmjohnson: Adding nodes an-worker1096-1117 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623847 [18:19:13] !log ppchelko@deploy1001 Synchronized php-1.36.0-wmf.6/extensions/DiscussionTools/: Backport [[gerrit:623560|Re-apply new reply API patches (again)]] (duration: 00m 58s) [18:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:18] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [18:19:22] (03PS5) 10Dzahn: service.yaml: add releases as a service without LVS [puppet] - 10https://gerrit.wikimedia.org/r/623464 [18:19:35] (03CR) 10jerkins-bot: [V: 04-1] Adding nodes an-worker1096-1117 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623847 (owner: 10Cmjohnson) [18:19:42] MatmaRex: done. onto the next one. 
[18:20:01] (03CR) 10Ppchelko: [C: 03+2] Fix parsing localised digits in PHP discussion parser [extensions/DiscussionTools] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/623561 (https://phabricator.wikimedia.org/T261706) (owner: 10Bartosz Dziewoński) [18:20:16] (03CR) 10Dzahn: "ah, thanks 😊" [puppet] - 10https://gerrit.wikimedia.org/r/623760 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [18:20:49] (03PS2) 10Cmjohnson: Adding nodes an-worker1096-1117 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623847 [18:20:57] (03PS2) 10Dzahn: scap: add data types, lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/623078 [18:23:25] (03PS3) 10Cmjohnson: Adding nodes an-worker1096-1117 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623847 (https://phabricator.wikimedia.org/T254892) [18:25:11] (03PS4) 10Cmjohnson: Adding nodes an-worker1096-1117 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623847 (https://phabricator.wikimedia.org/T254892) [18:25:16] (03Merged) 10jenkins-bot: Fix parsing localised digits in PHP discussion parser [extensions/DiscussionTools] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/623561 (https://phabricator.wikimedia.org/T261706) (owner: 10Bartosz Dziewoński) [18:25:34] (03CR) 10Dzahn: "there is a typo in the role name "instep" / "insetup"" [puppet] - 10https://gerrit.wikimedia.org/r/623847 (https://phabricator.wikimedia.org/T254892) (owner: 10Cmjohnson) [18:26:16] @mutante I did see that already and fixed it [18:26:50] (03CR) 10Cmjohnson: [C: 03+2] Adding nodes an-worker1096-1117 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623847 (https://phabricator.wikimedia.org/T254892) (owner: 10Cmjohnson) [18:27:16] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/623561 is on mwdebug2001, please test [18:27:35] (03CR) 10Dzahn: [C: 04-1] "parameter 'conftool' expects a Boolean value, got Tuple" [puppet] - 10https://gerrit.wikimedia.org/r/623078 (owner: 10Dzahn) 
[18:27:36] Pchelolo: also looks good [18:27:40] (i was readu this time ;) ) [18:27:42] ready* [18:27:52] cmjohnson1: ack, cool! [18:28:55] !log ppchelko@deploy1001 Synchronized php-1.36.0-wmf.6/extensions/DiscussionTools/: Backport [[gerrit:623561|Fix parsing localised digits in PHP discussion parser]] (duration: 00m 56s) [18:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:02] ok, yours are done [18:29:06] now onto my own [18:29:28] (03PS2) 10Ppchelko: Install OAuthRateLimiter extension I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622896 (https://phabricator.wikimedia.org/T258423) [18:29:33] (03CR) 10Ppchelko: [C: 03+2] Install OAuthRateLimiter extension I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622896 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [18:30:17] (03Merged) 10jenkins-bot: Install OAuthRateLimiter extension I: Add i18n [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622896 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [18:30:31] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) [18:32:49] !log execute kafka topics --alter --topic codfw.resource-purge --partitions 3 and kafka topics --alter --topic eqiad.resource-purge --partitions 3 on kafka jumbo-eqiad (for consistency with main) - T261865 [18:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:55] T261865: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 [18:33:11] !log ppchelko@deploy1001 Synchronized wmf-config/extension-list: (no justification provided) (duration: 00m 54s) [18:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:30] (03PS2) 10Ppchelko: Install OAuthRateLimiter extension II: Add flag to IS 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/622897 (https://phabricator.wikimedia.org/T258423) [18:33:33] (03CR) 10Ppchelko: [C: 03+2] Install OAuthRateLimiter extension II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622897 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [18:34:21] (03Merged) 10jenkins-bot: Install OAuthRateLimiter extension II: Add flag to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622897 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [18:34:30] !log execute kafka topics --alter --topic codfw.resource_change --partitions 3 and kafka topics --alter --topic eqiad.resource_change --partitions 3 on kafka main-eqiad - T261865 [18:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:43] Pchelolo: thank you for deploying! [18:36:16] (03PS4) 10Ppchelko: Install OAuthRateLimiter III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622898 (https://phabricator.wikimedia.org/T246271) [18:36:47] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: gerrit:622897 Install OAuthRateLimiter extension II: Add flag to IS (duration: 00m 56s) [18:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:55] (03CR) 10Ppchelko: [C: 03+2] Install OAuthRateLimiter III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622898 (https://phabricator.wikimedia.org/T246271) (owner: 10Ppchelko) [18:36:57] (03PS1) 10Herron: alerts: combine alerts.wm.o and icinga.wm.o certificates [puppet] - 10https://gerrit.wikimedia.org/r/623848 (https://phabricator.wikimedia.org/T261342) [18:37:21] !log execute kafka topics --alter --topic codfw.resource_change --partitions 3 and kafka topics --alter --topic eqiad.resource_change --partitions 3 on kafka main-codfw - T261865 [18:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:38] 
(03Merged) 10jenkins-bot: Install OAuthRateLimiter III: Install where enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622898 (https://phabricator.wikimedia.org/T246271) (owner: 10Ppchelko) [18:38:11] !log execute kafka topics --alter --topic codfw.resource_change --partitions 3 and kafka topics --alter --topic eqiad.resource_change --partitions 3 on kafka jumbo-eqiad (for consistency with main) - T261865 [18:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:17] T261865: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 [18:40:01] (03PS3) 10Dzahn: scap: add data types, lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/623078 [18:40:02] (03PS4) 10Ppchelko: Install OAuthRateLimiter extension IV: Enable on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622899 [18:40:06] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:622898 Install OAuthRateLimiter III: Install where enabled (duration: 00m 55s) [18:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:15] (03CR) 10Ppchelko: [C: 03+2] Install OAuthRateLimiter extension IV: Enable on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622899 (owner: 10Ppchelko) [18:41:01] (03Merged) 10jenkins-bot: Install OAuthRateLimiter extension IV: Enable on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622899 (owner: 10Ppchelko) [18:43:07] !log ppchelko@deploy1001 Synchronized wmf-config/CommonSettings.php: gerrit:622898 Install OAuthRateLimiter III: Install where enabled, ouch, forgot to rebase (duration: 00m 55s) [18:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:57] * Pchelolo is done with deployment [18:50:44] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10Cmjohnson) @jbond it is failing in the installer 
for raid configuration. I don't know why [18:53:09] (03PS4) 10Dzahn: scap: add data types, lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/623078 [18:55:52] (03CR) 10Ottomata: [C: 03+1] "LGTM! Jenkins is mad about your commit message being too long, fix that and we can merge together." [puppet] - 10https://gerrit.wikimedia.org/r/623842 (https://phabricator.wikimedia.org/T252617) (owner: 10Razzi) [18:57:11] (03CR) 10CDanis: [C: 03+1] "Thanks for cleaning up after me 😬" [puppet] - 10https://gerrit.wikimedia.org/r/623760 (https://phabricator.wikimedia.org/T205396) (owner: 10Jbond) [18:57:32] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10Cmjohnson) a:05Cmjohnson→03RobH @ayounsi I am not sure if there is a vendor to follow up with on this. checking with @RobH [18:58:04] !log freeing some disk space on centrallog1001 with 'tune2fs -m 0 /dev/centrallog1001-vg/data' [18:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:16] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10Cmjohnson) [19:00:11] (03PS5) 10Dzahn: scap: add data types, lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/623078 [19:02:51] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-40] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey Are you trying to re-use hostnames? We should be using an-worker1118+ [19:04:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10Cmjohnson) @Dzahn Can you take a look at this, it's failing in the installer for raid. 
[19:04:28] * Urbanecm is going to deploy a security patch [19:08:06] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 54s) [19:08:08] urbanecm@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [19:08:20] ^^not sure who manages stashbot ^^ [19:08:27] sbassett: fyi ^ [19:08:36] bd808: is that you? [19:08:37] 10Operations, 10Analytics: eventgate-main latencies very high since the failover to codfw - https://phabricator.wikimedia.org/T261846 (10Ottomata) FYI I also increased partitions to 3 for resource_change as well. [19:09:10] Urbanecm: fwiw in the meantime you can retry or just edit https://wikitech.wikimedia.org/wiki/Server_Admin_Log directly [19:09:11] rzl: yeah. I'll see what happened [19:09:32] !log 21:08 <+logmsgbot> !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 54s) [19:09:34] Urbanecm: Failed to log message to wiki. Somebody should check the error logs. [19:09:39] retrying didn't help [19:09:41] thanks bd808 [19:12:13] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (10Cmjohnson) @jbond with the help of @dzahn we realized there are only 2 disks in this server and the partman configuration is for 4. [19:12:43] (03PS11) 10Jeena Huneidi: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [19:12:46] !log updating firmware on scs-c1-eqiad via T238036 [19:12:50] robh: Failed to log message to wiki. Somebody should check the error logs. 
[19:12:52] T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 [19:13:23] (03CR) 10Ottomata: [C: 03+1] Increase timeouts for connection to eventgate to match envoy config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622863 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [19:13:45] (03CR) 10Jeena Huneidi: Script to update image versions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [19:14:14] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:14:15] cmjohnson@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [19:14:40] cmjohnson1: see other channel, bd is investigating why the bot is failing to log [19:14:44] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Revert "Update T250887 mitigations" (duration: 00m 32s) [19:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:52] !log updating firmware on scs-c1-eqiad via T238036 [19:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:01] cmjohnson1: relog, it's working [19:16:24] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission weblog1001 (unrack or return to spares) - https://phabricator.wikimedia.org/T259217 (10Cmjohnson) [19:16:51] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission weblog1001 (unrack or return to spares) - https://phabricator.wikimedia.org/T259217 (10Cmjohnson) 05Open→03Resolved removed from rack, switch and ran decom script [19:17:03] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) scs-a1-eqiad firmware was 3.16.6u4, newest stable at this time is 4.9.0u1, updating [19:19:24] (03CR) 10Cwhite: [C: 03+1] alerts: combine alerts.wm.o and icinga.wm.o certificates [puppet] - 10https://gerrit.wikimedia.org/r/623848 
(https://phabricator.wikimedia.org/T261342) (owner: 10Herron) [19:19:25] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/24897/" [puppet] - 10https://gerrit.wikimedia.org/r/623078 (owner: 10Dzahn) [19:19:27] (03CR) 10Dzahn: [C: 03+2] scap: add data types, lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/623078 (owner: 10Dzahn) [19:20:14] !log scs-c1-eqiad firmware update complete and back online T238036 [19:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:20] T238036: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 [19:23:39] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:04] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:10] (03CR) 10Dzahn: "noop on deploy1001 and compiler showed noop on all these other hosts using scap classes" [puppet] - 10https://gerrit.wikimedia.org/r/623078 (owner: 10Dzahn) [19:25:22] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (10Cmjohnson) [19:25:55] 10Operations, 10Performance-Team, 10observability, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 (10Cmjohnson) [19:26:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Performance-Team (Radar): decom tungsten - https://phabricator.wikimedia.org/T260395 (10Cmjohnson) 05Open→03Resolved removed from rack, switch port and script update [19:26:26] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:53] (03PS6) 10Dzahn: labstore: add data types and some other style fixes [puppet] - 
10https://gerrit.wikimedia.org/r/622666 [19:29:33] (03CR) 10Dzahn: labstore: add data types and some other style fixes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/622666 (owner: 10Dzahn) [19:38:53] (03CR) 10Dzahn: "compiler thinks openstack::base::puppetmaster::frontend is not used by anything ... ?!" [puppet] - 10https://gerrit.wikimedia.org/r/621779 (owner: 10Dzahn) [19:44:44] 10Operations, 10ops-eqiad, 10decommission-hardware: Decommission old-msw1 - https://phabricator.wikimedia.org/T261449 (10Cmjohnson) 05Open→03Resolved wiped switch, switch is offline in netbox and removed from the rack. [19:47:05] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1096.eqiad.wmnet ` The log can be found in `/va... [19:49:04] (03CR) 10Dzahn: "ah.. right. the roles using these classes are not in production, so gotta compile it on https://openstack-browser.toolforge.org/puppetclas" [puppet] - 10https://gerrit.wikimedia.org/r/621779 (owner: 10Dzahn) [19:50:54] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad CPU usage over 85% - https://phabricator.wikimedia.org/T238036 (10RobH) 05Open→03Resolved a:05RobH→03None Firmware updated to the newest version. If it happens again, we can reopen and investigate with OpenGear. 
[19:53:38] (03PS1) 10Andrew Bogott: Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) [19:54:03] (03CR) 10jerkins-bot: [V: 04-1] Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) (owner: 10Andrew Bogott) [19:54:52] (03CR) 10Dzahn: [C: 03+2] "noop on cloud puppetmasters https://puppet-compiler.wmflabs.org/compiler1001/24900/ (the hosts shown as failing are 404 and not failing d" [puppet] - 10https://gerrit.wikimedia.org/r/621779 (owner: 10Dzahn) [19:55:38] (03PS2) 10Andrew Bogott: Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) [19:57:45] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:04] (03PS3) 10Andrew Bogott: Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) [19:59:44] (03CR) 10Dzahn: "wmcs people: could you confirm there is no more openstack::cumin on jessie?" [puppet] - 10https://gerrit.wikimedia.org/r/621369 (owner: 10Dzahn) [20:00:04] halfak and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T2000). 
[20:00:28] (03CR) 10jerkins-bot: [V: 04-1] Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) (owner: 10Andrew Bogott) [20:01:09] (03CR) 10Dzahn: [C: 04-1] "too early" [puppet] - 10https://gerrit.wikimedia.org/r/621368 (owner: 10Dzahn) [20:01:11] (03PS4) 10Andrew Bogott: Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) [20:02:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:00] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Nova: add a hack that validates server names [puppet] - 10https://gerrit.wikimedia.org/r/623856 (https://phabricator.wikimedia.org/T207538) (owner: 10Andrew Bogott) [20:05:19] (03PS3) 10Razzi: Add type annotation to profile::analytics::cluster::packages::common [puppet] - 10https://gerrit.wikimedia.org/r/623428 (https://phabricator.wikimedia.org/T252617) (owner: 10Ottomata) [20:06:17] (03PS4) 10Ottomata: Add type annotation to profile::analytics::cluster::packages::common [puppet] - 10https://gerrit.wikimedia.org/r/623428 (https://phabricator.wikimedia.org/T252617) [20:07:22] (03PS3) 10Razzi: admin: Add type annotation [puppet] - 10https://gerrit.wikimedia.org/r/623842 (https://phabricator.wikimedia.org/T252617) [20:07:42] (03CR) 10Andrew Bogott: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/621369 (owner: 10Dzahn) [20:08:07] (03PS1) 10Cmjohnson: adding production dns ipv4/ipv6 for kubernetes1017 [dns] - 10https://gerrit.wikimedia.org/r/623859 (https://phabricator.wikimedia.org/T258747) [20:08:28] 10Operations, 10DC-Ops, 10fundraising-tech-ops: RAID controller failing on frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T261221 (10Jclark-ctr) [20:08:33] (03CR) 10Ottomata: [C: 03+2] Add type annotation 
to profile::analytics::cluster::packages::common [puppet] - 10https://gerrit.wikimedia.org/r/623428 (https://phabricator.wikimedia.org/T252617) (owner: 10Ottomata) [20:08:43] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install new controller into frdb1001 OR add to spares - https://phabricator.wikimedia.org/T261348 (10Jclark-ctr) [20:09:11] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install new controller into frdb1001 OR add to spares - https://phabricator.wikimedia.org/T261348 (10Jclark-ctr) received raid controller [20:10:12] (03PS3) 10Dzahn: cache::base: replace hiera() with lookup(), add data types [puppet] - 10https://gerrit.wikimedia.org/r/623662 [20:11:04] (03PS1) 10Cmjohnson: Adding mac address for kubernetes1017 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/623860 (https://phabricator.wikimedia.org/T258747) [20:12:21] (03PS2) 10Cmjohnson: adding production dns ipv4/ipv6 for kubernetes1017 [dns] - 10https://gerrit.wikimedia.org/r/623859 (https://phabricator.wikimedia.org/T258747) [20:13:16] (03CR) 10Cmjohnson: [C: 03+2] adding production dns ipv4/ipv6 for kubernetes1017 [dns] - 10https://gerrit.wikimedia.org/r/623859 (https://phabricator.wikimedia.org/T258747) (owner: 10Cmjohnson) [20:15:20] (03CR) 10Cmjohnson: [C: 03+2] Adding mac address for kubernetes1017 to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/623860 (https://phabricator.wikimedia.org/T258747) (owner: 10Cmjohnson) [20:15:22] (03CR) 10Ottomata: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/24904/" [puppet] - 10https://gerrit.wikimedia.org/r/623842 (https://phabricator.wikimedia.org/T252617) (owner: 10Razzi) [20:15:27] (03PS4) 10Ottomata: admin: Add type annotation [puppet] - 10https://gerrit.wikimedia.org/r/623842 (https://phabricator.wikimedia.org/T252617) (owner: 10Razzi) [20:16:18] ottomata you can merge my change with yours [20:16:46] (03CR) 10Ottomata: [C: 03+2] admin: Add type annotation [puppet] - 10https://gerrit.wikimedia.org/r/623842 
(https://phabricator.wikimedia.org/T252617) (owner: 10Razzi) [20:16:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:17:19] ^ looks real [20:20:39] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:20:47] probably anyway, I am unable to get Kibana to tell me anything useful about it [20:21:28] Wikimedia\Rdbms\LoadBalancer::waitForMasterPos: timed out waiting on {dbserver} pos {pos} [{seconds}s] [20:21:31] Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode [20:21:33] uhh [20:23:15] that seems to be almost exclusively apiservers [20:23:31] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:23:59] (03PS1) 10Cmjohnson: Adding kubernetes1017 to netboot.cfg file [puppet] - 10https://gerrit.wikimedia.org/r/623861 (https://phabricator.wikimedia.org/T258747) [20:24:35] s2 in particular stands out [20:24:42] (03CR) 10Cmjohnson: [C: 03+2] Adding kubernetes1017 to netboot.cfg file [puppet] - 10https://gerrit.wikimedia.org/r/623861 (https://phabricator.wikimedia.org/T258747) (owner: 10Cmjohnson) [20:25:03] https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?viewPanel=2&orgId=1&var-site=codfw looks fine tho [20:25:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10Cmjohnson) [20:25:47] rzl: I am similarly baffled and am wondering if things are being complicated by whatever is going on with rsyslog [20:26:14] /11/ [20:26:16] query traffic on s2 spike? https://tendril.wikimedia.org/host/view/db2107.codfw.wmnet/3306 [20:26:53] mutante: sorry, which time range are you pointing towards there? 
[20:27:37] thanks cmjohnson1 merged, [20:27:38] (03PS1) 10Cmjohnson: Adding kubernetes1017 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623862 (https://phabricator.wikimedia.org/T258747) [20:28:11] (03PS2) 10Cmjohnson: Adding kubernetes1017 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623862 (https://phabricator.wikimedia.org/T258747) [20:28:48] cdanis: i was looking at the "Query Traffic - 7d /1 h" but ignore me, that is just the switch yesterday i guess [20:28:55] :) [20:28:55] (03CR) 10Cmjohnson: [C: 03+2] Adding kubernetes1017 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/623862 (https://phabricator.wikimedia.org/T258747) (owner: 10Cmjohnson) [20:31:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:32:48] (03PS11) 10Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) [20:33:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1017.eqi... [20:33:44] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [20:35:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:39:57] (03PS2) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [20:39:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:41:00] (03CR) 10jerkins-bot: [V: 04-1] prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 (owner: 10Dzahn) [20:42:31] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:43:14] looking at `wdqs` now [20:43:17] same [20:43:29] wdqs2007 up for me [20:43:57] ryankemper: so from the past the fix i know was "restart blazegraph" [20:44:05] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [20:44:09] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, 
wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [20:44:28] yup it was, looks like all of codfw is impacted by the above [20:44:36] going to restart blazegraph across all affected nodes [20:44:40] ack [20:44:58] it has high load and blazegraph is top process using all the CPU [20:46:28] !log `sudo cumin -b10 'P{wdqs2*} and not A:wdqs-test and not A:wdqs-internal and not P{wdqs2001.codfw.wmnet}' "sudo systemctl restart wdqs-blazegraph.service"` (restarted everything but 2001, will restart 2001 next) [20:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:49] !log restarted blazegraph on `wdqs2001` as well [20:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:37] 2007 is MUCH more busy than 2008 [20:49:29] ok, that makes sense since odd numbers = public, even numbers = internal [20:50:25] Yeah cpu % for blazegraph is similar across 2001 and 2007 [20:50:56] 2007 happens to be working harder but not by an order of magnitude or anything [20:51:01] ok [20:51:14] I'd expect the checks to resolve really soon if the problem isn't still currently occuring [20:51:17] Going to look at some graphs [20:53:02] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1096.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1096.eqiad.wmnet'] ` [20:54:59] Graphite's still got null values for all the codfw nodes' triple counts: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&orgId=1&refresh=1m&var-cluster_name=wdqs [20:55:25] next [20:55:45] jouncebot: next [20:55:45] In 0 hour(s) and 4 minute(s): Striker (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T2100) [20:56:13] We should probably route traffic to eqiad instead of codfw [20:56:22] Let me see if the codfw nodes complete test 
queries real quick [20:58:20] The curls are just hanging - i'd imagine that means nginx is responding but blazegraph isn't picking up on the other end [20:58:34] Let's do a dns depool to cut over to eqiad [20:58:57] my only concern is if this is a query of death type scenario (which is likely based off our last incident) eqiad might get toppled as well, but that's no worse than where we're at right now [21:00:01] mutante had the idea of restarting nginx so I'll try that first to rule that out before trying a dns depool [21:00:04] bd808: It is that lovely time of the day again! You are hereby commanded to deploy Striker. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T2100). [21:00:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [21:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:15] * bd808 starts prepping for his deploy [21:01:54] !log Restarted nginx on `wdqs2007` [21:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:11] https://www.irccloud.com/pastebin/jYR5H6xT/Here's%20nginx%20status%20on%20a%20random%20node%20(%60wdqs2001%60).%20The%20%60Process%3A%2027515%20ExecStop%3D%2Fsbin%2Fstart-stop-daemon%20--quiet%20--stop%20--retry%20QUIT%2F5%20--pidfile%20%2Frun%2Fnginx.pid%20(code%3Dexited%2C%20status%3D2)%60%20is%20jumping%20out%20at%20me [21:03:24] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:04:24] Restarting nginx on 2007 didn't help, its status looks like the above `wdqs2001`. 
This is interesting from the systemctl service logs though: [21:04:27] https://www.irccloud.com/pastebin/mEn7069D/ [21:04:47] `nginx.service: Unit entered failed state.` / `nginx.service: Failed with result 'timeout'.` [21:05:24] bd808: depending on when you're finished prepping, if we could delay the striker deploy a bit that'd help cut down on noise [21:05:42] ryankemper: *nod* I'm ready whenever the road is clear :) [21:05:53] * ryankemper frantically tries to clear road [21:05:56] but fix your stuffs. I can wait [21:05:59] thanks [21:06:28] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:07:56] So I think nginx is working fine and blazegraph is just never responding [21:08:26] Queries on query.wikidata.org are just hanging forever as well [21:08:28] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:08:56] (03PS1) 10Ppchelko: Beta: expose experimental OAuth routes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623867 (https://phabricator.wikimedia.org/T257982) [21:09:06] (03CR) 10Ppchelko: [C: 03+2] Beta: expose experimental OAuth routes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623867 (https://phabricator.wikimedia.org/T257982) (owner: 10Ppchelko) [21:09:11] I might be misinterpreting this logs but we might be getting hit with repeated malformed queries? [21:09:15] most queries on 2007 seemed to come from https://labs.minutelabs.io/Tree-of-Life-Explorer/#/ but it stopped now [21:09:29] here's a snippet from `journalctl -u wdqs-blazegraph` https://www.irccloud.com/pastebin/bEzuSLru/ [21:09:46] (03Merged) 10jenkins-bot: Beta: expose experimental OAuth routes. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/623867 (https://phabricator.wikimedia.org/T257982) (owner: 10Ppchelko) [21:10:00] Okay, I'm going to try restarting blazegraph everywhere again, in the hopes that whoever is hammering it has stopped [21:10:25] !log `sudo cumin -b10 'P{wdqs2*} and not A:wdqs-test and not A:wdqs-internal' "sudo systemctl restart nginx.service"` [21:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:34] Meh, copy-paste error [21:10:47] !log `sudo cumin -b10 'P{wdqs2*} and not A:wdqs-test and not A:wdqs-internal' "sudo systemctl restart wdqs-blazegraph.service"` [21:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:00] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 6.546 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:11:01] ryankemper: yes, that does look like malformed queries [21:11:29] Not clear why malformed queries would lock blazegraph up, but could be triggering the deadlock bug we have a (long-living) ticket open for [21:11:58] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.665 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:12:04] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.346 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:12:10] yay [21:12:26] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:13:47] ryankemper: no more "Malformed" showing up now in the log.. 
but earlier there were a lot [21:14:01] so looks like you were right [21:14:57] * mutante reschedules remaining icinga checks [21:15:12] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:15:31] Now to see if we spot malformed queries again [21:15:36] If so it's IP ban time [21:15:56] what changed now is that it is 2002/2003 and not 2007/2008 [21:16:10] Well it was all 4 of them not long ago [21:16:14] ok [21:16:20] mutante: btw it's not quite even/odd for internal/external [21:16:22] good sign then [21:16:29] our externals are 2001, 2002, 2003, 2007 [21:16:36] oh, ok [21:16:42] (per https://config-master.wikimedia.org/pybal/codfw/wdqs since I always forget) [21:17:22] *nod*. nothing here: root@wdqs2001:~# tail -f /var/log/wdqs/wdqs-blazegraph* | grep Malformed [21:17:37] 10Operations, 10Mail: Changing the name of the tm_enforcement@wikimedia.org email address - https://phabricator.wikimedia.org/T261903 (10Reedy) [21:18:52] Seeing malformed again on 2007, I'd expect 2007 to go back to critical soon if the theory holds up [21:18:54] https://www.irccloud.com/pastebin/SzQpjLTC/ [21:19:12] `ERROR c.b.r.sail.webapp.BigdataRDFServlet - cause=java.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException: Encountered " ")" ") "" at line 7, column 23.` is the relevant line from the above [21:20:02] oops accidentally captured a user agent in the above, my bad :x [21:20:30] ryankemper: you have a way to ban them? [21:20:48] 10Operations, 10Mail: Changing the name of the tm_enforcement@wikimedia.org email address - https://phabricator.wikimedia.org/T261903 (10Dzahn) That address is on Google: ` [mx1001:~] $ sudo exim4 -bt tm_enforcement@wikimedia.org tm_enforcement@wikimedia.org router = ldap_account, transport = remote_smtp... 
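A minimal sketch of the log check being done by hand above (`tail -f ... | grep Malformed`), assuming the Blazegraph log lines look like the `MalformedQueryException` line quoted at 21:19:12. The function name `summarize_malformed` and the `sample` list are hypothetical and only for illustration; nothing here is taken from the actual wdqs tooling.

```python
import re
from collections import Counter

# Matches the exception class name seen in the quoted Blazegraph log line.
MALFORMED = re.compile(r"MalformedQueryException")
# Assumption: the parser message carries "at line N, column M" as in the quoted line.
POSITION = re.compile(r"at line (\d+), column (\d+)")

def summarize_malformed(lines):
    """Return (count, Counter of (line, column)) for malformed-query errors."""
    count = 0
    positions = Counter()
    for entry in lines:
        if not MALFORMED.search(entry):
            continue
        count += 1
        match = POSITION.search(entry)
        if match:
            positions[(int(match.group(1)), int(match.group(2)))] += 1
    return count, positions

# Hypothetical input: one line copied from the chat, one unrelated line.
sample = [
    'ERROR c.b.r.sail.webapp.BigdataRDFServlet - cause=java.util.concurrent.ExecutionException: '
    'org.openrdf.query.MalformedQueryException: Encountered " ")" ") "" at line 7, column 23.',
    "INFO unrelated line",
]

count, positions = summarize_malformed(sample)
print(count, positions.most_common(1))
```

If the same (line, column) position dominates the tally, that would be consistent with one client repeating the same broken query, which is the theory being tested in the channel.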
[21:21:51] mutante: Yeah we can set a ban in conftool, someone else did it last time so currently trying to dig up the command [21:22:14] And need to make sure we get the right source IP and/or user agent [21:24:08] ryankemper: I don't believe it's in conftool, I believe we edited the nginx configuration in Puppet [21:24:29] ack, so whatever we did last time wouldn't be in the SAL? [21:24:38] that's where I was about to look [21:25:05] I don't think so, it was a Puppet patch IIRC [21:25:30] well -- there have been Puppet patches in the past to block specific UA strings, and I think also private Puppet patches to block specific IP ranges (at the Traffic layer) [21:26:04] for instance https://gerrit.wikimedia.org/r/c/operations/puppet/+/518691 [21:26:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/552540 [21:27:23] cdanis: last time you did `conftool action : set/pooled=false; selector: dnsdisc=wdqs.*,name=codfw` [21:27:39] that depooled wdqs@codfw [21:27:59] before we realized it was bad traffic, and thought it was just the codfw machines mysteriously misbehaving [21:28:21] doh, yup [21:28:23] and then when the eqiad machines started acting similarly, we realized it must be bad traffic that then got re-loadbalanced there [21:28:39] yeah, wasn't thinking [21:28:42] the patch mutante dug up makes sense [21:28:49] so now just need to figure out who to ban [21:28:51] by far the most client connections come from mwdebug2001 [21:28:52] yeah I think last time we banned the IPs in puppet private [21:29:05] if there's a specific UA string at fault, though, I'd try that first [21:30:10] 10Operations, 10Mail: Changing the name of the tm_enforcement@wikimedia.org email address - https://phabricator.wikimedia.org/T261903 (10bcampbell) 05Open→03Resolved a:03bcampbell Thanks. I found it. Sorry about that. Resolved. 
[21:30:25] ryankemper: last time it was 93fa542e in the puppet private repo [21:35:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1017.eqiad.wmnet'] ` and were **ALL** successful. [21:37:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:38:58] PROBLEM - ping-offload grafana alert on icinga1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [21:39:02] PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:40:10] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:40:24] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [21:40:50] ryankemper: I *think* my deploy will take about 60s, and it is a scap3 thing unrelated to anything you are working on. Do you mind if I give it a shot? 
[21:40:52] RECOVERY - ping-offload grafana alert on icinga1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [21:40:52] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:40:54] RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:41:56] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [21:42:04] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:42:36] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 47.71 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:43:20] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [21:44:34] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 90.22 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:44:34] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:45:06] PROBLEM - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [21:45:26] prometheus are you okay [21:45:47] I asked in the o11y channel, maybe there was a configuration change that caused it [21:45:54] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:45:58] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:46:17] I do think that the Prometheus blips caused the traffic drop alerts [21:46:18] * bd808 jumps the queue [21:46:25] !log bd808@deploy1001 Started deploy [striker/deploy@3c2090a]: Deploying r20200902 tag (T198114, T223610, T245804, T144111, T261810) [21:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] T144111: Allow self-service creation of Maniphest projects for Tools - https://phabricator.wikimedia.org/T144111 [21:46:39] T198114: Allow 
tool maintainers to delete toolinfo via ToolsAdmin - https://phabricator.wikimedia.org/T198114 [21:46:39] T223610: Tool creation and toolinfo create/edit forms are the same causing confusion about field mutability/use - https://phabricator.wikimedia.org/T223610 [21:46:40] T245804: Reassign base URLs for toolinfo records' web service links - https://phabricator.wikimedia.org/T245804 [21:46:40] T261810: Striker not compatible with mwoauth>=0.3.5 - https://phabricator.wikimedia.org/T261810 [21:47:46] cdanis: yeah almost deifnitely [21:47:59] !log bd808@deploy1001 Finished deploy [striker/deploy@3c2090a]: Deploying r20200902 tag (T198114, T223610, T245804, T144111, T261810) (duration: 01m 34s) [21:48:03] it really looks like it [21:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:56] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:50:04] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:50:24] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.330 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:54:08] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:04:14] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:05:02] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible 
monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [22:05:02] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [22:06:52] RECOVERY - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [22:07:48] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [22:08:12] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [22:08:18] Sorry for the delay all, so quick update we've found a set of user agents responsible for malformed queries [22:08:52] ryankemper: I snuck my deploy in, so you are all good :) [22:08:55] Working on editing the nginx config on a node to block a regex matching the agents (they're all very similarly named), then once that's working on one we'll open up a puppet patch, deploy it, restart blazegraph everywhere and we hopefully should be good [22:09:03] bd808: perfect [22:09:19] jouncebot: next [22:09:20] In 0 hour(s) and 50 minute(s): Evening backport 
window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T2300) [22:11:44] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:13:04] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [22:13:26] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [22:15:10] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:15:56] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [22:17:56] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:20:19] (03PS1) 10Ryan Kemper: wdqs: ban UA sending malformed queries [puppet] - 10https://gerrit.wikimedia.org/r/623876 [22:20:50] * gehel is looking at ^ [22:20:55] ryankemper: where was this tested? 
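A minimal sketch of how a user-agent ban pattern like the one in the patch above could be checked against known-good and known-bad agents before rollout. The offending user agents are not shown in the log, so `ExampleBot` and the pattern below are placeholders, and `status_for` is a hypothetical helper that only mimics a "403 on match" rule; it is not the nginx configuration from the patch.

```python
import re

# Placeholder pattern: the real user agents are not shown in the log.
BANNED_UA = re.compile(r"^ExampleBot/\d+", re.IGNORECASE)

def status_for(user_agent: str) -> int:
    """Return 403 if the user agent matches the ban pattern, else 200."""
    if user_agent and BANNED_UA.search(user_agent):
        return 403
    return 200

# Hypothetical agents used only to exercise the pattern.
for agent in ("ExampleBot/12 (test)", "Mozilla/5.0", ""):
    print(repr(agent), status_for(agent))
```

Running the pattern over a handful of legitimate agents as well as the suspect ones gives some confidence that the rule blocks what it should without catching ordinary clients, which is the part a plain nginx restart does not exercise.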
[22:21:02] PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:21:26] gehel: tested on `wdqs2007` [22:21:36] I just tested that nginx comes back up, not that it's banning properly [22:21:43] (But not sure how often we get malformed queries to be able to test that functionality) [22:21:44] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:22:09] the 503s are new ? [22:22:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:22:34] (03CR) 10Gehel: [C: 03+2] wdqs: ban UA sending malformed queries [puppet] - 10https://gerrit.wikimedia.org/r/623876 (owner: 10Ryan Kemper) [22:22:35] yeah, they are [22:22:45] (03CR) 10CDanis: [C: 03+1] wdqs: ban UA sending malformed queries [puppet] - 10https://gerrit.wikimedia.org/r/623876 (owner: 10Ryan Kemper) [22:23:36] ryankemper: merged, I'll let you run puppet on the wdqs servers [22:23:45] Cool, doing that now [22:23:54] nginx should be reloaded automatically by puppet, but please check that this is really the case [22:24:08] (and if it's not, make a note to fix it :) [22:24:30] !log `sudo cumin -b10 'P{wdqs2*} and not A:wdqs-test and not A:wdqs-internal' "sudo run-puppet-agent"` [22:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:06] PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:25:28] PROBLEM - 
Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:26:15] ryankemper: you'll probably need to restart blazegraph as well [22:26:16] PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:26:36] gehel: Will definitely need to [22:26:59] !log Puppet finished on all external wdqs codfw nodes, nginx automatically reloaded as intended [22:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:19] !log `sudo cumin -b10 'P{wdqs2*} and not A:wdqs-test and not A:wdqs-internal' "sudo systemctl restart wdqs-blazegraph.service"` [22:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:28:02] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:28:02] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:28:08] RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:28:16] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:28:24] that was quick after the merge, nice [22:28:26] RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:28:44] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 77 probes of 565 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:28:48] RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:29:12] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:29:14] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:31:33] ryankemper: \o/ [22:33:45] woot [22:33:52] we should start sending out tshirts to users who topple wdqs [22:34:34] RECOVERY - IPv6 ping to eqiad on 
ripe-atlas-eqiad IPv6 is OK: OK - failed 51 probes of 565 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:36:14] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [22:36:38] ryankemper: maybe the t-shirt to the guy who actually put a proper UA with email and none to "My User Agent" :) [22:37:12] plus we'd end up sending the tshirts to one of aws' datacenters anyway :P [22:37:23] heh, yea [22:39:04] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [22:55:33] !log restart rsyslog on centrallog[12]001 [22:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200902T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:02:49] (PS1) Cwhite: admin: move holger to deployers [puppet] - https://gerrit.wikimedia.org/r/623878 (https://phabricator.wikimedia.org/T261754) [23:13:23] (CR) Dzahn: [C: +1] "looks good. I am not sure if it's really a written rule but usually we get a +1 from somebody in releng for these." 
[puppet] - https://gerrit.wikimedia.org/r/623878 (https://phabricator.wikimedia.org/T261754) (owner: Cwhite) [23:16:34] (PS3) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 [23:17:39] (CR) jerkins-bot: [V: -1] prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn) [23:19:11] (PS2) Ppchelko: Increase timeouts for connection to eventgate to match envoy config [mediawiki-config] - https://gerrit.wikimedia.org/r/622863 (https://phabricator.wikimedia.org/T249745) [23:20:52] Operations, Wikimedia-Mailing-lists: Mailing list for UG Greece - https://phabricator.wikimedia.org/T261607 (colewhite) Open→Resolved The list is now available. Administrative interface can be found [[ https://lists.wikimedia.org/mailman/admin/wikimedia-gr | here ]]. Subscription interface can... [23:41:52] (CR) Thcipriani: [C: +1] admin: move holger to deployers [puppet] - https://gerrit.wikimedia.org/r/623878 (https://phabricator.wikimedia.org/T261754) (owner: Cwhite) [23:59:36] (PS4) Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 [23:59:52] (CR) Dzahn: [C: +2] admin: move holger to deployers [puppet] - https://gerrit.wikimedia.org/r/623878 (https://phabricator.wikimedia.org/T261754) (owner: Cwhite)