[00:00:04] Deploy window No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000) [00:37:36] (03PS2) 10Smalyshev: Set up dumps for mediainfo RDF generation [puppet] - 10https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917) [01:11:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [01:21:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [02:41:07] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 27747352 and 0 seconds [02:44:03] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 68232 and 45 seconds [02:46:31] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 854.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [02:56:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [03:21:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [04:05:05] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:14:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: 
host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:15:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:25:51] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [04:26:49] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [04:38:11] the cr2 alerts are for planned maintenance, window runs for another 3+ hours [04:41:53] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [04:52:57] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [04:53:27] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [04:57:21] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:58:01] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 
208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:10:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:10:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:10:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:12:03] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [06:14:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:14:53] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:19:19] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [06:20:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:26:33] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:33:35] PROBLEM - puppet last run on db2085 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:00:45] RECOVERY - puppet last run on db2085 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:34:32] 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) I have started a compare for main tables on s3 wikis. [07:44:45] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational [07:50:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10elukey) >>! In T219544#5248264, @Ottomata wrote: > Ok! Creds deployed, and oozie job merged. Refinery will be deployed this week and we can tr... 
[07:51:31] 10Operations, 10Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (10ArielGlenn) p:05Triage→03Normal [07:52:32] 10Operations, 10OTRS: Migrate mendelevium/OTRS host to Stretch/Buster - https://phabricator.wikimedia.org/T224590 (10ArielGlenn) p:05Triage→03Normal [07:52:46] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10ArielGlenn) p:05Triage→03Normal [07:53:00] 10Operations: Migrate fermium to stretch/buster - https://phabricator.wikimedia.org/T224586 (10ArielGlenn) p:05Triage→03Normal [07:53:19] 10Operations, 10cloud-services-team: Migrate labmon* to Stretch - https://phabricator.wikimedia.org/T224585 (10ArielGlenn) p:05Triage→03Normal [07:53:40] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (10ArielGlenn) p:05Triage→03Normal [07:53:57] 10Operations, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10ArielGlenn) p:05Triage→03Normal [07:54:08] 10Operations, 10Wikimedia-Etherpad, 10serviceops: Migrate etherpad1001 to Stretch/Buster - https://phabricator.wikimedia.org/T224580 (10ArielGlenn) p:05Triage→03Normal [07:54:44] 10Operations: Migrate irc.wikimedia.org/kraz to Stretch/Buster - https://phabricator.wikimedia.org/T224579 (10ArielGlenn) p:05Triage→03Normal [07:55:02] 10Operations, 10serviceops, 10Kubernetes: Migrate etcd networking cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224577 (10ArielGlenn) p:05Triage→03Normal [07:55:22] 10Operations: Upgrade install servers to Stretch/Buster - https://phabricator.wikimedia.org/T224576 (10ArielGlenn) p:05Triage→03Normal [07:55:38] 10Operations, 10serviceops, 10Kubernetes: Migrate Kubernetes etcd clusters to Stretch/Buster - https://phabricator.wikimedia.org/T224574 (10ArielGlenn) 
p:05Triage→03Normal [07:55:57] 10Operations, 10serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10ArielGlenn) p:05Triage→03Normal [07:56:11] 10Operations: Migrate auth* servers to Stretch/Buster - https://phabricator.wikimedia.org/T224571 (10ArielGlenn) p:05Triage→03Normal [07:56:25] 10Operations, 10Pybal, 10Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (10ArielGlenn) p:05Triage→03Normal [07:56:44] 10Operations, 10ORES, 10Scoring-platform-team, 10serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (10ArielGlenn) p:05Triage→03Normal [07:56:53] 10Operations, 10Kubernetes: Migrate etcd cluster for Kubernetes staging cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224568 (10ArielGlenn) p:05Triage→03Normal [07:57:18] 10Operations, 10serviceops: Migrate debug proxies to Stretch/Buster - https://phabricator.wikimedia.org/T224567 (10ArielGlenn) p:05Triage→03Normal [07:57:28] 10Operations: Migrate mwlog/udp2log servers to Stretch/Buster - https://phabricator.wikimedia.org/T224565 (10ArielGlenn) p:05Triage→03Normal [07:57:42] 10Operations: Reimage wezen to Stretch (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (10ArielGlenn) p:05Triage→03Normal [07:58:02] 10Operations: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) p:05Triage→03Normal a:03ArielGlenn [07:58:16] 10Operations, 10cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (10ArielGlenn) p:05Triage→03Normal [07:58:25] 10Operations, 10serviceops: Migrate Zookeeper/etcd conf cluster in codfw to Stretch - https://phabricator.wikimedia.org/T224560 (10ArielGlenn) p:05Triage→03Normal [07:58:38] 10Operations, 10Traffic, 10serviceops: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 
(10ArielGlenn) p:05Triage→03Normal [07:58:53] 10Operations: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557 (10ArielGlenn) p:05Triage→03Normal [07:59:11] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (10ArielGlenn) p:05Triage→03Normal [07:59:37] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (10ArielGlenn) p:05Triage→03Normal [07:59:47] 10Operations: Migrate URL downloaders to Stretch/Buster - https://phabricator.wikimedia.org/T224551 (10ArielGlenn) p:05Triage→03Normal [08:03:21] 10Operations, 10Gerrit, 10Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (10ArielGlenn) p:05Triage→03Normal [08:04:39] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10ArielGlenn) p:05Triage→03Normal [08:05:41] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:07:35] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:18:54] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2003.codfw.wmnet'] ` The log can be found in 
`/var/log/wmf-auto-reimage/201906110818_gehel_7... [08:20:42] 10Operations, 10Gerrit, 10Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (10hashar) The TLS stack is just fine and the query does reach the Apache in front of Gerrit,. The reason is the OVH one is being rejected by our configura... [08:21:31] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:22:20] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:23:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:23:24] <_joe_> what's up with kartotherian? [08:23:25] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:23:27] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:23:29] <_joe_> gehel, onimisionipe? 
[08:23:49] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [08:24:09] looking [08:24:14] codfw still depooled, no direct impact [08:24:29] but looks like not enough servers repooled in that cluster [08:25:13] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [08:25:16] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [08:26:02] gehel: logstash is showing lots of 500s for maps [08:26:08] <_joe_> no gehel it's pooled [08:26:28] !log repooling maps200[124] [08:26:29] <_joe_> see https://config-master.wikimedia.org/discovery/services.yaml [08:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:34] should be good in a second [08:26:59] 2003 was the lat pooled node and now we reimaging [08:27:04] *last [08:27:05] are their space issues all set then? (it was those hosts with space issues after reimage right?) [08:27:19] apergos: we're mostly good now [08:27:25] ah great news [08:27:33] <_joe_> onimisionipe: so you need to depool codfw when you do something like that [08:27:33] yea [08:27:52] _joe_: noted!
[08:28:35] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10akosiaris) Just as an FYI, everything looks ok on this end, but there's a train freeze this week, so we have to wait before dep... [08:30:00] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) @akosiaris Great! Thanks for that. We look forward to seeing how it all goes forward post-offsite :) [08:30:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:30:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:30:41] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:30:41] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:31:41] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:33:05] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [08:35:19] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [08:35:22] 10Operations, 10Gerrit, 10Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (10hashar) Since I am not familiar with that specific configuration and there are private data involved (IP address of the machine), I have filled **a priv... [08:35:23] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [08:37:43] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Gehel) @Cmjohnson any news on this? Do you need anything from our side? [08:39:23] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:10:45] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:19:33] RECOVERY - Disk space on ms-be2018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [09:20:00] !log free up space wrongly allocated onto / with sdc1 umounted on ms-be2018 [09:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:12] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2003.codfw.wmnet'] ` and were **ALL** successful. [09:22:18] oh it was sdc1! heh [09:22:24] thanks go dog [09:26:37] apergos: np! yeah as you mentioned space got wrongly written there while unmounted [09:26:53] how did you prove it? depool, umount, clean up? [09:27:04] (and what about rebalancing the rings and etc?) [09:27:26] yeah exactly, stop swift/rsync, umount, delete, mount [09:27:38] I'm writing it down on wikitech [09:28:01] oh cool, thanks! 
I had a feeling the docs there were pretty outa date [09:38:04] yeah some are indeed outdated :| [09:41:22] bug 1: in perpetuity [09:43:03] (03CR) 10DCausse: "I have a patch chain which includes the same patch + some refactoring which are now possible: https://gerrit.wikimedia.org/r/q/topic:%2251" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514994 (https://phabricator.wikimedia.org/T87892) (owner: 10Smalyshev) [09:56:46] !log gehel@cumin2001 START - Cookbook sre.postgresql.postgres-init [09:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:56] (03PS2) 10Cparle: Add 'sms' langcode to beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) [10:21:26] (03PS3) 10Cparle: Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) [10:29:30] (03PS2) 10CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 [10:31:48] (03PS1) 10Ema: admin: add ema's new yubikey [puppet] - 10https://gerrit.wikimedia.org/r/516469 [10:34:42] (03CR) 10Vgutierrez: [C: 03+2] "key verified via email && in person :)" [puppet] - 10https://gerrit.wikimedia.org/r/516469 (owner: 10Ema) [10:35:49] (03CR) 10jerkins-bot: [V: 04-1] Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (owner: 10CRusnov) [10:54:12] !log wipe fs on ms-be1033 data partitions - T223518 [10:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:24] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 [11:29:49] (03CR) 10Matthias Mullie: [C: 03+2] Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: 10Cparle) [11:30:54] (03Merged) 10jenkins-bot: Add 'smn' and 'sms' 
langcodes to beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: 10Cparle) [11:32:35] (03CR) 10jenkins-bot: Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: 10Cparle) [11:50:53] (03PS1) 10Michael Große: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) [11:52:20] !log gehel@cumin2001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [11:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:36] PROBLEM - cassandra CQL 10.192.32.146:9042 on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:52:42] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:53:26] PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:53:30] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:53:36] PROBLEM - cassandra service on maps2003 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:58:24] I'm looking [12:00:32] ^ downtime expired, just extended [12:05:36] RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [12:06:22] RECOVERY - tileratorui on maps2003 is OK: 
HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [12:06:26] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:08:00] RECOVERY - cassandra service on maps2003 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:12:25] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (10ArielGlenn) Still waiting on @Tobi_WMDE_SW [12:21:02] (03PS1) 10Marostegui: Revert "db1077: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/516481 [12:21:09] (03CR) 10Marostegui: [C: 04-2] "Not ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/516481 (owner: 10Marostegui) [12:52:58] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10WMDE-leszek) Apologies for the late reaction. It's been a longer weekend for me. Thanks a lot @RStallman-legalteam for checking, and reaching out... 
[12:53:37] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (10fgiunchedi) 05Resolved→03Open a:05Cmjohnson→03fgiunchedi [12:54:13] !log swift eqiad-prod: put back ms-be1033 - T223518 [12:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:22] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 [13:18:44] (03PS2) 10Michael Große: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) [13:20:09] (03PS3) 10Michael Große: Set EntityUsageTable addUsage batch size to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500) [13:26:48] (03PS3) 10CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 [13:26:57] (03CR) 10Elukey: "Thanks Andrew! Followed up with more questions :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:34:12] (03CR) 10jerkins-bot: [V: 04-1] Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (owner: 10CRusnov) [13:40:43] (03PS5) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) [13:40:55] (03CR) 10Mathew.onipe: add WDQS reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe) [13:41:29] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Krinkle) +1 to close :) Back when the list was still private, it also had some overlap "ops-l", which continues to be a private list for anyone with production access (volunteers, WMDE, W... 
[13:41:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Ottomata) Already done in T220081 :) [13:48:15] (03PS1) 10Paladox: Raise recieve.maxBatchChanges To 20 [puppet] - 10https://gerrit.wikimedia.org/r/516492 [13:48:49] (03PS2) 10Paladox: Raise recieve.maxBatchChanges To 20 [puppet] - 10https://gerrit.wikimedia.org/r/516492 [13:49:14] To -> to ? [13:49:20] (03PS3) 10Paladox: Gerrit: raise recieve.maxBatchChanges To 20 [puppet] - 10https://gerrit.wikimedia.org/r/516492 [13:50:12] hauskatze: oh yes [13:50:32] (03PS4) 10Paladox: Gerrit: raise recieve.maxBatchChanges to 20 [puppet] - 10https://gerrit.wikimedia.org/r/516492 [13:50:49] Ortography C+1 :D [13:51:15] (03CR) 10Reedy: [C: 03+1] "+1 from me" [puppet] - 10https://gerrit.wikimedia.org/r/516492 (owner: 10Paladox) [13:54:01] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10alaa_wmde) hi @Nuria Yeap I actually have access to logstash already. I must have confused it somehow into thinking that there's another logsta... [13:55:01] (03CR) 10Ottomata: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:55:58] (03CR) 10Thcipriani: "Per the docs:" [puppet] - 10https://gerrit.wikimedia.org/r/516492 (owner: 10Paladox) [13:58:37] 10Operations, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10hashar) Removing #release-engineering-team since there is already deployment/logstash access :] @alaa_wmde you might already have access to http://pivot.wikimedia.org/ which... 
[14:09:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:09:20] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:11:55] (03CR) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [14:12:14] that's telia down ulsfo/eqord btw [14:13:45] what's the circuit? I couldn't find planned maintenance for telia just now [14:14:17] (03Abandoned) 10Paladox: Gerrit: raise recieve.maxBatchChanges to 20 [puppet] - 10https://gerrit.wikimedia.org/r/516492 (owner: 10Paladox) [14:14:31] ah yeah found it, telia sent a maint-announce@ email, suspected card failure [14:14:38] half an hour ago that is [14:14:40] oh the one that says [14:14:49] Please note that if your service is a protected service, you should not experience any issues as your service is running on a protected path. [14:14:54] and of course ours isn't? nice [14:15:24] no eta either. oh well [14:17:31] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) After the last round of rolling restarts (happened... 
[14:18:05] in theory also codfw/eqord might be affected, I'm assuming it is some problem on the eqord side [14:18:45] godog: if you guys want to open a task with all the info I can live-ping Arzhel :) [14:20:37] heheh not sure if there's anything actionable ATM elukey though [14:20:52] they've opened a case with the vendor, they say [14:20:53] so ... [14:20:57] ah! [14:20:59] good :) [14:21:10] and we already seem to be in the email loop for notifications [14:22:18] (03CR) 10Krinkle: ":D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514193 (owner: 10Bartosz Dziewoński) [14:24:46] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:25:41] as you were saying, go god [14:25:44] *dog !! [14:28:00] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) [14:29:21] heheh indeed, standing by for the eqord side [14:35:37] 10Operations, 10serviceops, 10PHP 7.2 support, 10Patch-For-Review, and 3 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) >>! In T224491#5235501, @Joe wrote: > No that's completely unrelated to opcache corruption. We're not res... 
[14:39:25] Operations, Deployments, Release-Engineering-Team: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (ArielGlenn) p:Triage→Normal
[14:48:05] (CR) Gehel: [C: -1] add WDQS reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: Mathew.onipe)
[14:49:34] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) So it appears that there is a fix upstream in sshd but it hasn't made it's wa...
[15:00:51] (PS6) Mathew.onipe: add WDQS reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385)
[15:01:41] (CR) Mathew.onipe: add WDQS reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: Mathew.onipe)
[15:03:56] Operations, Diffusion, Packaging, Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (ArielGlenn) I'd like to see us test with a locally patched sshd and see if that's inde...
[15:04:45] (CR) Krinkle: "In what way would removing this broken entry upset mediawiki? Is the array used for sharding keys in a way that we rely on for something?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: Effie Mouzeli)
[15:09:35] (CR) Ottomata: "It would not break anything :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: Effie Mouzeli)
[15:14:17] Operations, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (Nuria) @alaa_wmde please check if you have access to turnilo (before known as pivot) as @harshar mentioned this is probably a good tool to find answers to your questions. Pl...
[15:35:14] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Cmjohnson) Open→Resolved @marostegui that log entry may have been old. The server has both power supplies connected and does not report any current errors. Resolving the task.
[15:37:38] Operations, ops-eqiad, DC-Ops, Discovery, Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (Cmjohnson) @gehel you will need to take the server offline for a day so I can reseat the DIMM. The server logs do not indicate...
[15:41:10] !log shutting down elastic1029 for investigation - T214283
[15:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:15] T214283: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283
[15:42:59] Operations, ops-eqiad, DC-Ops, Discovery, Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (Gehel) @Cmjohnson elastic1029 is shut down and downtimed in icinga, do whatever you need to do and restart whenever it is done.
[15:43:48] jouncebot: now
[15:43:48] For the next 8 hour(s) and 16 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[15:43:50] jouncebot: next
[15:43:51] In 8 hour(s) and 16 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190612T0000)
[15:45:50] Operations, ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (Cmjohnson) Open→Resolved a:Cmjohnson @elukey I found a spare disk and added the disk back, it's now online Adapter #0 Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 11,...
[15:46:20] RECOVERY - cassandra CQL 10.192.32.146:9042 on maps2003 is OK: TCP OK - 0.036 second response time on 10.192.32.146 port 9042 https://phabricator.wikimedia.org/T93886
[15:49:57] Operations, ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (Cmjohnson) a:RobH This server's SSD's are not part of the original build and under HP warranty. They are intel SSDs that I believe came from restbase1001-1003. Assigning to @RobH to order new SSDs...
[15:50:16] Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) @Cmjohnson thank you!. The record looks like from 10th June but might be related to your maintenance actually: ` /system1/log1/record19 Targets Properties number=19 severity=Caution dat...
[15:50:31] Operations, ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (Cmjohnson) Open→Declined this is a duplicate task declining
[15:54:27] (CR) Jbond: [C: +2] Revert "Matrix wikimedia.org IDs domain authorization" [dns] - https://gerrit.wikimedia.org/r/516056 (https://phabricator.wikimedia.org/T223835) (owner: Gergő Tisza)
[15:54:36] (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/516056 (https://phabricator.wikimedia.org/T223835) (owner: Gergő Tisza)
[15:57:13] Operations, ops-eqiad, DC-Ops, User-Zppix, cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (Cmjohnson) @Andrew what parts? There is nothing that suggests that it is CPU on the server side of things. I reseated and...
[16:01:17] Operations, ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (Cmjohnson)
[16:01:30] Operations, ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (Cmjohnson)
[16:02:54] Operations, ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (Cmjohnson) Open→Resolved This server accepts all the racadm commands successfully. I verified on-site that these things actually happened /admin1-> racadm serveraction po...
[16:11:16] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (tramm) Hello @Dzahn and @CRoslof, what can we do to move forward with this? This is becoming more and more an obstacle for our work and we certainly didn't exp...
[16:13:00] ^^ can anyone help us with this, please?
[16:15:14] Operations, ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (Cmjohnson) they declined my ticket...says I didn't isolate the problem well enough.
[16:21:08] Operations, ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (Marostegui) Is there anything I can do from my side to help on that?
[16:22:13] (PS1) Ottomata: Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343)
[16:23:24] tramm: Dzahn is out on sick leave atm
[16:25:17] (CR) Ottomata: [C: +2] Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343) (owner: Ottomata)
[16:25:25] Reedy: can i ping someone else with this? i also mailed croslof and didn't get a reply
[16:25:25] (PS2) Ottomata: Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343)
[16:25:29] (CR) Ottomata: [V: +2 C: +2] Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343) (owner: Ottomata)
[16:26:21] tramm: I don't really know. It seems from the ticket that croslof (at least) has access to the control panel. Other people in legal may too, but I have no idea
[16:28:16] Chuck doesn't seem to be off or anything... When did you mail him?
[16:28:46] (CR) Nuria: [C: +1] Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343) (owner: Ottomata)
[16:29:38] Reedy: 2019-05-27 19:05 to croslof@wikimedia.org
[16:29:50] Have you followed up?
[16:31:34] we collectively followed up on phabricator as you can see: https://phabricator.wikimedia.org/T204056
[16:34:37] Sure, but the lawyers don't actually work day to day in phabricator
[16:34:55] It's hard to know whether he's necessarily seen the pings, or even got email notifications (because they can be turned off)
[16:36:00] there seem to be some activity and reactions in the past at least
[16:36:15] Sure, but peoples work schedules and activities change
[16:36:25] i'll reply my email and see what happens
[16:36:38] Seems a reasonable place to start
[16:37:01] Maybe cc legal@ which might get more attention from other people who can help
[16:37:18] As like I say, I think it needs to be legal to sort it, SRE/Ops can't really help
[16:40:49] did it, thanks Reedy
[16:55:05] Operations, serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (Legoktm) I don't remember if there was a reason I didn't build it for stretch-backports at the time, but that should be relatively straightforward if we decide to go with stretch instead of buster.
[17:09:42] (PS7) Mathew.onipe: add WDQS reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385)
[17:39:16] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[17:45:39] (PS1) Mathew.onipe: A more flexible approach for mjolnir update lag [puppet] - https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494)
[17:54:40] (PS1) Ottomata: Add 'Z' suffix to webrequest log dt format [puppet] - https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040)
[18:03:52] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[18:06:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:06:40] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:07:26] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:14:03] telia just reported that the card has been replaced
[18:14:07] so that ought to be the end of that
[18:26:36] Hmm, I just got temp removed from mediawiki-l due to apparent bounces O_o, is there anywhere to see what these bounces were? mailman wont tell me (as a user), but I can't see anything obviously wrong with my account...
[18:27:02] I mean, mailman managed to send me the email telling me I have been removed from the list.... :P
[18:27:15] addshore: you're the third person complaining, i think
[18:27:23] but the others were in other channels
[18:28:11] MatmaRex: I see
[18:28:36] (PS2) Mathew.onipe: A more flexible approach for mjolnir update lag [puppet] - https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494)
[18:38:28] (PS6) Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104
[18:38:53] addshore: you use gmail right?
[18:39:01] yarp
[18:40:41] Operations, Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (Legoktm)
[18:40:46] ^^
[18:41:46] (CR) Bstorm: dologmsg: move this little script out of toolforge profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[18:43:34] (PS7) Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - https://gerrit.wikimedia.org/r/515104
[18:43:48] (CR) Bstorm: dologmsg: move this little script out of toolforge profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[18:44:11] addshore: are you subcribed to other wm mailing lists fro that address? I am guessing yes?
[18:44:17] and no bounces from them?
[18:44:34] I am indeed subscribed to many other ones
[18:44:41] any you received mail today?
[18:44:48] it's just another data point for the ticket
[18:45:04] not received any emails from mailing lists today
[18:45:12] hm rats
[18:45:16] (CR) Bstorm: "No why is the blasted submodule and such showing up."
[puppet] - https://gerrit.wikimedia.org/r/515104 (owner: Bstorm)
[18:45:39] the email from mediawiki-l said it was getting excessive bounces sending me emails, but didnt say how many
[18:46:18] addshore: over 9000
[18:46:31] I got the last mw-l email fine, to my staff email of course
[18:46:42] couple hours ago
[18:46:55] https://lists.wikimedia.org/pipermail/mediawiki-l/2019-June/048020.html
[18:47:12] There's been a couple more since that
[18:47:13] https://lists.wikimedia.org/pipermail/mediawiki-l/2019-June/date.html
[18:47:45] I think the list admin should be able to look into mailman to see bounce reasons...
[18:51:02] just now there are, yeah
[18:51:07] when I went to look there weren't
[18:51:39] oh
[18:51:48] well there were but I sorted by thread doncha know. oops
[18:53:20] the most recent mail I have is 2 hours ago, it seems
[18:53:25] that's interesting
[18:54:18] (CR) EBernhardson: "i think this will work, but my prometheus-fu isn't super amazing." [puppet] - https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494) (owner: Mathew.onipe)
[18:54:26] (CR) EBernhardson: [C: +1] A more flexible approach for mjolnir update lag [puppet] - https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494) (owner: Mathew.onipe)
[19:17:40] Operations, Citoid, Security-Team, Traffic, and 4 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632 (sbassett)
[19:24:50] !log Removing four (4) files for legal compliance
[19:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:52] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 2 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[19:38:46] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
https://wikitech.wikimedia.org/wiki/Mailman
[19:51:09] I finally got those two mediawiki-l emails
[19:51:17] sure took a long time. anyways, no bounce, no nothing
[20:07:06] FYI I got a notice that I was unsub'ed from mediawiki-l for bounce issues
[20:10:49] Betacommand: there is a task you should probably subscribe to
[20:10:58] https://phabricator.wikimedia.org/T225553
[20:11:11] if you are not a gmail user, add a note to that effect too
[20:14:54] apergos: ah thanks
[20:15:00] yw
[20:48:38] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[20:55:12] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 801 days) https://wikitech.wikimedia.org/wiki/Logs
[21:46:30] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[21:47:08] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:04:20] (PS1) Bartosz Dziewoński: Ensure no lossy WTE→VE switching in public wikis (no-op) [mediawiki-config] - https://gerrit.wikimedia.org/r/516567
[22:12:32] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:13:10] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0]
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:27:20] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[22:41:51] Operations, Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (Aklapper) I know that a lot of folks received ` Your membership in the mailing list MediaWiki-l has been disabled due to excessive bounces The last bounc...
[23:00:00] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:03:00] Operations, serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (MoritzMuehlenhoff) >>! In T224572#5250436, @Legoktm wrote: > I don't remember if there was a reason I didn't build it for stretch-backports at the time, but that should be relatively straightforwa...
[23:07:47] (CR) Jforrester: [C: +1] "Looks good. Let's deploy first-thing on Monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/516567 (owner: Bartosz Dziewoński)
[23:21:24] PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[23:35:00] (CR) Elukey: [C: +1] Add 'Z' suffix to webrequest log dt format [puppet] - https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040) (owner: Ottomata)
[23:46:34] PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[23:54:00] RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:57:48] (CR) EBernhardson: [C: +1] [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/512195 (owner: DCausse)