[00:00:04] <jouncebot>	 Deploy window No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[00:37:36] <wikibugs>	 (PS2) Smalyshev: Set up dumps for mediainfo RDF generation [puppet] - https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917)
[01:11:05] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:21:11] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:41:07] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 27747352 and 0 seconds
[02:44:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 68232 and 45 seconds
[02:46:31] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 854.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:56:55] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[03:21:33] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[04:05:05] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:14:21] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:15:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:25:51] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:26:49] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:38:11] <apergos>	 the cr2 alerts are for planned maintenance, window runs for another 3+ hours
[04:41:53] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:52:57] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:53:27] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[04:57:21] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:58:01] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:21] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:10:31] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-upload site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:10:37] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[06:12:03] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:14:43] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:14:53] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:19:19] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[06:20:47] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[06:26:33] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:33:35] <icinga-wm>	 PROBLEM - puppet last run on db2085 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[07:00:45] <icinga-wm>	 RECOVERY - puppet last run on db2085 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:34:32] <wikibugs>	 Operations, ops-eqiad, DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (Marostegui) I have started a compare for main tables on s3 wikis.
[07:44:45] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[07:50:15] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (elukey) >>! In T219544#5248264, @Ottomata wrote: > Ok!  Creds deployed, and oozie job merged.  Refinery will be deployed this week and we can tr...
[07:51:31] <wikibugs>	 Operations, Continuous-Integration-Infrastructure (phase-out-jessie): Migrate contint* hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224591 (ArielGlenn) p:Triage→Normal
[07:52:32] <wikibugs>	 Operations, OTRS: Migrate mendelevium/OTRS host to Stretch/Buster - https://phabricator.wikimedia.org/T224590 (ArielGlenn) p:Triage→Normal
[07:52:46] <wikibugs>	 Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (ArielGlenn) p:Triage→Normal
[07:53:00] <wikibugs>	 Operations: Migrate fermium to stretch/buster - https://phabricator.wikimedia.org/T224586 (ArielGlenn) p:Triage→Normal
[07:53:19] <wikibugs>	 Operations, cloud-services-team: Migrate labmon* to Stretch - https://phabricator.wikimedia.org/T224585 (ArielGlenn) p:Triage→Normal
[07:53:40] <wikibugs>	 Operations, cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (ArielGlenn) p:Triage→Normal
[07:53:57] <wikibugs>	 Operations, cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (ArielGlenn) p:Triage→Normal
[07:54:08] <wikibugs>	 Operations, Wikimedia-Etherpad, serviceops: Migrate etherpad1001 to Stretch/Buster - https://phabricator.wikimedia.org/T224580 (ArielGlenn) p:Triage→Normal
[07:54:44] <wikibugs>	 Operations: Migrate irc.wikimedia.org/kraz to Stretch/Buster - https://phabricator.wikimedia.org/T224579 (ArielGlenn) p:Triage→Normal
[07:55:02] <wikibugs>	 Operations, serviceops, Kubernetes: Migrate etcd networking cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224577 (ArielGlenn) p:Triage→Normal
[07:55:22] <wikibugs>	 Operations: Upgrade install servers to Stretch/Buster - https://phabricator.wikimedia.org/T224576 (ArielGlenn) p:Triage→Normal
[07:55:38] <wikibugs>	 Operations, serviceops, Kubernetes: Migrate Kubernetes etcd clusters to Stretch/Buster - https://phabricator.wikimedia.org/T224574 (ArielGlenn) p:Triage→Normal
[07:55:57] <wikibugs>	 Operations, serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (ArielGlenn) p:Triage→Normal
[07:56:11] <wikibugs>	 Operations: Migrate auth* servers to Stretch/Buster - https://phabricator.wikimedia.org/T224571 (ArielGlenn) p:Triage→Normal
[07:56:25] <wikibugs>	 Operations, Pybal, Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (ArielGlenn) p:Triage→Normal
[07:56:44] <wikibugs>	 Operations, ORES, Scoring-platform-team, serviceops: Migrate ORES Redis servers to Stretch/Buster - https://phabricator.wikimedia.org/T224569 (ArielGlenn) p:Triage→Normal
[07:56:53] <wikibugs>	 Operations, Kubernetes: Migrate etcd cluster for Kubernetes staging cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224568 (ArielGlenn) p:Triage→Normal
[07:57:18] <wikibugs>	 Operations, serviceops: Migrate debug proxies to Stretch/Buster - https://phabricator.wikimedia.org/T224567 (ArielGlenn) p:Triage→Normal
[07:57:28] <wikibugs>	 Operations: Migrate mwlog/udp2log servers to Stretch/Buster - https://phabricator.wikimedia.org/T224565 (ArielGlenn) p:Triage→Normal
[07:57:42] <wikibugs>	 Operations: Reimage wezen to Stretch (and rename to centrallog2001) - https://phabricator.wikimedia.org/T224564 (ArielGlenn) p:Triage→Normal
[07:58:02] <wikibugs>	 Operations: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn) p:Triage→Normal a:ArielGlenn
[07:58:16] <wikibugs>	 Operations, cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (ArielGlenn) p:Triage→Normal
[07:58:25] <wikibugs>	 Operations, serviceops: Migrate Zookeeper/etcd conf cluster in codfw to Stretch - https://phabricator.wikimedia.org/T224560 (ArielGlenn) p:Triage→Normal
[07:58:38] <wikibugs>	 Operations, Traffic, serviceops: Migrate Failoid hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224559 (ArielGlenn) p:Triage→Normal
[07:58:53] <wikibugs>	 Operations: Migrate ldap/corp replicas to Stretch/Buster - https://phabricator.wikimedia.org/T224557 (ArielGlenn) p:Triage→Normal
[07:59:11] <wikibugs>	 Operations, Cassandra, RESTBase, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (ArielGlenn) p:Triage→Normal
[07:59:37] <wikibugs>	 Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (ArielGlenn) p:Triage→Normal
[07:59:47] <wikibugs>	 Operations: Migrate URL downloaders to Stretch/Buster - https://phabricator.wikimedia.org/T224551 (ArielGlenn) p:Triage→Normal
[08:03:21] <wikibugs>	 Operations, Gerrit, Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (ArielGlenn) p:Triage→Normal
[08:04:39] <wikibugs>	 Operations, ops-codfw, DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (ArielGlenn) p:Triage→Normal
[08:05:41] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[08:07:35] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[08:18:54] <wikibugs>	 Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2003.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906110818_gehel_7...
[08:20:42] <wikibugs>	 Operations, Gerrit, Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (hashar) The TLS stack is just fine and the query does reach the Apache in front of Gerrit. The reason is the OVH one is being rejected by our configura...
[08:21:31] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:22:20] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.13 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:23:17] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:23:24] <_joe_>	 what's up with kartotherian?
[08:23:25] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:23:27] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:23:29] <_joe_>	 gehel, onimisionipe?
[08:23:49] <icinga-wm>	 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[08:24:09] <onimisionipe>	 looking
[08:24:14] <gehel>	 codfw still depooled, no direct impact
[08:24:29] <gehel>	 but looks like not enough servers repooled in that cluster
[08:25:13] <icinga-wm>	 PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:25:16] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[08:26:02] <cdanis>	 gehel: logstash is showing lots of 500s for maps
[08:26:08] <_joe_>	 no gehel it's pooled
[08:26:28] <gehel>	 !log repooling maps200[124]
[08:26:29] <_joe_>	 see https://config-master.wikimedia.org/discovery/services.yaml
[08:26:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:34] <gehel>	 should be good in a second
[08:26:59] <onimisionipe>	 2003 was the last pooled node and now we're reimaging
[08:27:05] <apergos>	 are their space issues all set then? (it was those hosts with space issues after reimage right?)
[08:27:19] <gehel>	 apergos: we're mostly good now
[08:27:25] <apergos>	 ah great news
[08:27:33] <_joe_>	 onimisionipe: so you need to depool codfw when you do something like that
[08:27:33] <onimisionipe>	 yea
[08:27:52] <onimisionipe>	 _joe_: noted!
[08:28:35] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) Just as an FYI, everything looks ok on this end, but there's a train freeze this week, so we have to wait before dep...
[08:30:00] <wikibugs>	 Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Tarrow) @akosiaris Great! Thanks for that. We look forward to seeing how it all goes forward post-offsite :)
[08:30:31] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:30:37] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[08:30:41] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[08:30:41] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:31:41] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[08:33:05] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[08:35:19] <icinga-wm>	 RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5
[08:35:22] <wikibugs>	 Operations, Gerrit, Traffic: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110) - https://phabricator.wikimedia.org/T225347 (hashar) Since I am not familiar with that specific configuration and there are private data involved (IP address of the machine), I have filled **a priv...
[08:35:23] <icinga-wm>	 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[08:37:43] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Discovery, Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (Gehel) @Cmjohnson any news on this? Do you need anything from our side?
[08:39:23] <icinga-wm>	 PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:10:45] <icinga-wm>	 RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:19:33] <icinga-wm>	 RECOVERY - Disk space on ms-be2018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[09:20:00] <godog>	 !log free up space wrongly allocated onto / with sdc1 umounted on ms-be2018
[09:20:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:12] <wikibugs>	 Operations, Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2003.codfw.wmnet'] `  and were **ALL** successful.
[09:22:18] <apergos>	 oh it was sdc1!  heh
[09:22:24] <apergos>	 thanks go dog
[09:26:37] <godog>	 apergos: np! yeah as you mentioned space got wrongly written there while unmounted
[09:26:53] <apergos>	 how did you prove it? depool, umount, clean up?
[09:27:04] <apergos>	 (and what about rebalancing the rings and etc?)
[09:27:26] <godog>	 yeah exactly, stop swift/rsync, umount, delete, mount
[09:27:38] <godog>	 I'm writing it down on wikitech
[09:28:01] <apergos>	 oh cool, thanks!  I had a feeling the docs there were pretty outa date
[09:38:04] <godog>	 yeah some are indeed outdated :|
[09:41:22] <apergos>	 bug 1: in perpetuity
[09:43:03] <wikibugs>	 (CR) DCausse: "I have a patch chain which includes the same patch + some refactoring which are now possible: https://gerrit.wikimedia.org/r/q/topic:%2251" [mediawiki-config] - https://gerrit.wikimedia.org/r/514994 (https://phabricator.wikimedia.org/T87892) (owner: Smalyshev)
[09:56:46] <logmsgbot>	 !log gehel@cumin2001 START - Cookbook sre.postgresql.postgres-init
[09:56:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:56] <wikibugs>	 (PS2) Cparle: Add 'sms' langcode to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309)
[10:21:26] <wikibugs>	 (PS3) Cparle: Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309)
[10:29:30] <wikibugs>	 (PS2) CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - https://gerrit.wikimedia.org/r/514840
[10:31:48] <wikibugs>	 (PS1) Ema: admin: add ema's new yubikey [puppet] - https://gerrit.wikimedia.org/r/516469
[10:34:42] <wikibugs>	 (CR) Vgutierrez: [C: +2] "key verified via email && in person :)" [puppet] - https://gerrit.wikimedia.org/r/516469 (owner: Ema)
[10:35:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add Cumin backend for accessing Netbox [software/cumin] - https://gerrit.wikimedia.org/r/514840 (owner: CRusnov)
[10:54:12] <godog>	 !log wipe fs on ms-be1033 data partitions - T223518
[10:54:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:24] <stashbot>	 T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518
[11:29:49] <wikibugs>	 (CR) Matthias Mullie: [C: +2] Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[11:30:54] <wikibugs>	 (Merged) jenkins-bot: Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[11:32:35] <wikibugs>	 (CR) jenkins-bot: Add 'smn' and 'sms' langcodes to beta commons [mediawiki-config] - https://gerrit.wikimedia.org/r/513565 (https://phabricator.wikimedia.org/T222309) (owner: Cparle)
[11:50:53] <wikibugs>	 (PS1) Michael Große: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500)
[11:52:20] <logmsgbot>	 !log gehel@cumin2001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[11:52:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:36] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.146:9042 on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:52:42] <icinga-wm>	 PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[11:53:26] <icinga-wm>	 PROBLEM - tileratorui on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[11:53:30] <icinga-wm>	 PROBLEM - Maps HTTPS on maps2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.154 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[11:53:36] <icinga-wm>	 PROBLEM - cassandra service on maps2003 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:58:24] <onimisionipe>	 I'm looking
[12:00:32] <gehel>	 ^ downtime expired, just extended
[12:05:36] <icinga-wm>	 RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[12:06:22] <icinga-wm>	 RECOVERY - tileratorui on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 326 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[12:06:26] <icinga-wm>	 RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1288 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
[12:08:00] <icinga-wm>	 RECOVERY - cassandra service on maps2003 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:12:25] <wikibugs>	 Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (ArielGlenn) Still waiting on @Tobi_WMDE_SW
[12:21:02] <wikibugs>	 (PS1) Marostegui: Revert "db1077: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/516481
[12:21:09] <wikibugs>	 (CR) Marostegui: [C: -2] "Not ready yet" [puppet] - https://gerrit.wikimedia.org/r/516481 (owner: Marostegui)
[12:52:58] <wikibugs>	 Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (WMDE-leszek) Apologies for the late reaction. It's been a longer weekend for me. Thanks a lot @RStallman-legalteam for checking, and reaching out...
[12:53:37] <wikibugs>	 Operations, ops-eqiad, User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (fgiunchedi) Resolved→Open a:Cmjohnson→fgiunchedi
[12:54:13] <godog>	 !log swift eqiad-prod: put back ms-be1033 - T223518
[12:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:22] <stashbot>	 T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518
[13:18:44] <wikibugs>	 (PS2) Michael Große: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500)
[13:20:09] <wikibugs>	 (PS3) Michael Große: Set EntityUsageTable addUsage batch size to 200 [mediawiki-config] - https://gerrit.wikimedia.org/r/516478 (https://phabricator.wikimedia.org/T225500)
[13:26:48] <wikibugs>	 (PS3) CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - https://gerrit.wikimedia.org/r/514840
[13:26:57] <wikibugs>	 (CR) Elukey: "Thanks Andrew! Followed up with more questions :)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[13:34:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add Cumin backend for accessing Netbox [software/cumin] - https://gerrit.wikimedia.org/r/514840 (owner: CRusnov)
[13:40:43] <wikibugs>	 (PS5) Mathew.onipe: add WDQS reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385)
[13:40:55] <wikibugs>	 (CR) Mathew.onipe: add WDQS reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: Mathew.onipe)
[13:41:29] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (Krinkle) +1 to close :)  Back when the list was still private, it also had some overlap "ops-l", which continues to be a private list for anyone with production access (volunteers, WMDE, W...
[13:41:40] <wikibugs>	 Operations, Analytics, Analytics-Kanban, Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (Ottomata) Already done in T220081 :)
[13:48:15] <wikibugs>	 (PS1) Paladox: Raise recieve.maxBatchChanges To 20 [puppet] - https://gerrit.wikimedia.org/r/516492
[13:48:49] <wikibugs>	 (PS2) Paladox: Raise recieve.maxBatchChanges To 20 [puppet] - https://gerrit.wikimedia.org/r/516492
[13:49:14] <hauskatze>	 To -> to ?
[13:49:20] <wikibugs>	 (PS3) Paladox: Gerrit: raise recieve.maxBatchChanges To 20 [puppet] - https://gerrit.wikimedia.org/r/516492
[13:50:12] <paladox>	 hauskatze: oh yes
[13:50:32] <wikibugs>	 (PS4) Paladox: Gerrit: raise recieve.maxBatchChanges to 20 [puppet] - https://gerrit.wikimedia.org/r/516492
[13:50:49] <hauskatze>	 Orthography C+1 :D
[13:51:15] <wikibugs>	 (CR) Reedy: [C: +1] "+1 from me" [puppet] - https://gerrit.wikimedia.org/r/516492 (owner: Paladox)
[13:54:01] <wikibugs>	 Operations, Release-Engineering-Team, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (alaa_wmde) hi @Nuria   Yeap I actually have access to logstash already. I must have confused it somehow into thinking that there's another logsta...
[13:55:01] <wikibugs>	 (CR) Ottomata: Allow Hadoop-related profiles to deploy Kerberos keytabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: Elukey)
[13:55:58] <wikibugs>	 (CR) Thcipriani: "Per the docs:" [puppet] - https://gerrit.wikimedia.org/r/516492 (owner: Paladox)
[13:58:37] <wikibugs>	 Operations, SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (hashar) Removing #release-engineering-team since there is already deployment/logstash access :]   @alaa_wmde you might already have access to http://pivot.wikimedia.org/ which...
[14:09:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:09:20] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:11:55] <wikibugs>	 (03CR) 10Elukey: Allow Hadoop-related profiles to deploy Kerberos keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515010 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey)
[14:12:14] <godog>	 that's telia down ulsfo/eqord btw
[14:13:45] <apergos>	 what's the circuit? I couldn't find planned maintenance for telia just now
[14:14:17] <wikibugs>	 (03Abandoned) 10Paladox: Gerrit: raise recieve.maxBatchChanges to 20 [puppet] - 10https://gerrit.wikimedia.org/r/516492 (owner: 10Paladox)
[14:14:31] <godog>	 ah yeah found it, telia sent a maint-announce@ email, suspected card failure
[14:14:38] <godog>	 half an hour ago that is
[14:14:40] <apergos>	 oh the one that says
[14:14:49] <apergos>	 Please note that if your service is a protected service, you should not experience any issues as your service is running on a protected path.
[14:14:54] <apergos>	 and of course ours isn't? nice
[14:15:24] <apergos>	 no eta either. oh well
[14:17:31] <wikibugs>	 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) After the last round of rolling restarts (happened...
[14:18:05] <godog>	 in theory also codfw/eqord might be affected, I'm assuming it is some problem on the eqord side
[14:18:45] <elukey>	 godog: if you guys want to open a task with all the info I can live-ping Arzhel :)
[14:20:37] <godog>	 heheh not sure if there's anything actionable ATM elukey though
[14:20:52] <apergos>	 they've opened a case with the vendor, they say
[14:20:53] <apergos>	 so ...
[14:20:57] <elukey>	 ah!
[14:20:59] <elukey>	 good :)
[14:21:10] <apergos>	 and we already seem to be in the email loop for notifications
[14:22:18] <wikibugs>	 (03CR) 10Krinkle: ":D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514193 (owner: 10Bartosz Dziewoński)
[14:24:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:25:41] <apergos>	 as you were saying, go god
[14:25:44] <apergos>	 *dog  !!  
[14:28:00] <wikibugs>	 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul)
[14:29:21] <godog>	 heheh indeed, standing by for the eqord side
[14:35:37] <wikibugs>	 10Operations, 10serviceops, 10PHP 7.2 support, 10Patch-For-Review, and 3 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (10Krinkle) >>! In T224491#5235501, @Joe wrote: > No that's completely unrelated to opcache corruption. We're not res...
[14:39:25] <wikibugs>	 10Operations, 10Deployments, 10Release-Engineering-Team: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (10ArielGlenn) p:05Triage→03Normal
[14:48:05] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] add WDQS reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe)
[14:49:34] <wikibugs>	 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) So it appears that there is a fix upstream in sshd but it hasn't made it's wa...
[15:00:51] <wikibugs>	 (03PS6) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385)
[15:01:41] <wikibugs>	 (03CR) 10Mathew.onipe: add WDQS reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe)
[15:03:56] <wikibugs>	 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10ArielGlenn) I'd like to see us test with a locally patched sshd and see if that's inde...
[15:04:45] <wikibugs>	 (03CR) 10Krinkle: "In what way would removing this broken entry upset mediawiki? Is the array used for sharding keys in a way that we rely on for something?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: 10Effie Mouzeli)
[15:09:35] <wikibugs>	 (03CR) 10Ottomata: "It would not break anything :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513033 (https://phabricator.wikimedia.org/T224538) (owner: 10Effie Mouzeli)
[15:14:17] <wikibugs>	 10Operations, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10Nuria) @alaa_wmde please check if you have access to turnilo (before known as pivot) as @harshar mentioned this is probably a good tool to find answers to your questions.   Pl...
[15:35:14] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Cmjohnson) 05Open→03Resolved @marostegui that log entry may have been old. The server has both power supplies connected and does not report any current errors.  Resolving the task.
[15:37:38] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Cmjohnson) @gehel you will need to take the server offline for a day so I can reseat the DIMM.  The server logs do not indicate...
[15:41:10] <gehel>	 !log shutting down elastic1029 for investigation - T214283
[15:41:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:15] <stashbot>	 T214283: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283
[15:42:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery, 10Discovery-Search (Current work): Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (10Gehel) @Cmjohnson elastic1029 is shut down and downtimed in icinga, do whatever you need to do and restart whenever it is done.
[15:43:48] <hashar>	 jouncebot: now
[15:43:48] <jouncebot>	 For the next 8 hour(s) and 16 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190611T0000)
[15:43:50] <hashar>	 jouncebot: next
[15:43:51] <jouncebot>	 In 8 hour(s) and 16 minute(s): No deploys - SRE Summit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190612T0000)
[15:45:50] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson @elukey   I found a spare disk and added the disk back, it's now online  Adapter #0  Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGroup: 11,...
[15:46:20] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.146:9042 on maps2003 is OK: TCP OK - 0.036 second response time on 10.192.32.146 port 9042 https://phabricator.wikimedia.org/T93886
[15:49:57] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (10Cmjohnson) a:03RobH This server's SSD's are not part of the original build and under HP warranty.  They are intel SSDs that I believe came from restbase1001-1003.   Assigning to @RobH to order new SSDs...
[15:50:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1077 crashed - https://phabricator.wikimedia.org/T225391 (10Marostegui) @Cmjohnson thank you!. The record looks like from 10th June but might be related to your maintenance actually: ` /system1/log1/record19   Targets   Properties     number=19     severity=Caution     dat...
[15:50:31] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T223825 (10Cmjohnson) 05Open→03Declined this is a duplicate task declining
[15:54:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "Matrix wikimedia.org IDs domain authorization" [dns] - 10https://gerrit.wikimedia.org/r/516056 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza)
[15:54:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/516056 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza)
[15:57:13] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Cmjohnson) @Andrew what parts? There is nothing that suggests that it is CPU on the server side of things.  I reseated and...
[16:01:17] <wikibugs>	 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10Cmjohnson)
[16:01:30] <wikibugs>	 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10Cmjohnson)
[16:02:54] <wikibugs>	 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10Cmjohnson) 05Open→03Resolved This server accepts all the racadm commands successfully. I verified on-site that these things actually happened  /admin1-> racadm serveraction po...
[16:11:16] <wikibugs>	 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) Hello @Dzahn and @CRoslof, what can we do to move forward with this? This is becoming more and more an obstacle for our work and we certainly didn't exp...
[16:13:00] <tramm>	 ^^ can anyone help us with this, please?
[16:15:14] <wikibugs>	 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Cmjohnson) they declined my ticket...says I didn't isolate the problem well enough.
[16:21:08] <wikibugs>	 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) Is there anything I can do from my side to help on that?
[16:22:13] <wikibugs>	 (03PS1) 10Ottomata: Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343)
[16:23:24] <Reedy>	 tramm: Dzahn is out on sick leave atm
[16:25:17] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343) (owner: 10Ottomata)
[16:25:25] <tramm>	 Reedy: can i ping someone else with this? i also mailed croslof and didn't get a reply
[16:25:25] <wikibugs>	 (03PS2) 10Ottomata: Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343)
[16:25:29] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343) (owner: 10Ottomata)
[16:26:21] <Reedy>	 tramm: I don't really know. It seems from the ticket that croslof (at least) has access to the control panel. Other people in legal may too, but I have no idea
[16:28:16] <Reedy>	 Chuck doesn't seem to be off or anything... When did you mail him?
[16:28:46] <wikibugs>	 (03CR) 10Nuria: [C: 03+1] Fix incorrect to_emails parameter for analytics Refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/516515 (https://phabricator.wikimedia.org/T225343) (owner: 10Ottomata)
[16:29:38] <tramm>	 Reedy: 2019-05-27 19:05 to croslof@wikimedia.org
[16:29:50] <Reedy>	 Have you followed up?
[16:31:34] <tramm>	 we collectively followed up on phabricator as you can see: https://phabricator.wikimedia.org/T204056
[16:34:37] <Reedy>	 Sure, but the lawyers don't actually work day to day in phabricator
[16:34:55] <Reedy>	 It's hard to know whether he's necessarily seen the pings, or even got email notifications (because they can be turned off)
[16:36:00] <tramm>	 there seems to be some activity and reactions in the past at least
[16:36:15] <Reedy>	 Sure, but people's work schedules and activities change
[16:36:25] <tramm>	 i'll reply to my email and see what happens
[16:36:38] <Reedy>	 Seems a reasonable place to start
[16:37:01] <Reedy>	 Maybe cc legal@ which might get more attention from other people who can help
[16:37:18] <Reedy>	 As like I say, I think it needs to be legal to sort it, SRE/Ops can't really help
[16:40:49] <tramm>	 did it, thanks Reedy
[16:55:05] <wikibugs>	 10Operations, 10serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10Legoktm) I don't remember if there was a reason I didn't build it for stretch-backports at the time, but that should be relatively straightforward if we decide to go with stretch instead of buster.
[17:09:42] <wikibugs>	 (03PS7) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385)
[17:39:16] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[17:45:39] <wikibugs>	 (03PS1) 10Mathew.onipe: A more flexible approach for mjolnir update lag [puppet] - 10https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494)
[17:54:40] <wikibugs>	 (03PS1) 10Ottomata: Add 'Z' suffix to webrequest log dt format [puppet] - 10https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040)
[18:03:52] <icinga-wm>	 RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[18:06:22] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:06:40] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:07:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:14:03] <apergos>	 telia just reported that the card has been replaced
[18:14:07] <apergos>	 so that ought to be the end of that
[18:26:36] <addshore>	 Hmm, I just got temp removed from mediawiki-l due to apparent bounces O_o, is there anywhere to see what these bounces were? mailman wont tell me (as a user), but I can't see anything obviously wrong with my account...
[18:27:02] <addshore>	 I mean, mailman managed to send me the email telling me I have been removed from the list.... :P
[18:27:15] <MatmaRex>	 addshore: you're the third person complaining, i think
[18:27:23] <MatmaRex>	 but the others were in other channels
[18:28:11] <addshore>	 MatmaRex: I see
[18:28:36] <wikibugs>	 (03PS2) 10Mathew.onipe: A more flexible approach for mjolnir update lag [puppet] - 10https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494)
[18:38:28] <wikibugs>	 (03PS6) 10Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104
[18:38:53] <legoktm>	 addshore: you use gmail right?
[18:39:01] <addshore>	 yarp
[18:40:41] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Legoktm)
[18:40:46] <legoktm>	 ^^
[18:41:46] <wikibugs>	 (03CR) 10Bstorm: dologmsg: move this little script out of toolforge profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm)
[18:43:34] <wikibugs>	 (03PS7) 10Bstorm: dologmsg: move this little script out of toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/515104
[18:43:48] <wikibugs>	 (03CR) 10Bstorm: dologmsg: move this little script out of toolforge profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm)
[18:44:11] <apergos>	 addshore: are you subscribed to other wm mailing lists from that address? I am guessing yes?
[18:44:17] <apergos>	 and no bounces from them?
[18:44:34] <addshore>	 I am indeed subscribed to many other ones
[18:44:41] <apergos>	 any you received mail today?
[18:44:48] <apergos>	 it's just another data point for the ticket
[18:45:04] <addshore>	 not received any emails from mailing lists today
[18:45:12] <apergos>	 hm rats
[18:45:16] <wikibugs>	 (03CR) 10Bstorm: "No why is the blasted submodule and such showing up." [puppet] - 10https://gerrit.wikimedia.org/r/515104 (owner: 10Bstorm)
[18:45:39] <addshore>	 the email from mediawiki-l said it was getting excessive bounces sending me emails, but didnt say how many
[18:46:18] <Reedy>	 addshore: over 9000
[18:46:31] <apergos>	 I got the last mw-l email fine, to my staff email of course
[18:46:42] <apergos>	 couple hours ago
[18:46:55] <apergos>	 https://lists.wikimedia.org/pipermail/mediawiki-l/2019-June/048020.html 
[18:47:12] <Reedy>	 There's been a couple more since that
[18:47:13] <Reedy>	 https://lists.wikimedia.org/pipermail/mediawiki-l/2019-June/date.html
[18:47:45] <legoktm>	 I think the list admin should be able to look into mailman to see bounce reasons...
[18:51:02] <apergos>	 just now there are, yeah
[18:51:07] <apergos>	 when I went to look there weren't
[18:51:39] <apergos>	 oh
[18:51:48] <apergos>	 well there were but I sorted by thread doncha know. oops
[18:53:20] <apergos>	 the most recent mail I have is 2 hours ago, it seems
[18:53:25] <apergos>	 that's interesting
[18:54:18] <wikibugs>	 (03CR) 10EBernhardson: "i think this will work, but my prometheus-fu isn't super amazing." [puppet] - 10https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe)
[18:54:26] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] A more flexible approach for mjolnir update lag [puppet] - 10https://gerrit.wikimedia.org/r/516526 (https://phabricator.wikimedia.org/T214494) (owner: 10Mathew.onipe)
[19:17:40] <wikibugs>	 10Operations, 10Citoid, 10Security-Team, 10Traffic, and 4 others: http://citoid.wikimedia.org/ should force HTTPS - https://phabricator.wikimedia.org/T108632 (10sbassett)
[19:24:50] <tzatziki>	 !log Removing four (4) files for legal compliance
[19:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:52] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 2 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[19:38:46] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[19:51:09] <apergos>	 I finally got those two mediawiki-l emails 
[19:51:17] <apergos>	 sure took a long time. anyways, no bounce, no nothing
[20:07:06] <Betacommand>	 FYI I got a notice that I was unsub'ed from mediawiki-l for bounce issues 
[20:10:49] <apergos>	 Betacommand: there is a task you should probably subscribe to
[20:10:58] <apergos>	 https://phabricator.wikimedia.org/T225553
[20:11:11] <apergos>	 if you are not a gmail user, add a note to that effect too
[20:14:54] <Betacommand>	 apergos: ah thanks
[20:15:00] <apergos>	 yw
[20:48:38] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[20:55:12] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 801 days) https://wikitech.wikimedia.org/wiki/Logs
[21:46:30] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[21:47:08] <icinga-wm>	 PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:04:20] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Ensure no lossy WTE→VE switching in public wikis (no-op) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516567
[22:12:32] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:13:10] <icinga-wm>	 RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[22:27:20] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[22:41:51] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: gmail users being suspended from mediawiki-l due to excessive bounces - https://phabricator.wikimedia.org/T225553 (10Aklapper) I know that a lot of folks received ` Your membership in the mailing list MediaWiki-l has been disabled due to excessive bounces The last bounc...
[23:00:00] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:03:00] <wikibugs>	 10Operations, 10serviceops: Migrate pool counters to Stretch/Buster - https://phabricator.wikimedia.org/T224572 (10MoritzMuehlenhoff) >>! In T224572#5250436, @Legoktm wrote: > I don't remember if there was a reason I didn't build it for stretch-backports at the time, but that should be relatively straightforwa...
[23:07:47] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Looks good. Let's deploy first-thing on Monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/516567 (owner: 10Bartosz Dziewoński)
[23:21:24] <icinga-wm>	 PROBLEM - puppet last run on db1090 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[23:35:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add 'Z' suffix to webrequest log dt format [puppet] - 10https://gerrit.wikimedia.org/r/516528 (https://phabricator.wikimedia.org/T217040) (owner: 10Ottomata)
[23:46:34] <icinga-wm>	 PROBLEM - puppet last run on alcyone is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[23:54:00] <icinga-wm>	 RECOVERY - puppet last run on db1090 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[23:57:48] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+1] [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 (owner: 10DCausse)