[00:15:39] <wikibugs>	 (PS1) EddieGP: hiera: Kill some hiera paths [labs/private] - https://gerrit.wikimedia.org/r/423189
[00:21:00] <wikibugs>	 (PS1) EddieGP: cloud hiera: Remove unused paths from hierarchy [puppet] - https://gerrit.wikimedia.org/r/423190
[00:39:51] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080501 (EddieGP) The title for swift::init_device comes from a hiera lookup (hiera key `swift_storage_drives`) . Openstack browser shows this key is set to the value 'lv-a' on deployment-...
[00:46:54] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4094961 (EddieGP) Related: T184236 and the attached patches.
[00:54:20] <wikibugs>	 (CR) EddieGP: "As far as the (proven to be outdated on other issues today) docs say, role-based lookup isn't supported in labs puppet. Although I don't k" [labs/private] - https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) (owner: MarcoAurelio)
[03:26:09] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.89 seconds
[04:01:18] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.30 seconds
[05:16:38] <icinga-wm>	 PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 1643 MB (3% inode=99%)
[05:30:19] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.131 and port 9042: Connection refused
[05:30:19] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[05:30:28] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[05:30:29] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:30:29] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:30:49] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.130 and port 9042: Connection refused
[05:30:59] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:31:08] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused
[05:31:09] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.130:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:31:18] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[05:44:28] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active
[05:44:29] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1009 is OK: OK - cassandra-b is active
[05:45:08] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[05:45:19] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1009 is OK: OK - cassandra-c is active
[06:22:28] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[06:38:38] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:45:07] <wikibugs>	 (PS1) Elukey: profile::prometheus::alerts: fix new eventlogging alarms [puppet] - https://gerrit.wikimedia.org/r/423198
[06:45:45] <wikibugs>	 (CR) Elukey: [C: 2] profile::prometheus::alerts: fix new eventlogging alarms [puppet] - https://gerrit.wikimedia.org/r/423198 (owner: Elukey)
[09:52:31] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 26 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[09:55:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received
[09:55:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for Apri
[09:55:41] <icinga-wm>	 ut before a response was received
[09:56:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200)
[09:56:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[09:56:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregat
[09:56:50] <icinga-wm>	  April 29, 2016) timed out before a response was received
[09:57:20] <icinga-wm>	 PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused
[09:57:31] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 5 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[09:57:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[09:57:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[09:57:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[09:58:20] <icinga-wm>	 RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.035 second response time
[09:59:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[09:59:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[09:59:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[09:59:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[10:05:31] <wikibugs>	 (PS1) MarcoAurelio: Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423231
[10:06:54] <wikibugs>	 (CR) MarcoAurelio: "> In re. EddieGP" [labs/private] - https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) (owner: MarcoAurelio)
[10:14:55] <wikibugs>	 (PS1) EddieGP: hiera: fix deployment-mira, lacking ::git_owner [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110)
[10:15:12] <wikibugs>	 (CR) EddieGP: [C: 1] Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423231 (owner: MarcoAurelio)
[10:16:38] <wikibugs>	 (CR) EddieGP: "@MarcoAurelio: You added the correct key to hiera, you just happened to add it to some hiera file that's not actually used anywhere (pleas" [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP)
[10:35:57] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4095116 (MarcoAurelio) I've been looking into https://horizon.wikimedia.org/project/puppet/ and apparently I cannot do anything from there but to simply see. I am also unfamiliar with Pupp...
[10:39:07] <wikibugs>	 (PS1) EddieGP: hiera: Remove unused paths [labs/private] - https://gerrit.wikimedia.org/r/423233
[10:53:01] <wikibugs>	 (CR) MarcoAurelio: "> @MarcoAurelio: [...]" [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP)
[11:13:06] <urandom>	 !log stopping restbase1009-a (high hints storage)
[11:13:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:28] <urandom>	 !log restarting restbase1009-b
[11:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:40] <icinga-wm>	 RECOVERY - Disk space on restbase1009 is OK: DISK OK
[11:17:01] <elukey>	 \o
[11:17:03] <elukey>	 \o/
[11:17:20] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2018-08-17 16:11:02 +0000 (expires in 139 days)
[11:17:38] <urandom>	 hrmm... puppet must have brought that one up
[11:17:51] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042
[11:18:18] <urandom>	 !log truncating hints, restbase1009-a
[11:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:49] <urandom>	 !log starting restbase1009-c
[11:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:03] <urandom>	 !log removing corrupt commitlog segment, restbase1009-b
[11:25:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:10] <icinga-wm>	 PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:26:00] <urandom>	 !log removing corrupt commitlog segment, restbase1009-c
[11:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:10] <icinga-wm>	 RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational
[11:28:41] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.130:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-b valid until 2018-08-17 16:11:03 +0000 (expires in 139 days)
[11:29:40] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.130 port 9042
[11:30:11] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.131 port 9042
[11:30:20] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-c valid until 2018-08-17 16:11:04 +0000 (expires in 139 days)
[11:43:25] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4095140 (EddieGP) Actually this is a duplicate. After https://gerrit.wikimedia.org/r/#/c/361648/ the "/dev/swift/" part will be implicit as well as the trailing "1", and after https://gerr...
[11:43:45] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4095142 (EddieGP)
[11:43:51] <wikibugs>	 Operations, Puppet, Beta-Cluster-Infrastructure, media-storage, Patch-For-Review: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#4095145 (EddieGP)
[12:01:01] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[12:01:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:01:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:01:40] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:01:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:01:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:01:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:01:40] <icinga-wm>	 PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused
[12:02:01] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[12:02:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[12:02:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[12:02:40] <icinga-wm>	 RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.025 second response time
[12:02:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[12:03:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[12:04:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[12:04:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[12:04:31] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[12:04:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[12:04:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[12:04:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[12:04:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[12:04:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[12:04:41] <icinga-wm>	 PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused
[12:05:10] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received
[12:05:20] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (retrieve structured reference data for the Cat article on English Wikipedia) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-w
[12:05:20] <icinga-wm>	 ns for cat) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received
[12:05:41] <icinga-wm>	 RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.035 second response time
[12:06:01] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy
[12:06:10] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[12:19:00] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:27:01] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[12:43:53] <wikibugs>	 Operations, Beta-Cluster-Infrastructure, DNS, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - https://phabricator.wikimedia.org/T153468#4095180 (EddieGP)
[12:54:19] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4095186 (EddieGP) Open>Resolved a:EddieGP >>! In T132259#3879429, demon wrote: > Is this really best as a tracking task or should we add it to the...
[13:53:47] <wikibugs>	 (PS1) ArielGlenn: allow configuration of extra dir to search for prefetch files [dumps] - https://gerrit.wikimedia.org/r/423241
[13:53:49] <wikibugs>	 (PS1) ArielGlenn: use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242
[13:54:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 (owner: ArielGlenn)
[13:56:18] <wikibugs>	 (PS2) ArielGlenn: use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242
[13:56:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 (owner: ArielGlenn)
[13:57:20] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:58:06] <wikibugs>	 (PS3) ArielGlenn: use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242
[16:20:01] <wikibugs>	 (CR) EddieGP: "recheck" [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP)
[16:31:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out be
[16:31:41] <icinga-wm>	  received
[16:32:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received
[16:32:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:32:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:32:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:33:00] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/page/css/mobile/app/bundle (Untitled test) timed out before a response was received: / (spec from root) timed out before a response was received: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve exte
[16:33:00] <icinga-wm>	 ideo article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received
[16:33:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:33:11] <icinga-wm>	 PROBLEM - apertium apy on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:33:20] <icinga-wm>	 PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused
[16:33:20] <icinga-wm>	 PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:33:30] <icinga-wm>	 PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused
[16:33:31] <icinga-wm>	 PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[16:33:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[16:33:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:33:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:33:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received
[16:33:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[16:33:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[16:33:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[16:33:51] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:34:10] <icinga-wm>	 RECOVERY - apertium apy on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.002 second response time
[16:34:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[16:36:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[16:36:41] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received
[16:36:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[16:36:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:36:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received
[16:37:01] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mob
[16:37:01] <icinga-wm>	  out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/css/mobile/app/base (Untitled test) timed out before a response was received
[16:37:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received
[16:37:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[16:38:00] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:38:11] <icinga-wm>	 RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[16:38:40] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received
[16:38:40] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:38:43] <paladox>	 hmm
[16:38:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy
[16:38:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[16:38:53] <paladox>	 ops ^^
[16:39:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[16:39:21] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received
[16:39:30] <icinga-wm>	 RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[16:39:30] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[16:39:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[16:39:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[16:39:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[16:39:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy
[16:39:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[16:39:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[16:39:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy
[16:40:01] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received
[16:40:11] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy
[16:40:30] <icinga-wm>	 RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time
[16:40:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[16:40:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy
[16:42:00] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy
[16:42:20] <icinga-wm>	 RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.027 second response time
[17:16:17] <wikibugs>	 (PS2) Dzahn: hiera: fix deployment-mira, lacking ::git_owner [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP)
[17:16:35] <wikibugs>	 (PS1) Dzahn: Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423247
[17:16:53] <wikibugs>	 (CR) Dzahn: [C: 2] hiera: fix deployment-mira, lacking ::git_owner [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP)
[17:17:13] <wikibugs>	 (CR) Dzahn: [V: 2 C: 2] Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423247 (owner: Dzahn)
[17:22:00] <wikibugs>	 (PS1) Dzahn: smokeping: replace bast1001 with bast1002 target [puppet] - https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412)
[17:24:57] <wikibugs>	 (PS1) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249
[17:25:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (owner: Dzahn)
[17:26:31] <wikibugs>	 (PS2) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[17:26:37] <wikibugs>	 (PS1) Dzahn: network::constants: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412)
[17:26:44] <mutante>	 ..and out again
[17:27:29] <wikibugs>	 (03CR) 10Dzahn: [V: 032 C: 032] "replaced by https://gerrit.wikimedia.org/r/423232" [labs/private] - 10https://gerrit.wikimedia.org/r/423247 (owner: 10Dzahn)
[17:34:00] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:44:29] <wikibugs>	 (03CR) 10EddieGP: "Done in https://gerrit.wikimedia.org/r/c/423247" [labs/private] - 10https://gerrit.wikimedia.org/r/423231 (owner: 10MarcoAurelio)
[18:15:01] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[19:22:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[19:34:50] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 1 down 0
[19:53:44] <wikibugs>	 10Operations, 10monitoring: add tftpd monitoring - https://phabricator.wikimedia.org/T190439#4095453 (10Dzahn) a:03Dzahn
[19:58:59] <wikibugs>	 (03PS2) 10Dzahn: smokeping: replace bast1001 with bast1002 target [puppet] - 10https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412)
[20:00:02] <wikibugs>	 (03PS3) 10Dzahn: smokeping: replace bast1001 with bast1002 target [puppet] - 10https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412)
[20:00:29] <wikibugs>	 (03CR) 10Dzahn: [C: 032] smokeping: replace bast1001 with bast1002 target [puppet] - 10https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn)
[20:02:27] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4095458 (10MarcoAurelio) Still erroring:  ``` maurelio@deployment-mira:~$ sudo puppet agent -tv Info: Using configured environment 'future' I...
[20:20:42] <wikibugs>	 (03Draft1) 10Paladox: hiera: fix deployment-mira, lacking ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423256 (https://phabricator.wikimedia.org/T191110)
[20:20:46] <wikibugs>	 (03PS2) 10Paladox: hiera: fix deployment-mira, lacking ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423256 (https://phabricator.wikimedia.org/T191110)
[20:21:31] <wikibugs>	 (03CR) 10Dzahn: [C: 032] hiera: fix deployment-mira, lacking ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423256 (https://phabricator.wikimedia.org/T191110) (owner: 10Paladox)
[20:28:44] <wikibugs>	 (03PS3) 10Dzahn: site/install/bastionhost: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[20:29:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] site/install/bastionhost: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn)
[20:34:45] <wikibugs>	 (03Draft1) 10Paladox: hiera: fix deployment-tin, lacking :: git_owner and :: git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257
[20:34:47] <wikibugs>	 (03PS2) 10Paladox: hiera: fix deployment-tin, lacking :: git_owner and :: git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257
[20:35:09] <wikibugs>	 (03PS4) 10Dzahn: site/install/bastionhost: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[20:35:46] <wikibugs>	 (03PS3) 10Paladox: hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257
[20:38:34] <wikibugs>	 (03PS5) 10Dzahn: site/install/bastionhost: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[20:38:43] <wikibugs>	 (03PS4) 10Paladox: hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257
[20:39:06] <wikibugs>	 (03CR) 10EddieGP: [C: 031] hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257 (owner: 10Paladox)
[20:39:50] <wikibugs>	 (03CR) 10Dzahn: [C: 032] site/install/bastionhost: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn)
[20:40:45] <wikibugs>	 (03PS5) 10Dzahn: hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257 (owner: 10Paladox)
[20:41:27] <wikibugs>	 (03CR) 10Dzahn: [C: 032] hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - 10https://gerrit.wikimedia.org/r/423257 (owner: 10Paladox)
[20:41:45] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4095478 (10Paladox)
[20:41:49] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4095477 (10Paladox) 05Open>03Resolved
[20:42:05] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4093928 (10Paladox)
[20:43:17] <wikibugs>	 (03Abandoned) 10Dzahn: Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - 10https://gerrit.wikimedia.org/r/423231 (owner: 10MarcoAurelio)
[20:47:51] <wikibugs>	 (03PS2) 10Dzahn: network::constants: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412)
[20:57:11] <icinga-wm>	 PROBLEM - Check systemd state on labtestmetal2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:04:40] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mx - https://phabricator.wikimedia.org/T191151#4095484 (10MarcoAurelio)
[21:10:00] <wikibugs>	 (03CR) 10Dzahn: [C: 032] network::constants: remove bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn)
[21:12:33] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "not a bastion anymore, the role has been removed and with it the ferm rules on it" [puppet] - 10https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412) (owner: 10Dzahn)
[21:13:33] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mx - https://phabricator.wikimedia.org/T191151#4095500 (10EddieGP)
[21:13:37] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4095503 (10EddieGP)
[21:14:50] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mx02 due to some Letsencrypt stuff - https://phabricator.wikimedia.org/T191152#4095504 (10MarcoAurelio)
[21:15:59] <mutante>	 !log bast1001 has been shut down and decom'ed as planned. if you have any issues with shell access make sure you have replaced with bast1002 or any other bast host
[21:16:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:07] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mx02 due to some Letsencrypt stuff - https://phabricator.wikimedia.org/T191152#4095516 (10EddieGP)
[21:18:10] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4095518 (10EddieGP)
[21:27:25] <wikibugs>	 10Operations, 10hardware-requests: decom bast1001 - https://phabricator.wikimedia.org/T191153#4095519 (10Dzahn) p:05Triage>03High
[21:27:26] <apergos>	 \o/
[21:27:38] <wikibugs>	 10Operations, 10hardware-requests: decom bast1001 - https://phabricator.wikimedia.org/T191153#4095519 (10Dzahn) a:05Dzahn>03None
[21:27:43] <apergos>	 bye bye bast1001, it was nice to know you
[21:28:04] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4095534 (10MarcoAurelio)
[21:28:22] <mutante>	 apergos: :)
[21:28:26] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4095535 (10MarcoAurelio) https://github.com/search?q=org%3Awikimedia+profile%3A%3Aetcd%3A%3Atlsproxy%3A%3Alisten_port&type=Code
[21:30:24] <wikibugs>	 10Operations, 10Patch-For-Review: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#4095536 (10Dzahn) 05Open>03Resolved puppet role removed, that removed all the ferm rules and already made it inaccessible before the network::constants change. running with role::spare now.  downtim...
[21:37:57] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4095540 (10MarcoAurelio) @Paladox Should we add `profile::etcd::tlsproxy::listen_port: <number>` to https://github.com/wikimedia/puppet/blob/production/hieradata/labs/deployment-prep/host/deplo...
[21:50:37] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka04 - https://phabricator.wikimedia.org/T191154#4095542 (10MarcoAurelio)
[21:52:29] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4093928 (10Dzahn) this also fixed puppet runs on a bunch of other deployment-* hosts thanks to using common.yaml instead of ./hosts/  bonus token for that ! thanks
[21:58:11] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka04 - https://phabricator.wikimedia.org/T191154#4095555 (10EddieGP) Role deleted from puppet git by @Ottomata in 661eea7bda, but still applied to deployment-kafka04 according to https://tools.wmfl...
[22:14:58] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4095558 (10EddieGP)
[22:15:08] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4095542 (10EddieGP) Affects deployment-kafka05 as well.
[22:27:15] <wikibugs>	 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4019041 (10KATMAKROFAN) Why can't we just merge it into Meta-Wiki?
[23:00:22] <wikibugs>	 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4095598 (10KATMAKROFAN) After renaming foundationwiki, we should enable use of LDAP accounts on there.
[23:12:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:14:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen