[00:15:39] (PS1) EddieGP: hiera: Kill some hiera paths [labs/private] - https://gerrit.wikimedia.org/r/423189 [00:21:00] (PS1) EddieGP: cloud hiera: Remove unused paths from hierarchy [puppet] - https://gerrit.wikimedia.org/r/423190 [00:39:51] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4080501 (EddieGP) The title for swift::init_device comes from a hiera lookup (hiera key `swift_storage_drives`) . Openstack browser shows this key is set to the value 'lv-a' on deployment-... [00:46:54] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4094961 (EddieGP) Related: T184236 and the attached patches. [00:54:20] (CR) EddieGP: "As far as the (proven to be outdated on other issues today) docs say, role-based lookup isn't supported in labs puppet. Although I don't k" [labs/private] - https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) (owner: MarcoAurelio) [03:26:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.89 seconds [04:01:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 205.30 seconds [05:16:38] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 1643 MB (3% inode=99%) [05:30:19] PROBLEM - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.131 and port 9042: Connection refused [05:30:19] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:30:28] PROBLEM - cassandra-b service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [05:30:29] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused [05:30:29] PROBLEM - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [05:30:49] PROBLEM - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.130 and port 9042: Connection refused [05:30:59] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:31:08] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [05:31:09] PROBLEM - cassandra-b SSL 10.64.48.130:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [05:31:18] PROBLEM - cassandra-c service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [05:44:28] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [05:44:29] RECOVERY - cassandra-b service on restbase1009 is OK: OK - cassandra-b is active [05:45:08] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [05:45:19] RECOVERY - cassandra-c service on restbase1009 is OK: OK - cassandra-c is active [06:22:28] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [06:38:38] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:45:07] (PS1) Elukey: profile::prometheus::alerts: fix new eventlogging alarms [puppet] - https://gerrit.wikimedia.org/r/423198 [06:45:45] (CR) Elukey: [C: 2] profile::prometheus::alerts: fix new eventlogging alarms [puppet] - https://gerrit.wikimedia.org/r/423198 (owner: Elukey) [09:52:31] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 26 probes of 304 
(alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [09:55:41] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received [09:55:41] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for Apri [09:55:41] ut before a response was received [09:56:50] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200) [09:56:50] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [09:56:50] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregat [09:56:50] April 29, 2016) timed out before a response was received [09:57:20] PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused [09:57:31] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 5 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [09:57:41] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [09:57:41] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [09:57:50] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [09:58:20] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.035 second response time [09:59:41] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [09:59:41] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [09:59:41] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [09:59:41] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [10:05:31] (PS1) MarcoAurelio: Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423231 [10:06:54] (CR) MarcoAurelio: "> In re. 
EddieGP" [labs/private] - https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) (owner: MarcoAurelio) [10:14:55] (PS1) EddieGP: hiera: fix deployment-mira, lacking ::git_owner [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) [10:15:12] (CR) EddieGP: [C: 1] Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423231 (owner: MarcoAurelio) [10:16:38] (CR) EddieGP: "@MarcoAurelio: You added the correct key to hiera, you just happened to add it to some hiera file that's not actually used anywhere (pleas" [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP) [10:35:57] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4095116 (MarcoAurelio) I've been looking into https://horizon.wikimedia.org/project/puppet/ and apparently I cannot do anything from there but to simply see. I am also unfamiliar with Pupp... [10:39:07] (PS1) EddieGP: hiera: Remove unused paths [labs/private] - https://gerrit.wikimedia.org/r/423233 [10:53:01] (CR) MarcoAurelio: "> @MarcoAurelio: [...]" [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP) [11:13:06] !log stopping restbase1009-a (high hints storage) [11:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:28] !log restarting restbase1009-b [11:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:40] RECOVERY - Disk space on restbase1009 is OK: DISK OK [11:17:01] \o [11:17:03] \o/ [11:17:20] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2018-08-17 16:11:02 +0000 (expires in 139 days) [11:17:38] hrmm... 
puppet must have brought that one up [11:17:51] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [11:18:18] !log truncating hints, restbase1009-a [11:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:49] !log starting restbase1009-c [11:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:03] !log removing corrupt commitlog segment, restbase1009-b [11:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:10] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:26:00] !log removing corrupt commitlog segment, restbase1009-c [11:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:10] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [11:28:41] RECOVERY - cassandra-b SSL 10.64.48.130:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-b valid until 2018-08-17 16:11:03 +0000 (expires in 139 days) [11:29:40] RECOVERY - cassandra-b CQL 10.64.48.130:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.130 port 9042 [11:30:11] RECOVERY - cassandra-c CQL 10.64.48.131:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.131 port 9042 [11:30:20] RECOVERY - cassandra-c SSL 10.64.48.131:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-c valid until 2018-08-17 16:11:04 +0000 (expires in 139 days) [11:43:25] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4095140 (EddieGP) Actually this is a duplicate. After https://gerrit.wikimedia.org/r/#/c/361648/ the "/dev/swift/" part will be implicit as well as the trailing "1", and after https://gerr... 
[11:43:45] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-ms-be03 - https://phabricator.wikimedia.org/T190683#4095142 (EddieGP) [11:43:51] Operations, Puppet, Beta-Cluster-Infrastructure, media-storage, Patch-For-Review: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#4095145 (EddieGP) [12:01:01] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:01:10] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:01:31] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:01:40] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:01:40] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:01:40] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:01:40] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:01:40] PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused [12:02:01] RECOVERY - changeprop endpoints health on scb1002 is OK: All 
endpoints are healthy [12:02:10] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [12:02:40] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [12:02:40] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.025 second response time [12:02:41] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:03:41] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:04:30] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [12:04:31] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [12:04:31] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [12:04:40] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [12:04:40] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [12:04:40] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [12:04:40] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [12:04:41] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [12:04:41] PROBLEM - eventstreams on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 8092: Connection refused [12:05:10] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: 
/v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [12:05:20] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (retrieve structured reference data for the Cat article on English Wikipedia) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-w [12:05:20] ns for cat) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received [12:05:41] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.035 second response time [12:06:01] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [12:06:10] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:19:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:27:01] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [12:43:53] Operations, Beta-Cluster-Infrastructure, DNS, Traffic, and 3 others: Ferm/DNS library weirdness causing puppet errors on some deployment-prep instances - 
https://phabricator.wikimedia.org/T153468#4095180 (EddieGP) [12:54:19] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4095186 (EddieGP) Open>Resolved a:EddieGP >>! In T132259#3879429, demon wrote: > Is this really best as a tracking task or should we add it to the... [13:53:47] (PS1) ArielGlenn: allow configuration of extra dir to search for prefetch files [dumps] - https://gerrit.wikimedia.org/r/423241 [13:53:49] (PS1) ArielGlenn: use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 [13:54:10] (CR) jerkins-bot: [V: -1] use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 (owner: ArielGlenn) [13:56:18] (PS2) ArielGlenn: use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 [13:56:37] (CR) jerkins-bot: [V: -1] use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 (owner: ArielGlenn) [13:57:20] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [13:58:06] (PS3) ArielGlenn: use files from an optional 'prefetch dir' for prefetch [dumps] - https://gerrit.wikimedia.org/r/423242 [16:20:01] (CR) EddieGP: "recheck" [puppet] - https://gerrit.wikimedia.org/r/398399 (owner: EddieGP) [16:31:40] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a 
response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out be [16:31:41] received [16:32:40] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received [16:32:41] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:32:41] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:32:50] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:33:00] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) timed out before a response was received: /{domain}/v1/page/css/mobile/app/bundle (Untitled test) timed out before a response was received: / (spec from root) timed out before a response was received: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve exte [16:33:00] ideo article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received [16:33:11] PROBLEM - restbase endpoints health on restbase1007 is 
CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:33:11] PROBLEM - apertium apy on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:20] PROBLEM - eventstreams on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 8092: Connection refused [16:33:20] PROBLEM - SSH on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:30] PROBLEM - pdfrender on scb1002 is CRITICAL: connect to address 10.64.16.21 and port 5252: Connection refused [16:33:31] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:33:40] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [16:33:40] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:33:40] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:33:40] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received [16:33:40] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [16:33:40] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [16:33:41] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [16:33:51] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy 
[16:34:10] RECOVERY - apertium apy on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.002 second response time [16:34:10] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [16:36:41] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [16:36:41] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received [16:36:50] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [16:36:50] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:36:50] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received [16:37:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve most-read articles for date with no data (with aggregated=true)) timed out before a response was received: 
/{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mob [16:37:01] out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/css/mobile/app/base (Untitled test) timed out before a response was received [16:37:20] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received [16:37:50] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:38:00] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:38:11] RECOVERY - SSH on scb1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [16:38:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) timed out before a response was received [16:38:40] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:38:43] hmm [16:38:50] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [16:38:50] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [16:38:53] ops ^^ [16:39:20] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [16:39:21] PROBLEM - 
graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received [16:39:30] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy [16:39:30] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:39:40] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [16:39:40] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [16:39:41] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [16:39:50] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [16:39:50] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [16:39:50] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [16:39:50] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [16:40:01] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) 
timed out before a response was received [16:40:11] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy [16:40:30] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time [16:40:41] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [16:40:41] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [16:42:00] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [16:42:20] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.027 second response time [17:16:17] (PS2) Dzahn: hiera: fix deployment-mira, lacking ::git_owner [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP) [17:16:35] (PS1) Dzahn: Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423247 [17:16:53] (CR) Dzahn: [C: 2] hiera: fix deployment-mira, lacking ::git_owner [puppet] - https://gerrit.wikimedia.org/r/423232 (https://phabricator.wikimedia.org/T191110) (owner: EddieGP) [17:17:13] (CR) Dzahn: [V: 2 C: 2] Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423247 (owner: Dzahn) [17:22:00] (PS1) Dzahn: smokeping: replace bast1001 with bast1002 target [puppet] - https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412) [17:24:57] (PS1) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 [17:25:34] (CR) jerkins-bot: [V: -1] site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (owner: Dzahn) [17:26:31] (PS2) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412) 
[17:26:37] (PS1) Dzahn: network::constants: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412) [17:26:44] ..and out again [17:27:29] (CR) Dzahn: [V: 2 C: 2] "replaced by https://gerrit.wikimedia.org/r/423232" [labs/private] - https://gerrit.wikimedia.org/r/423247 (owner: Dzahn) [17:34:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:44:29] (CR) EddieGP: "Done in https://gerrit.wikimedia.org/r/c/423247" [labs/private] - https://gerrit.wikimedia.org/r/423231 (owner: MarcoAurelio) [18:15:01] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:22:50] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [19:34:50] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 1 down 0 [19:53:44] Operations, monitoring: add tftpd monitoring - https://phabricator.wikimedia.org/T190439#4095453 (Dzahn) a:Dzahn [19:58:59] (PS2) Dzahn: smokeping: replace bast1001 with bast1002 target [puppet] - https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412) [20:00:02] (PS3) Dzahn: smokeping: replace bast1001 with bast1002 target [puppet] - https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412) [20:00:29] (CR) Dzahn: [C: 2] smokeping: replace bast1001 with bast1002 target [puppet] - https://gerrit.wikimedia.org/r/423248 (https://phabricator.wikimedia.org/T183412) (owner: Dzahn) [20:02:27] Puppet, 
Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4095458 (MarcoAurelio) Still erroring: ``` maurelio@deployment-mira:~$ sudo puppet agent -tv Info: Using configured environment 'future' I...
[20:20:42] (Draft1) Paladox: hiera: fix deployment-mira, lacking ::git_group [puppet] - https://gerrit.wikimedia.org/r/423256 (https://phabricator.wikimedia.org/T191110)
[20:20:46] (PS2) Paladox: hiera: fix deployment-mira, lacking ::git_group [puppet] - https://gerrit.wikimedia.org/r/423256 (https://phabricator.wikimedia.org/T191110)
[20:21:31] (CR) Dzahn: [C: 2] hiera: fix deployment-mira, lacking ::git_group [puppet] - https://gerrit.wikimedia.org/r/423256 (https://phabricator.wikimedia.org/T191110) (owner: Paladox)
[20:28:44] (PS3) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[20:29:01] (CR) jerkins-bot: [V: -1] site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412) (owner: Dzahn)
[20:34:45] (Draft1) Paladox: hiera: fix deployment-tin, lacking :: git_owner and :: git_group [puppet] - https://gerrit.wikimedia.org/r/423257
[20:34:47] (PS2) Paladox: hiera: fix deployment-tin, lacking :: git_owner and :: git_group [puppet] - https://gerrit.wikimedia.org/r/423257
[20:35:09] (PS4) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[20:35:46] (PS3) Paladox: hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - https://gerrit.wikimedia.org/r/423257
[20:38:34] (PS5) Dzahn: site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412)
[20:38:43] (PS4)
Paladox: hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - https://gerrit.wikimedia.org/r/423257
[20:39:06] (CR) EddieGP: [C: 1] hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - https://gerrit.wikimedia.org/r/423257 (owner: Paladox)
[20:39:50] (CR) Dzahn: [C: 2] site/install/bastionhost: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423249 (https://phabricator.wikimedia.org/T183412) (owner: Dzahn)
[20:40:45] (PS5) Dzahn: hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - https://gerrit.wikimedia.org/r/423257 (owner: Paladox)
[20:41:27] (CR) Dzahn: [C: 2] hiera: fix deployment-tin, lacking ::git_owner and ::git_group [puppet] - https://gerrit.wikimedia.org/r/423257 (owner: Paladox)
[20:41:45] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4095478 (Paladox)
[20:41:49] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4095477 (Paladox) Open>Resolved
[20:42:05] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4093928 (Paladox)
[20:43:17] (Abandoned) Dzahn: Revert "hieradata: fix for deployment-tin/mira lack of ::git_owner" [labs/private] - https://gerrit.wikimedia.org/r/423231 (owner: MarcoAurelio)
[20:47:51] (PS2) Dzahn: network::constants: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412)
[20:57:11] PROBLEM - Check systemd state on labtestmetal2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:04:40] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx - https://phabricator.wikimedia.org/T191151#4095484 (MarcoAurelio)
[21:10:00] (CR) Dzahn: [C: 2] network::constants: remove bast1001 [puppet] - https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412) (owner: Dzahn)
[21:12:33] (CR) Dzahn: [C: 2] "not a bastion anymore, the role has been removed and with it the ferm rules on it" [puppet] - https://gerrit.wikimedia.org/r/423250 (https://phabricator.wikimedia.org/T183412) (owner: Dzahn)
[21:13:33] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx - https://phabricator.wikimedia.org/T191151#4095500 (EddieGP)
[21:13:37] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4095503 (EddieGP)
[21:14:50] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx02 due to some Letsencrypt stuff - https://phabricator.wikimedia.org/T191152#4095504 (MarcoAurelio)
[21:15:59] !log bast1001 has been shutdown and decom'ed as planned.
if you have any issues with shell access make sure you have replaced with bast1002 or any other bast host
[21:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:07] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx02 due to some Letsencrypt stuff - https://phabricator.wikimedia.org/T191152#4095516 (EddieGP)
[21:18:10] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4095518 (EddieGP)
[21:27:25] Operations, hardware-requests: decom bast1001 - https://phabricator.wikimedia.org/T191153#4095519 (Dzahn) p:Triage>High
[21:27:26] \o/
[21:27:38] Operations, hardware-requests: decom bast1001 - https://phabricator.wikimedia.org/T191153#4095519 (Dzahn) a:Dzahn>None
[21:27:43] bye bye bast1001, it was nice to know you
[21:28:04] Puppet, Beta-Cluster-Infrastructure: Puppet broken on deployment-mx due to systemd on trusty - https://phabricator.wikimedia.org/T184244#4095534 (MarcoAurelio)
[21:28:22] apergos: :)
[21:28:26] Puppet, Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4095535 (MarcoAurelio) https://github.com/search?q=org%3Awikimedia+profile%3A%3Aetcd%3A%3Atlsproxy%3A%3Alisten_port&type=Code
[21:30:24] Operations, Patch-For-Review: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#4095536 (Dzahn) Open>Resolved puppet role removed, that removed all the ferm rules and already made it inaccesible before the network::constants change. running with role::spare now. downtim...
[21:37:57] Puppet, Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4095540 (MarcoAurelio) @Paladox Should we add `profile::etcd::tlsproxy::listen_port: ` to https://github.com/wikimedia/puppet/blob/production/hieradata/labs/deployment-prep/host/deplo...
[21:50:37] Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka04 - https://phabricator.wikimedia.org/T191154#4095542 (MarcoAurelio)
[21:52:29] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4093928 (Dzahn) this also fixed puppet runs on a bunch of other deployment-* hosts thanks to using common.yaml instead of ./hosts/ bonus token for that ! thanks
[21:58:11] Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka04 - https://phabricator.wikimedia.org/T191154#4095555 (EddieGP) Role deleted from puppet git by @Ottomata in 661eea7bda, but still applied to deployment-kafka04 according to https://tools.wmfl...
[22:14:58] Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4095558 (EddieGP)
[22:15:08] Puppet, Beta-Cluster-Infrastructure: Error: Could not find class role::kafka::jumbo::mirror for deployment-kafka0[45] - https://phabricator.wikimedia.org/T191154#4095542 (EddieGP) Affects deployment-kafka05 as well.
[22:27:15] Operations, DNS, Release-Engineering-Team, Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4019041 (KATMAKROFAN) Why can't we just merge it into Meta-Wiki?
[23:00:22] Operations, DNS, Release-Engineering-Team, Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4095598 (KATMAKROFAN) After renaming foundationwiki, we should enable use of LDAP accounts on there.
[23:12:04] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[23:14:04] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen