[00:00:04] Deploy window No Deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180815T0000) [00:00:29] (Also SWAT is still going because Zuul still hasn't started on one of the patches I +2ed 56 minutes ago) [00:01:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Kbrown) >>! In T201668#4503579, @Legoktm wrote: > Did she log into wikitech and set a real password instead... [00:03:54] 10Operations, 10Wikidata, 10monitoring, 10Patch-For-Review, 10User-Addshore: Add Addshore & possibly other WMDE devs/deployers to the wikidata icinga contact list - https://phabricator.wikimedia.org/T195289 (10Dzahn) Yea, i think the only part missing here is that you (guys) confirm if it works. The "log... [00:08:41] Yay they merged [00:12:14] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.16/extensions/PageTriage/: SWAT: PageTriage fixes (T199357, T201812, T201560, T201373, T201253) (duration: 00m 51s) [00:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:30] T199357: New Pages Feed: score draftquality on most recent revision - https://phabricator.wikimedia.org/T199357 [00:12:30] T201373: New Pages Feed: "created date" value changes depending on sort - https://phabricator.wikimedia.org/T201373 [00:12:31] T201253: Special:NewPagesFeed should ignore ptrp_reviewed status for Drafts - https://phabricator.wikimedia.org/T201253 [00:12:32] (03CR) 10Dzahn: "oh yea, good point about hieradata. there are also both a primary.yaml and a production.yaml in common/eqiad/codfw but they are all identi" [puppet] - 10https://gerrit.wikimedia.org/r/449763 (owner: 10Dzahn) [00:12:32] T201560: NewPagesFeed in AFC mode shows each page 4 times when all four "predicted issues" checkboxes are checked - https://phabricator.wikimedia.org/T201560 [00:12:33] T201812: Running Echo as dependency of PageTriage fails unit tests - https://phabricator.wikimedia.org/T201812 [00:15:25] (03CR) 10Dzahn: "also the coal class has been deleted/moved meanwhile. rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/449763 (owner: 10Dzahn) [00:15:49] (03PS2) 10Dzahn: graphite: delete duplicate role(graphite::primary) [puppet] - 10https://gerrit.wikimedia.org/r/449763 [00:17:47] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) >>! In T201668#4503699, @Kbrown wrote: >>>! In T201668#4503579, @Legoktm wrote: >> Did she log... [00:18:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) (Granted I obviously can't do that now for them when the account is already created... though I... [00:28:15] RoanKattouw: SWAT done? [00:30:05] (03PS3) 10Dzahn: graphite: delete duplicate role(graphite::primary) [puppet] - 10https://gerrit.wikimedia.org/r/449763 [00:30:20] James_F: Yes, done [00:30:42] RoanKattouw: Good, will merge away then. 
[00:38:50] (03CR) 10Dzahn: "compiler output now looking good: http://puppet-compiler.wmflabs.org/12092/" [puppet] - 10https://gerrit.wikimedia.org/r/449763 (owner: 10Dzahn) [00:50:03] (03CR) 10Filippo Giunchedi: "LGTM, though it can wait after vacations for merge" [puppet] - 10https://gerrit.wikimedia.org/r/449763 (owner: 10Dzahn) [00:50:57] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10Anomie) ApiFeatureUsage doesn't depend on it being in logstash, but it's convenient for looking at usage of deprecated features to see whethe... [01:13:11] James_F: (wg)ParsoidWikiPrefix per https://wikitech.wikimedia.org/wiki/User:Krinkle/Unused_config [01:13:18] Could you maybe double-check the story behind that? [01:13:49] I'm less sure than I was last week about just removing them, because of GWToolsetConfigOverrides turning out as "not yet used" instead of "no longer used". [01:18:26] (03PS1) 10Krinkle: Remove deprecated $wgEventLoggingFile config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452877 (https://phabricator.wikimedia.org/T127209) [01:40:34] PROBLEM - Host aqs1007 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:04] RECOVERY - Host aqs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [01:41:14] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:14] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:15] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:15] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:15] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:16] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:16] PROBLEM - mobileapps endpoints 
health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:44] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:45] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:41:45] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:41:45] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:41:45] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:41:45] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:41:45] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:42:45] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [01:43:45] PROBLEM - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused [01:43:45] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [01:44:14] PROBLEM - cassandra-b CQL 10.64.0.237:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.237 and port 9042: Connection refused [01:44:25] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [01:44:44] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 
1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [01:44:54] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [01:45:15] RECOVERY - cassandra-b CQL 10.64.0.237:9042 on aqs1007 is OK: TCP OK - 0.000 second response time on 10.64.0.237 port 9042 [01:45:34] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [01:45:55] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [01:46:04] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [01:46:05] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [01:46:16] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:46:35] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [01:46:44] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [01:46:45] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [01:47:04] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [01:47:05] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [01:47:05] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [01:47:06] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [01:47:35] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [01:47:44] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [01:48:44] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [01:48:44] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [01:51:35] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [01:52:35] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:07:47] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [02:07:58] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [02:11:48] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [02:11:58] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[02:37:28] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [02:37:39] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [02:41:14] (03PS18) 10Zoranzoki21: Enable Extension:Newsletter on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381537 (https://phabricator.wikimedia.org/T177151) [02:41:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [02:41:48] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:45:48] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Wed Aug 15 02:45:47 UTC 2018 (duration 10m 24s) [02:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:48] PROBLEM - Long running screen/tmux on phab1002 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 10501, 1732509s 1728000s). [03:08:08] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [03:08:19] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [03:12:09] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:12:28] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:30:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 898.78 seconds [03:37:29] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [03:38:19] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [03:41:39] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:42:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [03:56:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 237.20 seconds [04:06:30] Krinkle: ParsoidWikiPrefix dates from pre-RESTBase. [04:07:29] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [04:07:39] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [04:11:38] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:11:48] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:37:39] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [04:37:49] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [04:39:09] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.58 seconds [04:41:49] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [04:42:08] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:06:58] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [05:07:08] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [05:11:08] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:11:09] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:17:12] Dear Developer, Please review https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/451823/ [05:37:09] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [05:37:18] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [05:41:19] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [05:41:29] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:55:38] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.60 seconds [06:07:28] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [06:07:29] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [06:11:38] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [06:11:39] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:31:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:35:58] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:37:09] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [06:37:18] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [06:41:28] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [06:41:29] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:42:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:44:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:46:42] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp5010.eqsin.wmnet', 'cp4021.ulsfo.wmnet'] ``` The log can be found in `/var/l... 
[06:47:19] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [06:47:49] RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational [06:50:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:53:02] 10Operations, 10docker-pkg: operations-puppet:0.3.4 doesn't seem to be properly published - https://phabricator.wikimedia.org/T201952 (10ema) p:05Triage>03Normal [06:54:28] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5010_v4, cp5010_v6 [06:54:49] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [06:55:58] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 11.49 seconds [06:56:57] 10Operations, 10docker-pkg: operations-puppet:0.3.4 doesn't seem to be properly published - https://phabricator.wikimedia.org/T201952 (10ema) One of the build failures caused by this is https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/26286/console ``` 17:58:51 + exec docker run --rm -... [06:58:33] ACKNOWLEDGEMENT - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp5010_v4, cp5010_v6 Ema reimage 5010 [07:07:35] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [07:07:36] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [07:07:55] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:11:41] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [07:11:41] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:14:01] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:21:26] godog: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/452744/ is unmerged [07:21:38] godog: running puppet-merge now as it seems fine [07:22:08] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [07:22:17] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:22:58] moritzm: how do we feel about the constant "MediaWiki memcached error rate" alerts? [07:24:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:24:39] I think these have been flapping for a few days, but I can doublecheck in a bit [07:25:13] ok, do we know the cause of those? How to check what's going on? 
[07:31:11] ema: looking at logstash, there was a surge of errors where mw1230 failed to connect to its local mcrouter instance, looking at local logs on the host [07:33:54] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [07:33:56] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4021.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4021.ulsfo.wmnet'] ``` [07:37:04] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [07:37:13] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [07:37:57] interesting, aqs1007 recovered just as I was looking into what went wrong ^ [07:38:59] /var/log/cassandra/system-a.log showed a commit log error: "Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log" [07:39:34] so I ran `ls -l /srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log` and a few seconds ago the service recovered [07:39:47] you clearly scared it [07:41:13] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [07:41:13] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:41:47] not for long! [07:48:53] as it seems to be tripping over CommitLog-5-1530620590775.log repeatedly maybe moving that file away is a suitable fix, OTOH it might cause other fallout [07:50:03] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:51:11] gehel: maybe you have suggestions? [07:52:45] ema: just back from daycare, reading backlog... [07:52:51] <3 [07:53:10] ema: cassandra on aqs? [07:53:15] gehel: yup [07:57:41] we should be able to delete the problematic commitlog and do a nodetool repair to recover data from the other nodes [07:58:16] gehel: ok, maybe mv /srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log /var/tmp/ ? [07:58:18] but I would wait until we have someone who actually understands cassandra before deleting anything [07:59:01] ema: I **think** that should work, but we might corrupt the state of the cluster as well (I just don't know much about cassandra) [07:59:34] waiting seems reasonable then :) [07:59:59] yeah! deleting a commit log feels like a scary operation [08:00:31] also it looks like we have enough redundancy on that cluster that losing 1 node should not be a major issue [08:00:46] perfect, thanks for looking! [08:00:47] (03PS3) 10Gehel: Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [08:01:41] (03CR) 10Gehel: [C: 032] Changing day of the cron for testing [puppet] - 10https://gerrit.wikimedia.org/r/452467 (https://phabricator.wikimedia.org/T194787) (owner: 10MSantos) [08:02:27] Agreed. I checked "nodetool-a status -r" on aqs1008 and that seems fine [08:06:43] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [08:06:43] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [08:07:01] it recovered on its own? 
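[Editor's note: for reference, the recovery path discussed above — park the unreadable commit log segment rather than deleting it, then repair from the other replicas — would look roughly like the sketch below. It is assembled from the commands mentioned in the log (the commit log path, the cassandra-a unit and the nodetool-a per-instance wrapper all appear above) and was not what was actually run at this point.]

```
# Sketch only: park the segment Cassandra cannot read, instead of deleting it outright.
sudo mv /srv/cassandra-a/commitlog/CommitLog-5-1530620590775.log /var/tmp/
# Bring the instance back up without the bad segment.
sudo systemctl restart cassandra-a
# Check the ring from a healthy node (as done on aqs1008 above) before repairing.
nodetool-a status -r
# Stream back any data that only existed in the parked commit log segment.
nodetool-a repair
```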
[08:08:00] it has been flapping earlier as well [08:09:47] it is dead again, probably green only for a short time when trying to start and then failing [08:09:52] should we mask that unit? [08:10:15] gehel: +1 [08:10:31] ok, doing it [08:10:57] ema: do we already have a phab task for this? [08:11:19] gehel: we don't, I'm gonna open one now [08:11:33] ema: I can do it, and I'll add the log to it [08:11:39] and you can add whatever info you have [08:11:53] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:11:53] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [08:12:01] gehel: alright! [08:12:34] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [08:13:43] 10Operations, 10Cassandra: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 (10Gehel) [08:14:14] !log masking cassandra-a instance on aqs1007 since it is flapping - T201986 [08:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:22] T201986: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 [08:14:41] ema: ^ [08:15:04] gehel: great, thanks! [08:15:13] ema: no problem! [08:15:18] and now coffee time [08:15:34] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3032.esams.wmnet', 'cp4022.ulsfo.wmnet'] ``` The log can be found in `/var/l... [08:15:35] reimage time here [08:15:59] ema: sadly I have to wait a bit before I can start my own batch of reimages :) [08:16:14] 10Operations, 10Cassandra: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 (10ema) p:05Triage>03Normal [08:21:10] 10Operations, 10Cassandra: cassandra-a instance on aqs1007 is not starting - https://phabricator.wikimedia.org/T201986 (10ema) It looks like the host is up only since ~6 hours, and cassandra-a never actually managed to start. ``` root@aqs1007:~# uptime ; date ; journalctl -u cassandra-a.service | head 08:20:2... 
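[Editor's note: the masking step !logged above amounts to the following minimal sketch, using the unit name shown in the alerts. Masking keeps the flapping instance from being (re)started until it is explicitly unmasked, which stops the alert noise while the commit log problem in T201986 is investigated.]

```
# Stop the restart loop on aqs1007 while T201986 is investigated.
sudo systemctl mask cassandra-a
sudo systemctl stop cassandra-a
# Undo later, once the instance can start cleanly again:
#   sudo systemctl unmask cassandra-a && sudo systemctl start cassandra-a
```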
[08:22:03] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4022_v4, cp4022_v6 [08:22:04] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [08:22:04] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [08:22:05] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4022_v4, cp4022_v6 [08:22:05] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4022_v4, cp4022_v6 [08:22:13] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3032_v4, cp3032_v6 [08:22:13] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3032_v4, cp3032_v6 [08:22:13] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3032_v4, cp3032_v6 [08:22:13] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp3032_v4, cp3032_v6 [08:22:14] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4022_v4, cp4022_v6 [08:22:14] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp4022_v4, cp4022_v6 [08:22:14] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 66 not-conn: cp4022_v4, cp4022_v6 [08:22:28] sorry that's me, ignore ^ [08:30:40] (03PS1) 10Muehlenhoff: Tweak fragementation settings [puppet] - 10https://gerrit.wikimedia.org/r/452901 [08:37:44] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [08:37:45] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [08:39:35] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [08:39:35] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 68 ESP OK [08:39:35] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 68 ESP OK [08:39:44] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 52 ESP OK [08:39:45] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [08:39:45] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [08:39:45] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [08:39:46] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [08:39:46] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [08:39:46] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [08:39:54] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [08:39:54] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 68 ESP OK [08:41:05] wmf-auto-reimage seems to have some bad race, if I reimage two hosts, the second one fails at a certain point: [08:41:08] 08:36:22 | cp4022.ulsfo.wmnet | Unable to run wmf-downtime-host: Failed to icinga_downtime [08:41:26] it's probably easier to just run wmf-auto-reimage-host multiple times [08:41:45] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
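[Editor's note: a hypothetical illustration of the workaround mentioned at 08:41 — reimaging hosts serially with the single-host script instead of one parallel run, to sidestep the icinga_downtime race. The options of wmf-auto-reimage-host are not shown in the log, so the invocation shape below (hostname as the argument) is an unverified assumption.]

```
# Assumed invocation shape -- serial reimages avoid the Icinga downtime race.
for host in cp3032.esams.wmnet cp4022.ulsfo.wmnet; do
    sudo -i wmf-auto-reimage-host "$host"
done
```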
[08:41:54] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [08:42:42] ema: yes there is a known race due to puppet being so slow on the icinga hosts, it usually happens with many hosts in parallel [08:43:08] it's in the TODO, just E_NOTIME ;) [08:43:34] https://jynus.com/better-call-volans.jpg [08:44:04] ahaha [08:45:24] PROBLEM - Varnish traffic logger - varnishmedia on cp4022 is CRITICAL: NRPE: Command check_varnishmedia not defined [08:45:29] ACKNOWLEDGEMENT - Varnish HTTP upload-backend - port 3128 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3128: Connection refused Ema Reimaging: Unable to run wmf-downtime-host: Failed to icinga_downtime [08:45:29] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend - port 3120 on cp4022 is CRITICAL: connect to address 10.128.0.122 and port 3120: Connection refused Ema Reimaging: Unable to run wmf-downtime-host: Failed to icinga_downtime [08:45:29] ACKNOWLEDGEMENT - Varnish traffic logger - varnishmedia on cp4022 is CRITICAL: NRPE: Command check_varnishmedia not defined Ema Reimaging: Unable to run wmf-downtime-host: Failed to icinga_downtime [08:46:24] RECOVERY - Varnish traffic logger - varnishmedia on cp4022 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishmedia, UID = 0 (root) [08:47:55] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp3032.esams.wmnet', 'cp4022.ulsfo.wmnet'] ``` and were **ALL** successful. [08:59:40] !log rebooting deployment-mediawiki-09 [08:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:19] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [09:08:20] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [09:12:20] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:12:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [09:17:30] !log rebooting mw2135-mw2147 for kernel security update (also bundling wikidiff and apache updates) [09:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:24] ACKNOWLEDGEMENT - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Ema https://phabricator.wikimedia.org/T201986 [09:18:24] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.213:9042 on aqs1007 is CRITICAL: connect to address 10.64.0.213 and port 9042: Connection refused Ema https://phabricator.wikimedia.org/T201986 [09:18:24] ACKNOWLEDGEMENT - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Ema https://phabricator.wikimedia.org/T201986 [09:23:56] (03CR) 10Gehel: [C: 04-1] "minor comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/452569 (https://phabricator.wikimedia.org/T201217) (owner: 10Smalyshev) [09:33:02] 10Operations, 10ops-eqiad: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 (10MoritzMuehlenhoff) [09:37:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [09:37:19] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [09:39:05] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1030.eqiad.wmnet', 'elastic1031... [09:39:20] !log reimaging elastic103[012], this will trigger master re-election [09:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:19] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:41:20] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [09:48:40] 10Operations, 10docker-pkg: operations-puppet:0.3.4 doesn't seem to be properly published - https://phabricator.wikimedia.org/T201952 (10Reedy) >>! In T201952#4504194, @ema wrote: > We reverted the [[https://gerrit.wikimedia.org/r/#/q/81a730d75cbe8554779c2c6ab9530fde12b171fd|jjb bump to 0.3.4]], after which bu... [10:03:13] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1031.eqiad.wmnet', 'elastic1030.eqiad.wmnet', 'elastic1032.eqiad.wmnet'] ``` an... [10:08:14] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [10:08:24] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [10:12:23] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:12:33] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:19:06] Images not loading - https://en.wikisource.org/wiki/File:Patients_in_mental_institutions_1948.pdf [10:37:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [10:37:20] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [10:41:09] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:41:20] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:52:25] !log rebooting mw2163-mw2189 for kernel security update (also bundling wikidiff and apache updates) [10:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:50] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [11:07:09] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [11:07:16] I know it's WMF holiday, but I need to make a deployment for https://phabricator.wikimedia.org/T201934 [11:07:48] Any ops around? [11:10:10] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:10:19] PROBLEM - traffic-pool service on cp4021 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [11:11:00] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:11:20] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:15:38] ShakespeareFan00: https://mediawiki.org/wiki/How_to_report_a_bug ; opening the image thumbnail itself shows an HTTP 500 Error [11:17:10] PROBLEM - Check systemd state on cp5010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:19:21] Amir1: Seemingly. Lots of the europeans don't tend to follow those so rigidly [11:19:43] Amir1: maybe try asking Reedy, or moritzm who just spoke about a hr ago [11:20:13] or Reedy replies as I type and get distracted watching tv >.> [11:20:22] Reedy: yup, at least WMDE don't have an Ops [11:20:36] Amir1: Go for it [11:20:40] and it's not holiday here [11:20:45] okay, let's go [11:20:57] I'm semi-around too [11:21:11] bugfix seems legit :) [11:21:13] Up late or up early? ;) [11:21:24] early :) [11:26:10] zuul is pretty busy [11:26:27] I think it's on a go slow tbh [11:28:33] :/ [11:28:58] Was yesterday too [11:32:38] (03CR) 10Ladsgroup: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452716 (owner: 10Ladsgroup) [11:34:10] It says it will take 17 minutes [11:34:50] estimation in CS is bull [11:36:42] :))))) [11:37:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [11:37:29] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [11:41:20] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:41:40] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:52:10] RECOVERY - Check systemd state on cp5010 is OK: OK - running: The system is fully operational [11:53:30] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational [11:54:13] 31 minutes so far... [11:55:00] it's nearly finished [11:55:02] lol [11:55:11] 11:55:05 ............................................................. 
4941 / 7113 ( 69%) [11:55:35] :)))))))) [11:56:49] RECOVERY - traffic-pool service on cp4021 is OK: OK - traffic-pool is active [11:57:39] the cp4021-related alerts were due to reimage kinda-failing-but-not-really and me pushing things left and right [11:57:47] note to self: next time just reboot and save time [12:04:02] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.16/includes/changetags/ChangeTags.php: [[gerrit:452921|Swap SET and WHERE statements in ChangeTags::undefineTag (T201934)]] (duration: 00m 55s) [12:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:10] T201934: Duplicate entry 'emoji' for key 'ctd_name' - https://phabricator.wikimedia.org/T201934 [12:07:40] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [12:08:09] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [12:08:34] (03CR) 10Reedy: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:10:04] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:12:00] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:12:20] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:15:42] 10Operations, 10ops-eqiad: Broken memory on elastic1029 - https://phabricator.wikimedia.org/T201991 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:17:58] 10Operations, 10ops-codfw: mw2184 stuck after reboot - https://phabricator.wikimedia.org/T202006 (10MoritzMuehlenhoff) [12:18:08] 10Operations, 10ops-codfw: mw2184 stuck after reboot - https://phabricator.wikimedia.org/T202006 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:19:19] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2184.codfw.wmnet [12:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:47] !log rebooting mw2190-mw2219 for kernel security update (also bundling wikidiff and apache updates) [12:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:38] 10Operations, 10docker-pkg: operations-puppet:0.3.4 doesn't seem to be properly published - https://phabricator.wikimedia.org/T201952 (10Reedy) 05Open>03Resolved a:03Reedy Tried again this morning. Caches seemingly expired and it's working! [12:36:49] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [12:37:29] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [12:37:50] PROBLEM - IPMI Sensor Status on cp3032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [12:41:00] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:41:40] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:45:11] (03PS8) 10Paladox: Add gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) [12:51:33] !log reindexing Polish wikis on elastic@eqiad and elastic@codfw complete (T200037) [12:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:40] T200037: Re-index Polish Wikis to patch Stempel stems - https://phabricator.wikimedia.org/T200037 [12:53:44] (03PS9) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [13:02:10] (03PS10) 10Ema: ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) [13:03:45] (03CR) 10Ema: [C: 032] ATS: add Lua scripting support [puppet] - 10https://gerrit.wikimedia.org/r/451838 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:04:39] (03PS4) 10Ema: tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) [13:05:49] (03CR) 10jerkins-bot: [V: 04-1] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:07:30] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [13:08:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [13:12:20] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:12:49] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [13:21:40] 10Operations, 10ops-codfw: mw2213 correctable memory errors - https://phabricator.wikimedia.org/T194172 (10MoritzMuehlenhoff) >>! In T194172#4191150, @RobH wrote: > I would recommend that we decommission this host, as it has had multiple memory, cpu, and mainboard errors. > > @Joe: Would you be the person I n... [13:35:23] !log installing samba security updates [13:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:29] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [13:37:59] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [13:41:05] (03PS1) 10Muehlenhoff: Add library hint for samba [puppet] - 10https://gerrit.wikimedia.org/r/452936 [13:41:39] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:41:59] (03PS5) 10Ema: tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) [13:42:09] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [13:44:27] (03CR) 10Ema: [C: 032] tox: add ts-lua tests for trafficserver [puppet] - 10https://gerrit.wikimedia.org/r/452612 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [13:45:39] (03PS2) 10Muehlenhoff: Add library hint for samba [puppet] - 10https://gerrit.wikimedia.org/r/452936 [13:46:49] (03CR) 10Muehlenhoff: [C: 032] Add library hint for samba [puppet] - 10https://gerrit.wikimedia.org/r/452936 (owner: 10Muehlenhoff) [13:47:31] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2184.codfw.wmnet [13:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:22] 10Operations, 10Analytics, 10Analytics-Kanban: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) p:05Triage>03Normal [13:55:54] (03CR) 10Ottomata: [C: 032] Add reference to Wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/452738 (https://phabricator.wikimedia.org/T201653) (owner: 10Milimetric) [13:56:01] (03PS2) 10Ottomata: Add reference to Wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/452738 (https://phabricator.wikimedia.org/T201653) (owner: 10Milimetric) [13:56:03] (03CR) 10Ottomata: [V: 032 C: 032] Add reference to Wikitech docs [puppet] - 10https://gerrit.wikimedia.org/r/452738 (https://phabricator.wikimedia.org/T201653) (owner: 10Milimetric) [14:00:49] PROBLEM - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:11] ACKNOWLEDGEMENT - Host mw2184 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T202006 [14:06:49] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [14:07:10] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [14:11:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:11:59] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:13:24] !log reimaging elastic102[567] [14:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:43] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1025.eqiad.wmnet', 'elastic1026... 
[14:15:34] (03PS5) 10Andrew Bogott: openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) [14:15:36] (03PS5) 10Andrew Bogott: Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) [14:16:51] (03CR) 10Andrew Bogott: [C: 032] openstack glance: move active service for eqiad1 and main to cloudcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/452595 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott) [14:17:01] (03CR) 10Andrew Bogott: [C: 032] Openstack glance: remove glance service from labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/452596 (https://phabricator.wikimedia.org/T191791) (owner: 10Andrew Bogott) [14:24:28] (03PS1) 10Ema: WIP: trafficserver regex_map rules [puppet] - 10https://gerrit.wikimedia.org/r/452941 [14:29:44] (03PS1) 10Ottomata: Exclude readme.html from being deleted during dumps::web::fetches::stats jobs [puppet] - 10https://gerrit.wikimedia.org/r/452945 (https://phabricator.wikimedia.org/T201653) [14:30:22] Is EventLogging the only way to do client-side (JavaScript) event logging? [14:31:11] davidwbarratt: for the most part, yes [14:31:21] there is also statsv for more operational measures [14:31:21] ottomata okie dokie, thanks! [14:31:30] https://wikitech.wikimedia.org/wiki/Graphite#statsv [14:31:58] davidwbarratt: since you got me answering your question, you will also get a link to https://phabricator.wikimedia.org/T185233 just incase you don't know about it already :) [14:34:06] ottomata oh nice! thanks! [14:37:06] 08Warning Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Inbound interface errors [14:37:10] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [14:37:50] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [14:38:55] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1025.eqiad.wmnet', 'elastic1027.eqiad.wmnet', 'elastic1026.eqiad.wmnet'] ``` an... [14:41:14] 10Operations, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Ottomata) [14:41:27] 10Operations, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Ottomata) [14:41:29] 10Operations, 10Analytics, 10Analytics-Kanban: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Ottomata) [14:41:50] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:42:02] 10Operations, 10Analytics, 10vm-requests: eqiad: (3) VM %request for internal analytics web sites - https://phabricator.wikimedia.org/T202013 (10Ottomata) [14:42:19] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
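[Editor's note: a hedged example of the statsv option mentioned in the 14:31 exchange above. Per the linked wikitech page, operational metrics are reported by hitting the /beacon/statsv endpoint with metric=value query parameters, where the value carries a unit suffix such as "ms" for timings or "c" for counters; the metric names below are made up for illustration.]

```
# Illustrative only -- report one timing and one counter via the statsv beacon.
curl -s 'https://en.wikipedia.org/beacon/statsv?test.example_load_time=1250ms&test.example_clicks=1c' > /dev/null
```

[In the browser this would normally be sent from MediaWiki JavaScript instrumentation rather than curl; the URL format above is what the statsv consumer reads either way.]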
[14:57:05] 08̶W̶a̶r̶n̶i̶n̶g Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Inbound interface errors [14:58:18] !log reimaging elastic1028 [14:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:33] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1028.eqiad.wmnet'] ``` The log... [14:59:36] Any developer available? Sorry to bother during staff holiday https://phabricator.wikimedia.org/T200201 [15:06:44] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [15:07:14] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [15:11:14] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:11:45] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:22:30] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1028.eqiad.wmnet'] ``` and were **ALL** successful. [15:30:02] 10Operations, 10ops-eqiad: rack/setup/install cloudservices1004.wikimedia.org - https://phabricator.wikimedia.org/T201341 (10Cmjohnson) [15:30:29] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10Cmjohnson) [15:37:22] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [15:37:41] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [15:41:21] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:41:31] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:41:42] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:46:52] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [15:47:52] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [15:57:31] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [16:01:41] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. 
[16:06:21] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:07:31] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [16:07:41] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [16:11:42] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:11:52] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:27:13] !log redirect ns1 to radon - T201608 [16:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 (10daniel) [16:35:07] !log reboot authdns2001 for kernel security update [16:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:02] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [16:37:52] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [16:39:46] mutante: yes, the new key in T201913 is mine. I now signed it as well. [16:39:47] T201913: new ssh key for daniel - https://phabricator.wikimedia.org/T201913 [16:41:11] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:42:02] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:42:06] !log rollback: redirect ns1 to radon - T201608 [16:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:33] !log redirect ns0 to authdns1001 - T201608 [16:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:44] yay new dns servers mean more decoms =] [17:00:17] !log reboot radon for kernel security update [17:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:44] 10Operations, 10DNS, 10Traffic: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10MoritzMuehlenhoff) dnsauth1001 is in active duty, so it seems radon can be removed next. Adding @ema, @BBlack and @Vgutierrez for comments. [17:07:01] !log redirect ns2 to authdns1001 - T201608 [17:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:22] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [17:07:32] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [17:11:32] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [17:11:42] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:16:14] !log reboot eeden for kernel security update [17:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:47] !log Rollback: redirect ns2 to authdns1001 - T201608 [17:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:30] 10Operations: Default gateway unreachable on baham.wikimedia.org after reboot - https://phabricator.wikimedia.org/T131966 (10MoritzMuehlenhoff) This didn't happen during the latest round of dnsauth reboots, neither for the jessie nor the stretch hosts, I guess we can close this or were there other cases during t... [17:36:42] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [17:37:41] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [17:41:02] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:41:52] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:07:01] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [18:07:11] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [18:11:21] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:11:31] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:37:32] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [18:37:41] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [18:41:42] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [18:41:52] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:53:11] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:55:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [19:07:13] !log reimaging elastic10(18|19|20) [19:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:45] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic1018.eqiad.wmnet', 'elastic1019... [19:08:01] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [19:08:11] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [19:12:21] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:12:22] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
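The ns0/ns1/ns2 redirects and authdns host reboots logged above for T201608 are the kind of change that is easy to sanity-check from any workstation with dig installed: each nameserver service address should keep answering authoritatively for wikimedia.org while the backing hosts are bounced one at a time. A minimal sketch follows, assuming only that dig is available locally; the server list and the SOA-serial check are illustrative, not the actual verification procedure used by the operators here.

```python
#!/usr/bin/env python3
"""Illustrative check that each authoritative nameserver service address
still answers an SOA query for wikimedia.org while the backing hosts are
being redirected/rebooted (T201608 context). Assumes `dig` is installed;
the server list is taken from the log above, not from any config source."""

import subprocess

NAMESERVERS = ["ns0.wikimedia.org", "ns1.wikimedia.org", "ns2.wikimedia.org"]
ZONE = "wikimedia.org"


def soa_serial(server):
    """Return the SOA serial the given server reports for ZONE, or None on failure."""
    try:
        out = subprocess.run(
            ["dig", "+short", "+time=2", "+tries=1", "SOA", ZONE, "@" + server],
            capture_output=True, text=True, timeout=5, check=True,
        ).stdout.strip()
    except (subprocess.SubprocessError, OSError):
        return None
    # "+short" SOA output: "mname rname serial refresh retry expire minimum"
    fields = out.split()
    return fields[2] if len(fields) >= 7 else None


if __name__ == "__main__":
    for ns in NAMESERVERS:
        serial = soa_serial(ns)
        print("%-22s %s" % (ns, "serial " + serial if serial else "NO ANSWER"))
```

If one of the three stops answering, or the serials drift apart for longer than a zone push normally takes, that points at the host currently being redirected or rebooted.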
[19:17:58] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) [19:27:11] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) [19:33:02] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic1019.eqiad.wmnet', 'elastic1020.eqiad.wmnet', 'elastic1018.eqiad.wmnet'] ``` an... [19:34:01] 10Operations, 10Cloud-Services, 10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559 (10Andrew) p:05Triage>03Normal [19:35:29] 10Operations, 10Cloud-Services, 10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559 (10Andrew) a:05Andrew>03RobH @robh, sorry, this task seems to have been lost in phab for a while. Silve... [19:37:29] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [19:37:39] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [19:39:38] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:41:38] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:41:39] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:45:44] 10Operations, 10Maps-Sprint, 10Maps (Tilerator): Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939 (10Mholloway) >>! In T137939#2779156, @Gehel wrote: > Replication frequency is set to 1 hour on the maps-test cluster. We can see that the server load average and IO peaks every... [19:47:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:49:28] RECOVERY - Auth DNS on cloudservices1003 is OK: DNS OK: 0.008 seconds response time. labs-ns2.wikimedia.org returns [20:07:32] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [20:07:33] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [20:11:33] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:11:42] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:18:03] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) So dbproxy1015 drac isn't responsive to network, and dbproxy1017 has a media check failure when attempting to boot PXE. [20:21:13] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) Odd issue attempting to pxe boot dbproxy1016. 
It gets no free leases from dhcp, so it cannot then be served the tftp image since it's not getting an IP address assignment. Th... [20:28:08] Can somebody please help us with T201314 ? [20:28:09] T201314: Please unblock stuck global rename: EricEnfermero to Larry Hockett - https://phabricator.wikimedia.org/T201314 [20:28:20] it's 10 days already, thanks [20:32:33] it's a WMF holiday [20:37:23] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [20:37:32] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [20:41:42] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [20:41:43] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:43:59] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) a:05RobH>03Cmjohnson Ok, @ayounsi and I tracked down this issue. @Cmjohnson: dbproxy1015 and dbproxy1016 have the same IP assigned for mgmt, they both are using dbproxy10... [20:44:22] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) [20:44:27] and wikibugs is not reporting any gerrit activity apparently [20:45:39] * Krinkle staging on mwdebug1002/deploy1001 [20:47:43] 10Operations, 10decommission: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559 (10RobH) 05Open>03Resolved [20:47:46] 10Operations, 10decommission: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559 (10RobH) 05Resolved>03Open [20:48:29] 10Operations, 10decommission: decom silver (was silver has trouble rebooting) - https://phabricator.wikimedia.org/T168559 (10RobH) [20:49:28] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I207421c87eb6d2f1 (duration: 00m 57s) [20:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:41] (03CR) 10RobH: [C: 032] dbproxy101[56] mac addresses need commenting out [puppet] - 10https://gerrit.wikimedia.org/r/453061 (owner: 10RobH) [21:02:50] (03CR) 10jenkins-bot: Remove deprecated $wgEventLoggingFile config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452877 (https://phabricator.wikimedia.org/T127209) (owner: 10Krinkle) [21:05:20] (03PS1) 10Krinkle: Remove use of deprecated StartProfiler.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453062 (https://phabricator.wikimedia.org/T201782) [21:05:23] (03PS1) 10Krinkle: Remove StartProfiler.php (no longer used) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453063 (https://phabricator.wikimedia.org/T201782) [21:05:38] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10RobH) [21:05:41] (03PS1) 10Legoktm: php72: Add more missing extensions that php5.6 had [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/453064 (https://phabricator.wikimedia.org/T188318) [21:07:53] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [21:08:02] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [21:10:26] (03CR) 10Legoktm: [C: 04-1] "Needs newer libzip" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/453064 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm)
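The dbproxy1015/dbproxy1016 PXE failure above came down to two hosts sharing one mgmt IP in the DHCP configuration, resolved by commenting their MAC entries out (https://gerrit.wikimedia.org/r/453061). One quick way to catch that class of collision ahead of time is to scan a dhcpd-style host-entries file for duplicate fixed-address values. The sketch below is illustrative only: the config path is a placeholder, and the regex pass is deliberately naive (it assumes flat `host { ... }` stanzas and skips comment lines).

```python
#!/usr/bin/env python3
"""Naive scan of an ISC dhcpd-style config for host stanzas that declare the
same fixed-address, the failure mode described above for dbproxy1015/1016.
The config path is a placeholder; point it at the real host-entries file."""

import re
import sys
from collections import defaultdict

CONFIG = sys.argv[1] if len(sys.argv) > 1 else "linux-host-entries.example"

# Matches a flat stanza:  host <name> { ... }
STANZA = re.compile(r"host\s+(?P<name>\S+)\s*\{(?P<body>[^}]*)\}", re.DOTALL)
# fixed-address may be an IP or a hostname, terminated by a semicolon.
FIXED = re.compile(r"fixed-address\s+(?P<addr>[^;\s]+)\s*;")


def main():
    by_addr = defaultdict(list)
    with open(CONFIG) as fh:
        # Drop commented-out lines so disabled entries are not counted.
        text = "".join(line for line in fh if not line.lstrip().startswith("#"))
    for stanza in STANZA.finditer(text):
        match = FIXED.search(stanza.group("body"))
        if match:
            by_addr[match.group("addr")].append(stanza.group("name"))
    dupes = {addr: hosts for addr, hosts in by_addr.items() if len(hosts) > 1}
    for addr, hosts in sorted(dupes.items()):
        print("DUPLICATE %s: %s" % (addr, ", ".join(hosts)))
    return 1 if dupes else 0


if __name__ == "__main__":
    sys.exit(main())
```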
[21:12:12] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [21:12:12] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:19:06] (03PS2) 10Krinkle: Remove use of deprecated StartProfiler.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453062 (https://phabricator.wikimedia.org/T201782) [21:19:12] (03PS2) 10Krinkle: Remove StartProfiler.php (no longer used) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453063 (https://phabricator.wikimedia.org/T201782) [21:19:35] * Krinkle staging on mwdebug1002/deploy1001 [21:19:51] (03CR) 10Krinkle: [C: 032] Remove use of deprecated StartProfiler.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453062 (https://phabricator.wikimedia.org/T201782) (owner: 10Krinkle) [21:21:10] (03Merged) 10jenkins-bot: Remove use of deprecated StartProfiler.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453062 (https://phabricator.wikimedia.org/T201782) (owner: 10Krinkle) [21:35:02] (03PS2) 10Andrew Bogott: designate: set reverse domain for eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/453060 [21:35:46] PROBLEM - Auth DNS on cloudservices1003 is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:36:22] PROBLEM - designate-central process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-central [21:36:24] (03CR) 10Andrew Bogott: [C: 032] designate: set reverse domain for eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/453060 (owner: 10Andrew Bogott) [21:36:52] PROBLEM - designate-sink process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-sink [21:37:11] PROBLEM - designate-pool-manager process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [21:37:36] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [21:37:36] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [21:37:36] PROBLEM - designate-api http on cloudservices1003 is CRITICAL: connect to address 208.80.154.135 and port 9001: Connection refused [21:38:58] ? [21:39:42] RECOVERY - designate-central process on cloudservices1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-central [21:39:43] RECOVERY - designate-api http on cloudservices1003 is OK: HTTP OK: HTTP/1.1 200 OK - 579 bytes in 0.004 second response time [21:40:10] those alerts are all me being clumsy… cloudservices1003 is still under construction [21:40:11] RECOVERY - designate-sink process on cloudservices1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-sink [21:40:12] RECOVERY - Auth DNS on cloudservices1003 is OK: DNS OK: 0.009 seconds response time. labs-ns2.wikimedia.org returns [21:40:22] RECOVERY - designate-pool-manager process on cloudservices1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [21:41:32] ack [21:41:56] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [21:41:56] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
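The recurring "Check systemd state ... degraded" and "cassandra-a service ... failed" alerts for aqs1007 throughout this log are Icinga relaying systemd's own view of the host: the system is "degraded" whenever one or more units are in the failed state. The equivalent check can be reproduced locally with systemctl; a minimal sketch, with the unit name taken from the alerts above and no claim about what the production check plugin actually runs.

```python
#!/usr/bin/env python3
"""Reproduce locally what the 'Check systemd state' / 'cassandra-a service'
alerts above report: overall systemd health plus any failed units.
The unit name comes from the alerts; adjust for other hosts."""

import subprocess

UNIT = "cassandra-a"


def run(*cmd):
    """Run a command, returning (exit_code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    # "running" when healthy, "degraded" when one or more units have failed.
    _, state = run("systemctl", "is-system-running")
    print("system state: %s" % (state or "unknown"))

    # Non-zero exit code means the unit is not active (e.g. "failed").
    rc, active = run("systemctl", "is-active", UNIT)
    print("%s: %s (rc=%d)" % (UNIT, active or "unknown", rc))

    # List whatever units systemd currently considers failed.
    _, failed = run("systemctl", "list-units", "--state=failed", "--no-legend")
    if failed:
        print("failed units:\n" + failed)
```

The roughly half-hourly OK/CRITICAL cycle visible in the log is consistent with the cassandra-a unit being restarted, running briefly, and failing again, rather than with a flapping check.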
[21:46:35] !log krinkle@deploy1001 Synchronized scap/plugins/: I173a02910c (duration: 00m 50s) [21:46:38] (03CR) 10jenkins-bot: Remove use of deprecated StartProfiler.php file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453062 (https://phabricator.wikimedia.org/T201782) (owner: 10Krinkle) [21:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:23] !log krinkle@deploy1001 Synchronized wmf-config/: I173a02910c (duration: 00m 50s) [21:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:23] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.10: rm StartProfiler.php - T201782 (duration: 03m 01s) [21:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:30] T201782: Remove use of StartProfiler.php in wmf production - https://phabricator.wikimedia.org/T201782 [22:04:00] RECOVERY - Check systemd state on cloudservices1003 is OK: OK - running: The system is fully operational [22:07:09] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [22:11:54] !log krinkle@deploy1001 Started scap: rm StartProfiler.php - T201782 [22:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:00] T201782: Remove use of StartProfiler.php in wmf production - https://phabricator.wikimedia.org/T201782 [22:12:09] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:19:39] RECOVERY - Check for gridmaster host resolution TCP on cloudservices1003 is OK: DNS OK - 0.019 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [22:19:50] RECOVERY - Check for gridmaster host resolution UDP on cloudservices1003 is OK: DNS OK - 0.010 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [22:33:30] !log update old eqiad recdns IPs in cr1/2-eqiad bgp group Anycast4 with dns1001/2 [22:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:09] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [22:37:50] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [22:41:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [22:42:09] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:43:40] !log krinkle@deploy1001 Finished scap: rm StartProfiler.php - T201782 (duration: 31m 46s) [22:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:48] T201782: Remove use of StartProfiler.php in wmf production - https://phabricator.wikimedia.org/T201782 [23:08:00] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [23:08:20] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [23:12:19] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:12:39] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:37:50] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational [23:38:20] RECOVERY - cassandra-a service on aqs1007 is OK: OK - cassandra-a is active [23:41:29] PROBLEM - cassandra-a service on aqs1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:42:09] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.