[00:11:09] (03CR) 10Dzahn: [C: 03+1] "looks good, the only nitpick i see is he is staff but not using staff email. but i don't think that's a requirement. it does match LDAP. g" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [01:07:01] (03PS1) 10Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) [01:10:49] (03PS2) 10Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) [01:12:10] (03CR) 10Jeena Huneidi: "It looks like we will need a dev image published for now because the production one doesn't include sqlite" [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: 10Jeena Huneidi) [02:21:16] (03PS1) 10DannyS712: Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212) [02:21:52] (03PS2) 10DannyS712: Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212) [02:31:52] (03PS7) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [02:37:41] (03PS8) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [02:56:28] (03PS9) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [02:58:21] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY 
CRITICAL: DB template1 (host:localhost) 18981384 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:00:55] (03PS10) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [03:01:35] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 9288 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:08:39] (03PS11) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) [03:26:45] (03CR) 10Vgutierrez: [C: 03+2] ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [03:51:18] !log depool cp5007 - T234887 [03:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:23] T234887: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 [04:36:42] !log Fixed a page title via namespaceDupes.php on pswiki [04:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9441 and previous config saved to /var/cache/conftool/dbconfig/20191023-044833-marostegui.json [04:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:19] !log repool cp5007 - T234887 [04:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:23] T234887: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 [04:57:23] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9442 and previous config saved to /var/cache/conftool/dbconfig/20191023-045722-marostegui.json [04:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:06] 10Operations, 10Traffic, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) I've depooled cp5007 to conduct some experiments, I've captured the varnish-fe traffic with the following tcpdu... [05:08:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9443 and previous config saved to /var/cache/conftool/dbconfig/20191023-050812-marostegui.json [05:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:25] (03PS1) 10Marostegui: mariadb: Remove puppet references for dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/545429 (https://phabricator.wikimedia.org/T220002) [05:10:49] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for dbstore2002 [dns] - 10https://gerrit.wikimedia.org/r/545430 (https://phabricator.wikimedia.org/T220002) [05:29:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9444 and previous config saved to /var/cache/conftool/dbconfig/20191023-052940-marostegui.json [05:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:52] !log ema@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kibana,name=codfw [05:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:46] (03PS2) 10Ema: kibana: add discovery record [dns] - 
10https://gerrit.wikimedia.org/r/545287 (https://phabricator.wikimedia.org/T227432) [05:31:31] (03CR) 10Ema: [C: 03+2] kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545287 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [05:49:33] (03Abandoned) 10Giuseppe Lavagetto: puppet: disable hiera autolookup [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto) [05:50:16] (03Abandoned) 10Giuseppe Lavagetto: environments: add environment for removing hiera autolookups [puppet] - 10https://gerrit.wikimedia.org/r/395545 (https://phabricator.wikimedia.org/T181971) (owner: 10Giuseppe Lavagetto) [05:50:24] (03Abandoned) 10Giuseppe Lavagetto: profile::mediawiki::nutcracker: explicitly set log verbosity [puppet] - 10https://gerrit.wikimedia.org/r/395717 (owner: 10Giuseppe Lavagetto) [05:50:36] (03Abandoned) 10Giuseppe Lavagetto: standard: assume standard profile structure [puppet] - 10https://gerrit.wikimedia.org/r/395546 (https://phabricator.wikimedia.org/T181971) (owner: 10Giuseppe Lavagetto) [05:51:13] (03Abandoned) 10Giuseppe Lavagetto: mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 (owner: 10Giuseppe Lavagetto) [05:51:51] (03Abandoned) 10Giuseppe Lavagetto: service::node: restrict readability of configurations. 
[puppet] - 10https://gerrit.wikimedia.org/r/309522 (owner: 10Giuseppe Lavagetto) [05:52:22] (03Abandoned) 10Giuseppe Lavagetto: hiera: first step of simplification [puppet] - 10https://gerrit.wikimedia.org/r/402347 (owner: 10Giuseppe Lavagetto) [05:52:50] (03Abandoned) 10Giuseppe Lavagetto: Create flake8 rules that make sense in our context [debs/pybal] - 10https://gerrit.wikimedia.org/r/355784 (owner: 10Giuseppe Lavagetto) [05:53:11] (03Abandoned) 10Giuseppe Lavagetto: site.pp: merge videoscalers into the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/437776 (owner: 10Giuseppe Lavagetto) [05:53:41] (03Abandoned) 10Giuseppe Lavagetto: jobrunner/videoscaler: factor out "base" roles to use in beta [puppet] - 10https://gerrit.wikimedia.org/r/437406 (owner: 10Giuseppe Lavagetto) [06:38:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3315 for compression T235599', diff saved to https://phabricator.wikimedia.org/P9445 and previous config saved to /var/cache/conftool/dbconfig/20191023-063800-marostegui.json [06:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:06] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 [06:38:31] !log Compress tables on db1097:3315 T235599 [06:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:47:07] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove puppet references for dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/545429 (https://phabricator.wikimedia.org/T220002) (owner: 10Marostegui) [06:47:40] (03PS2) 10Marostegui: wmnet: Remove production DNS 
entries for dbstore2002 [dns] - 10https://gerrit.wikimedia.org/r/545430 (https://phabricator.wikimedia.org/T220002) [06:48:08] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for dbstore2002 [dns] - 10https://gerrit.wikimedia.org/r/545430 (https://phabricator.wikimedia.org/T220002) (owner: 10Marostegui) [06:50:23] 10Operations, 10Discovery-Search, 10vm-requests: setup/install airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) a:05RobH→03None [06:50:25] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore200.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Marostegui) a:05RobH→03Papaul [06:50:35] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore200.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Marostegui) These two hosts are ready for switch disablement and on-site steps [06:51:44] 10Operations, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Marostegui) [06:52:17] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Marostegui) [06:53:11] (03PS1) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) [06:54:01] (03CR) 10Ayounsi: [C: 03+2] Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [06:54:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [06:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [06:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:19] 10Operations, 
10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `dbstore1001.eqiad.wmnet` - dbstore1001.eqiad.wmnet (**PASS**) - Downtimed host on Ic... [06:54:39] (03PS1) 10Marostegui: mariadb: Remove puppet references from dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/545441 (https://phabricator.wikimedia.org/T236227) [06:55:12] (03PS1) 10Marostegui: wmnet: Remove production DNS for dbstore1001 [dns] - 10https://gerrit.wikimedia.org/r/545442 (https://phabricator.wikimedia.org/T236227) [06:55:13] PROBLEM - PHP7 rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1311 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:55:23] ^ looking [06:55:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Remove puppet references from dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/545441 (https://phabricator.wikimedia.org/T236227) (owner: 10Marostegui) [06:55:50] (03PS2) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) [06:56:09] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS for dbstore1001 [dns] - 10https://gerrit.wikimedia.org/r/545442 (https://phabricator.wikimedia.org/T236227) (owner: 10Marostegui) [06:57:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Marostegui) a:03Jclark-ctr [06:57:41] (03CR) 10Ayounsi: [C: 03+2] Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [06:57:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet 
- https://phabricator.wikimedia.org/T236227 (10Marostegui) Host ready for #dc-ops steps [06:57:49] (03PS3) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) [06:57:55] !log Depooling mw1317 [06:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:12] !log depool esams - T235805 [06:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:16] T235805: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 [07:02:36] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime [07:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:52] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10ops-monitoring-bot) Icinga downtime for 5:00:00 set by ayounsi@cumin1001 on 28 host(s) and their services with reason: Onsite work ` bast3002.wikimedia.org,cp[3007-30... [07:04:02] ACKNOWLEDGEMENT - PHP7 rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.064 second response time Effie Mouzeli Host has been depooled, checking https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:04:47] !log Enable slow query log 1/10 on db1089 (enwiki) T223151 [07:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:51] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [07:05:37] !log redirect ns2 to eqiad - T235805 [07:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:41] T235805: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 [07:09:51] (03CR) 10Muehlenhoff: puppetdb: enable multiple service urls and command_broadcast (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [07:10:37] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 45.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:12:02] (03PS1) 10Ayounsi: Add mgmt IPs for esams scs and asw2 [dns] - 10https://gerrit.wikimedia.org/r/545444 (https://phabricator.wikimedia.org/T235805) [07:16:00] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10jijiki) `ps1-a6-eqiad` is shown as down in icinga, I believe that is expected? 
[07:27:58] (03CR) 10Muehlenhoff: CI rspec: update puppet version used in spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [07:28:24] !log logstash: refreshing index fields for logstash-* indices (via https://logstash.wikimedia.org/app/kibana#/management/kibana/indices/logstash-* ) # T234564 [07:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:29] T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array - https://phabricator.wikimedia.org/T234564 [07:30:41] !log powering down cr2-esams for relocation [07:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s6 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9446 and previous config saved to /var/cache/conftool/dbconfig/20191023-073556-marostegui.json [07:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:01] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [07:37:04] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [07:38:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s6 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9447 and previous config saved to /var/cache/conftool/dbconfig/20191023-073831-marostegui.json [07:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:23] (03PS1) 10Ema: ATS: use TLS and DNS discovery to connect to kibana [puppet] - 10https://gerrit.wikimedia.org/r/545445 (https://phabricator.wikimedia.org/T210411) [07:39:39] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, 
analytics intern - https://phabricator.wikimedia.org/T235688 (10MoritzMuehlenhoff) 05Resolved→03Open Reopening, currently the same key is used in Cloud VPS and production, which is a security risk. [07:46:50] !log powering down cr2-esams for relocation (for real this time) [07:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s7 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9448 and previous config saved to /var/cache/conftool/dbconfig/20191023-074828-marostegui.json [07:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:34] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [07:49:20] 04Critical Alert for device cr2-esams.wikimedia.org - Emergency syslog message [07:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s7 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9449 and previous config saved to /var/cache/conftool/dbconfig/20191023-075106-marostegui.json [07:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:49] (03PS1) 10Ema: Add graphite.discovery.wmnet pointing to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/545470 (https://phabricator.wikimedia.org/T210411) [07:54:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Emergency syslog message [07:55:08] !log kafka-logging delete unused topic syslog-notice [07:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:41] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:55:53] XioNoX: this is expected, right? 
[07:56:06] yep [07:56:13] alright, thanks [07:56:24] esams is depooled [07:56:36] everything amsterdam related is expected at this point :) [07:56:52] yeah, I've been following the traffic rampup on the eqiad LVSs :) [07:56:53] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:57:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:03:34] (03PS3) 10Muehlenhoff: Extend wmf-userschema for additional MFA options [puppet] - 10https://gerrit.wikimedia.org/r/543402 [08:04:41] (03PS1) 10Ema: graphite: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545494 (https://phabricator.wikimedia.org/T210411) [08:06:43] (03PS1) 10Ema: secret: dummy key for graphite [labs/private] - 10https://gerrit.wikimedia.org/r/545495 (https://phabricator.wikimedia.org/T210411) [08:07:18] 04Critical Alert for device cr2-esams.wikimedia.org - Juniper alarm active [08:09:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s8 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9450 and previous config saved to /var/cache/conftool/dbconfig/20191023-080857-marostegui.json [08:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:04] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 [08:10:42] (03PS1) 10Ema: graphite: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411) [08:10:50] (03CR) 10Jcrespo: [C: 03+2] dbmonitor: Deploy git repo as mwdeploy, otherwise no write permission [puppet] - 10https://gerrit.wikimedia.org/r/545282 (https://phabricator.wikimedia.org/T224589) (owner: 
10Jcrespo) [08:11:01] (03PS2) 10Jcrespo: dbmonitor: Deploy git repo as mwdeploy, otherwise no write permission [puppet] - 10https://gerrit.wikimedia.org/r/545282 (https://phabricator.wikimedia.org/T224589) [08:11:39] !log kafka-logging eqiad set 12 partitions for ^mwlog- ^logback- and eqiad.client.error topics [08:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:27] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for graphite [labs/private] - 10https://gerrit.wikimedia.org/r/545495 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:14:47] (03CR) 10Ema: "pcc looks fine: https://puppet-compiler.wmflabs.org/compiler1002/19010/" [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:19:49] (03PS1) 10Vgutierrez: ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) [08:20:46] (03Abandoned) 10Jbond: puppet-merge: switch to GitPython [puppet] - 10https://gerrit.wikimedia.org/r/544922 (owner: 10Jbond) [08:21:01] (03PS2) 10Jcrespo: dbmonitor: Install the right apache modules for buster [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) [08:22:32] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/19011/" [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [08:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s8 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9451 and previous config saved to /var/cache/conftool/dbconfig/20191023-082246-marostegui.json [08:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:53] T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018 
[08:23:01] !log roll restart logstash in codfw/eqiad to pick up new kafka partitions [08:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:25] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/19012/" [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [08:23:53] (03CR) 10Jcrespo: [C: 03+2] dbmonitor: Install the right apache modules for buster [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [08:24:51] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [08:26:20] (03PS4) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) [08:28:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091:3312 after table compression', diff saved to https://phabricator.wikimedia.org/P9452 and previous config saved to /var/cache/conftool/dbconfig/20191023-082826-marostegui.json [08:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:29] (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [08:31:35] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) 🤔 ` Notice: /Stage[main]/Httpd/Httpd::Conf[defaults]/File[/etc/apache2/conf-enabled/00-defaults.conf]/ensure: created Info: /Stage[main]/Httpd/Httpd::Conf[defaults]/File[/etc/apac... 
[08:31:41] (03CR) 10Gehel: [C: 04-1] "minor comment inline, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [08:32:44] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [08:34:50] (03PS2) 10Ema: graphite: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411) [08:34:53] (03PS1) 10Ema: prometheus: aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500 [08:39:39] grr [08:40:10] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) I think I know what happened: Initially the Puppet code was wrong and libapache-mod-php didn't get installed (which needs mpm_prefork). But "apache" still got installed... [08:42:26] !log roll restart rsyslog on cirrus and wqds hosts to pick up changes to logback topic partitions [08:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:29] (03PS2) 10Vgutierrez: ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) [08:45:57] (03CR) 10Ema: [C: 03+1] ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [08:46:40] (03CR) 10Vgutierrez: [C: 03+2] ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez) [08:46:46] (03CR) 10Gehel: [C: 04-1] "almost good!" 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [08:49:06] (03CR) 10Jbond: puppetdb: enable multiple service urls and command_broadcast (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [08:50:47] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [08:50:48] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:16] (03CR) 10Ema: [C: 03+2] ATS: use TLS and DNS discovery to connect to kibana [puppet] - 10https://gerrit.wikimedia.org/r/545445 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:53:32] (03PS2) 10Ema: ATS: use TLS and DNS discovery to connect to kibana [puppet] - 10https://gerrit.wikimedia.org/r/545445 (https://phabricator.wikimedia.org/T210411) [08:54:07] !log installing systemd bugfix update on mw canaries [08:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:25] RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 77259 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:56:11] PROBLEM - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:50] (03PS9) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [08:59:19] (03CR) 10Ema: [C: 03+2] graphite: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545494 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [08:59:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:59:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:26] !log rebooting logstash2021 for some firmware tests [09:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:39] (03CR) 10Ema: [C: 03+2] graphite: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:01:56] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) That was more or less what I tried before, but it installs event version rather than prefork. Just to be sure, I tried your exact purges again, and I got the same error: ` Notic... [09:02:29] (03CR) 10Ema: [C: 03+2] Add graphite.discovery.wmnet pointing to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/545470 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:02:31] (03CR) 10Volans: wdqs: add data-reload cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [09:04:34] 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) I ran manually `a2dismod mpm_event` and now it worked. 
I will check if this happens again on a clean install of dbmonitor1001 and add code to handle it. [09:05:25] (03PS1) 10Ema: ATS: use TLS and DNS discovery to connect to graphite [puppet] - 10https://gerrit.wikimedia.org/r/545504 (https://phabricator.wikimedia.org/T210411) [09:06:39] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [09:07:11] PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 212 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [09:07:56] jynus: ^^ [09:09:03] yes, sorry, it got enabled then the ack was gone [09:09:12] you can ignore it, I will downtime it again, sorry [09:19:59] PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org) [09:21:37] arturo: ^ [09:21:58] known, thanks marostegui [09:22:07] thanks :) [09:23:07] RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [09:24:43] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) Now we "only" need to fix the php, which I would prefer not to, not because it would be difficult, but because it would be a waste of time, and I would prefer to create a simple flash + d3 microsite, sp... 
[09:26:29] (03CR) 10Hashar: [C: 03+1] "And due to -XX:G1NewSizePercent=15 , the Eden space would grow from 3G to 4,8G which is probably fine :]" [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [09:27:36] (03CR) 10Jbond: "> Patch Set 9: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [09:27:39] (03PS5) 10Vgutierrez: ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) [09:27:59] PROBLEM - mediawiki-installation DSH group on mw1317 is CRITICAL: Host mw1317 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:28:14] ^ that is ok [09:28:16] I will ack [09:28:38] jouncebot next [09:28:38] In 1 hour(s) and 31 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1100) [09:28:52] (03CR) 10Jbond: [C: 03+2] puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [09:28:58] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) p:05Normal→03Low Adding the few project tags we are using nowadays. Lowering priority since clearly w... 
[09:30:00] (03PS10) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) [09:31:01] (03CR) 10Vgutierrez: [C: 03+2] ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [09:31:41] !log bump rsyslog-notice topic to 6 partitions [09:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:37] RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2391 bytes in 1.029 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org [09:36:09] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto) [09:36:29] (03CR) 10Alexandros Kosiaris: "Just for fun, let's do one fleet PCC run. How bad could it be? :P" [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto) [09:40:42] (03PS1) 10Jcrespo: mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) [09:40:42] !log roll restart logstash to pick up new rsyslog-notice partitions [09:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:25] !log bounce burrow-logging-eqiad.service on kafkamon1001 [09:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:00] (03PS2) 10Jcrespo: mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) [09:46:13] (03CR) 10Ema: [C: 03+2] ATS: use TLS and DNS discovery to connect to graphite [puppet] - 10https://gerrit.wikimedia.org/r/545504 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [09:46:49] 
10Operations, 10Traffic, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10Vgutierrez) >>! In T234803#5583888, @BBlack wrote: > Notes from IRC, etc: > > The current patch (merging shortly: https://gerrit.wikimedi... [09:49:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [09:51:41] (03CR) 10Volans: "Some comments inline, mostly minor or suggestions" (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [09:52:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto) [09:53:47] (03PS1) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/545507 (https://phabricator.wikimedia.org/T235655) [09:54:36] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Gilles) @elitre it should be its own task, since it's a PDF failing to render and thi... 
[09:54:41] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545507 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [09:55:14] (03PS4) 10Muehlenhoff: Extend wmf-userschema for additional MFA options [puppet] - 10https://gerrit.wikimedia.org/r/543402 [09:56:16] (03PS3) 10Jcrespo: mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) [09:57:06] (03CR) 10Muehlenhoff: [C: 03+2] Extend wmf-userschema for additional MFA options [puppet] - 10https://gerrit.wikimedia.org/r/543402 (owner: 10Muehlenhoff) [09:58:30] (03CR) 10Marostegui: [C: 03+1] "wow!" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [10:02:28] (03PS1) 10Ema: ATS: pass --enable-reload to tslua as the last argument [puppet] - 10https://gerrit.wikimedia.org/r/545508 [10:02:52] (03CR) 10Alexandros Kosiaris: "> and this would de facto make the install candidate the most recent version in the series at any given time." 
[puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris) [10:03:17] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [10:04:52] !log cp1075: ats-backend-restart to test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545508/ [10:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:39] (03CR) 10Jcrespo: [C: 03+2] mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [10:09:57] (03CR) 10Jbond: [C: 03+2] puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/545507 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond) [10:10:07] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo) [10:11:03] !log deploying new version of dbtree T224589 [10:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:07] T224589: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 [10:11:44] (03CR) 10Vgutierrez: [C: 03+1] ATS: pass --enable-reload to tslua as the last argument [puppet] - 10https://gerrit.wikimedia.org/r/545508 (owner: 10Ema) [10:11:54] (03CR) 10Ema: [C: 03+2] ATS: pass --enable-reload to tslua as the last argument [puppet] - 10https://gerrit.wikimedia.org/r/545508 (owner: 10Ema) [10:13:47] !log reverting dbtree revision to HEAD~1 T224589 [10:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:28] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / 
External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yair_rand) >>! In T211881#5509001, @dr0ptp4kt wrote: > Hello all. We're going to turn this into a client-s... [10:19:55] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) `lines=10 [Wed Oct 23 10:17:48.055752 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning: mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33 [W... [10:21:33] (03PS1) 10Muehlenhoff: Also use wmf-user LDAP schema on "labs" LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/545512 [10:21:49] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Elitre) oK,will file separately then, TY, [10:24:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [10:25:03] ACKNOWLEDGEMENT - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
Effie Mouzeli Server was misbehaving, TBD what well do https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:03] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1317 is CRITICAL: Host mw1317 is not in mediawiki-installation dsh group Effie Mouzeli Server was misbehaving, TBD what well do https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [10:25:26] 10Operations, 10Gerrit: Editing in Gerrit isn't saved after the update/migration to gerrit1001 - https://phabricator.wikimedia.org/T236143 (10MoritzMuehlenhoff) 05Open→03Invalid I can't reproduce this any longer, maybe it got resolved with the subsequent Gerrit restart to bump the Java size or similar. Clo... [10:29:36] (03Abandoned) 10Jbond: ulogd: filter out etcd broadcast messages [puppet] - 10https://gerrit.wikimedia.org/r/543149 (owner: 10Jbond) [10:30:30] (03PS2) 10Muehlenhoff: Also use wmf-user LDAP schema on "labs" LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/545512 [10:30:34] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/19015/" [puppet] - 10https://gerrit.wikimedia.org/r/545512 (owner: 10Muehlenhoff) [10:33:23] (03PS1) 10Ema: ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274) [10:33:59] (03CR) 10Vgutierrez: [C: 03+1] ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [10:34:12] (03PS2) 10Ema: ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274) [10:34:35] 10Operations, 10Traffic, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) 05Resolved→03Open [10:35:27] (03CR) 10Ema: [C: 03+2] ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 
(https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [10:35:33] 10Operations, 10Traffic, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug on ATS. after a config reload the lua script loses the argtb [10:39:57] (03PS1) 10Giuseppe Lavagetto: lvs::configuration: do not assume port 80 by default [puppet] - 10https://gerrit.wikimedia.org/r/545525 [10:39:59] (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: fix services file [puppet] - 10https://gerrit.wikimedia.org/r/545526 [10:42:38] (03CR) 10jerkins-bot: [V: 04-1] profile::discovery::client: fix services file [puppet] - 10https://gerrit.wikimedia.org/r/545526 (owner: 10Giuseppe Lavagetto) [10:44:09] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Elitre) [10:46:04] !log cp-ats: rolling ATS backend restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545522/ T233274 [10:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:08] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [10:46:18] 10Operations, 10DBA, 10Traffic, 10WMF-Legal, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (10jcrespo) [10:46:59] 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) So normally the fix for the above would be trivial, but the design decisions of making sql class a singleton are in my opinion not worthy fixing, because it would force to either a deeper refactoring o... 
[10:49:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/545512 (owner: 10Muehlenhoff) [10:51:02] 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10Volans) There are also local modifications in the private repo fwiw. [10:51:46] 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10Volans) p:05Triage→03High [10:51:52] (03PS1) 10Jcrespo: Revert "mysql: Migrate away from long-deprecated mysql module to mysqli" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545531 [10:52:08] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "mysql: Migrate away from long-deprecated mysql module to mysqli" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545531 (owner: 10Jcrespo) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:05] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: don't use haproxy base module [puppet] - 10https://gerrit.wikimedia.org/r/545532 (https://phabricator.wikimedia.org/T236074) [11:01:01] I'll deploy a patch [11:01:12] (03CR) 10Urbanecm: [C: 03+2] Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212) (owner: 10DannyS712) [11:02:03] (03Merged) 10jenkins-bot: Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212) (owner: 10DannyS712) [11:02:53] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:03:25] Urbanecm: what about https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/544882/ ? [11:03:45] never did InterwikiSortOrders patches fwiw so I'm not able to CR [11:04:05] hauskater: can do that as well :) [11:04:30] (03PS3) 10Urbanecm: Add custom Minerva wordmark for Hebrew wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: 10Ammarpad) [11:04:38] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: 10Ammarpad) [11:05:43] (03Merged) 10jenkins-bot: Add custom Minerva wordmark for Hebrew wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: 10Ammarpad) [11:06:00] (03PS1) 10Giuseppe Lavagetto: logstash: support both mediawiki and parsoid-php types [puppet] - 10https://gerrit.wikimedia.org/r/545534 [11:06:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: cf8e2f1: Set $wgArticleCountMethod to any for frwikiquote (T236212) (duration: 01m 12s) [11:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:17] T236212: Set $wgArticleCountMethod to 'any' for frwikiquote - https://phabricator.wikimedia.org/T236212 [11:06:53] git fetch [11:07:01] (03CR) 10Urbanecm: [C: 03+2] Add Balinese to interwiki sort orders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544882 (https://phabricator.wikimedia.org/T234768) (owner: 10Jon Harald Søby) [11:07:25] wrong window :) [11:07:47] (03Merged) 10jenkins-bot: Add Balinese to interwiki sort orders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544882 (https://phabricator.wikimedia.org/T234768) (owner: 10Jon Harald Søby) [11:09:47] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright: 
SWAT: 0889da0: Add custom Minerva wordmark for Hebrew wikivoyage (1/2; T234278) (duration: 01m 01s) [11:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:52] T234278: Add localized Wikivoyage wordmark to the Hebrew mobile frontend - https://phabricator.wikimedia.org/T234278 [11:11:05] hauskater: thanks, fixed :D [11:12:23] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:13:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:14:07] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:14:09] PROBLEM - Host 91.198.174.122 is DOWN: CRITICAL - Time to live exceeded (91.198.174.122) [11:15:03] RECOVERY - Host 91.198.174.122 is UP: PING WARNING - Packet loss = 64%, RTA = 81.78 ms [11:15:25] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:15:40] ACKNOWLEDGEMENT - MD RAID on maerlant is CRITICAL: connect to address 91.198.174.122 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T236244 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:15:44] 10Operations, 10ops-esams: Degraded RAID on maerlant - 
https://phabricator.wikimedia.org/T236244 (10ops-monitoring-bot) [11:15:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:16:02] PROBLEM - Host 91.198.174.122 is DOWN: CRITICAL - Time to live exceeded (91.198.174.122) [11:16:02] PROBLEM - Host 91.198.174.106 is DOWN: CRITICAL - Time to live exceeded (91.198.174.106) [11:16:16] Not sure if that's related, but I was just kicked from the deployment host when connected via bast3002 [11:16:34] RECOVERY - Host 91.198.174.122 is UP: PING OK - Packet loss = 0%, RTA = 83.45 ms [11:16:34] when connecting via bast1002, everything works correctly [11:16:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0889da0: Add custom Minerva wordmark for Hebrew wikivoyage (2/2; T234278) (duration: 01m 01s) [11:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:40] T234278: Add localized Wikivoyage wordmark to the Hebrew mobile frontend - https://phabricator.wikimedia.org/T234278 [11:16:43] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: don't use haproxy base module [puppet] - 10https://gerrit.wikimedia.org/r/545532 (https://phabricator.wikimedia.org/T236074) [11:17:11] Urbanecm: I've got a report that there's a trouble restoring a file on commons [11:17:12] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:17:20] RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 83.63 ms [11:17:31] "Error restaurando archivo: El archivo «mwstore://local-multiwrite/local-public/8/89/Premios_L´Oreal_(012).jpg» se encuentra en un estado incoherente dentro de los sistemas de almacenamiento interno" [11:18:02] Error restoring 
file. The file <<>> is in an incoherent state in our internal storage system [11:18:07] that's interesting [11:18:08] (rough translation) [11:18:10] (03PS2) 10Ema: prometheus: fix aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500 [11:18:29] anything in Logstash? Maybe we can port the error via Phatality [11:18:53] !log mwscript updateArticleCount.php --wiki=frwikiquote --update (T236212) [11:18:55] hauskater: looking [11:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:57] T236212: Set $wgArticleCountMethod to 'any' for frwikiquote - https://phabricator.wikimedia.org/T236212 [11:18:58] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: fix aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500 (owner: 10Ema) [11:19:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: haproxy: don't use haproxy base module [puppet] - 10https://gerrit.wikimedia.org/r/545532 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez) [11:19:46] hauskater: for which timeframe should I look? [11:19:55] last hour [11:20:01] file name would help? [11:20:12] https://commons.wikimedia.org/wiki/User_talk:MarcoAurelio#Error_al_restaurar [11:20:18] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:20:18] thx hauskater [11:20:34] File:Premios_L´Oreal_(012).jpg [11:20:37] that's the file name [11:20:38] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 83.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:20:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:21:44] thx hauskater [11:22:26] !log urbanecm@deploy1001 Synchronized wmf-config/InterwikiSortOrders.php: SWAT: e21054e: Add Balinese to interwiki sort orders (T234768) (duration: 01m 01s) [11:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:41] T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768 [11:23:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545534 (owner: 10Giuseppe Lavagetto) [11:24:13] !log EU SWAT done [11:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:03] !log powering down cr1-esams [11:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:55] hauskater: still no match [11:27:06] I'll file a task [11:28:19] thanks [11:29:58] Done [11:30:15] Urbanecm: maybe I can try and you can take a look at the logs to see if anything pops up, to help you? 
[11:30:39] hauskater: that would be a solution [11:30:44] ok [11:31:38] error [11:31:40] Error undeleting file: The file "mwstore://local-multiwrite/local-public/8/89/Premios_L´Oreal_(012).jpg" is in an inconsistent state within the internal storage backends [11:34:03] Urbanecm: ^ & T236246 [11:34:05] T236246: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 [11:34:08] ack [11:35:58] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 (10MarcoAurelio) [11:41:19] hauskater: it says `FileBackendMultiWrite::doOperationsInternal: failed sync check: ["mwstore://local-multiwrite/local-deleted/g/a/y/gayjvm1vy8agetj61ckj0suucb097z3.jpg","mwstore://local-multiwrite/local-public/8/89/Premios_L\u00b4Oreal_(012).jpg"] [11:41:46] mwlog1001? [11:42:15] sound pretty similar to the error I got via the UI [11:43:59] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 (10Urbanecm) Logstash: https://logstash.wikimedia.org/goto/fb4822e96e27da8bfc3bb8273f4e6132 ` 2019-10-2... [11:44:01] logstash [11:44:02] https://logstash.wikimedia.org/goto/fb4822e96e27da8bfc3bb8273f4e6132 [11:51:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: tell the service to load all config files [puppet] - 10https://gerrit.wikimedia.org/r/545541 (https://phabricator.wikimedia.org/T236074) [11:51:18] thanks Urbanecm [11:51:25] yw hauskater [11:51:30] hopefully someone will be able to take a look and unbreak [11:51:51] but looking at the workboard... makes me pesimistic ;) [11:53:01] Curiously I can see the deleted file via the special:undelete UI [11:53:09] so... 
it's not like it's corrupted or missing [11:53:35] https://www.irccloud.com/pastebin/Pgr0H0Uk/ [11:53:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: haproxy: tell the service to load all config files [puppet] - 10https://gerrit.wikimedia.org/r/545541 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez) [11:53:58] I'd say this code shows the error we're seeing [11:54:09] in includes\libs\filebackend\FileBackendMultiWrite.php [11:54:39] Sounds like it might be [11:54:45] * hauskater lunch [11:54:48] k [11:57:34] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [11:58:10] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 (10Ezarate) Thank you Marco and the other volunteers, the OTRS ticket 2019101810005971 is licensing the fil...
[12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1200) [12:02:22] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:02:30] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:02:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:02:40] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:02:54] all of those are known an no big deal ^ the downtime expire [12:02:54] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:02:55] d [12:03:03] 10Operations, 10Wikimedia-General-or-Unknown: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Reedy) [12:03:08] PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:03:34] PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams+prometheus/ops [12:03:43] ACKNOWLEDGEMENT - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[12:03:43] ACKNOWLEDGEMENT - OSPF status on cr1-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:03:43] ACKNOWLEDGEMENT - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:03:43] ACKNOWLEDGEMENT - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:03:43] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:03:43] ACKNOWLEDGEMENT - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:08:41] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [12:09:37] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [12:10:30] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [12:19:22] (03CR) 10Alexandros Kosiaris: "INFO: [change] Nodes: 14 FAIL 366 ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto) [12:19:54] (03PS1) 10Ayounsi: Rename cr1-esams to cr3-esams [dns] - 10https://gerrit.wikimedia.org/r/545544 (https://phabricator.wikimedia.org/T235805) [12:26:38] RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 0, down: 2, 
shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:26:48] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [12:27:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1130 from the special slaves group on s5 and leave it back with its original pooling options T223151', diff saved to https://phabricator.wikimedia.org/P9454 and previous config saved to /var/cache/conftool/dbconfig/20191023-122708-marostegui.json [12:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:14] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [12:31:22] !log restarting ats-tls on cache text nodes - T233274 [12:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:26] T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 [12:32:32] RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams+prometheus/ops [12:33:23] (03PS1) 10Ayounsi: Rename cr1-esams to cr3-esams (same IP, new box) [puppet] - 10https://gerrit.wikimedia.org/r/545546 (https://phabricator.wikimedia.org/T235805) [12:34:08] (03CR) 10Ayounsi: [C: 03+2] Rename cr1-esams to cr3-esams [dns] - 10https://gerrit.wikimedia.org/r/545544 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [12:34:37] (03CR) 10Ayounsi: [C: 03+2] Add mgmt IPs for esams scs and asw2 [dns] - 10https://gerrit.wikimedia.org/r/545444 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [12:34:51] (03PS2) 10Ayounsi: Add mgmt IPs for esams scs and asw2 [dns] - 10https://gerrit.wikimedia.org/r/545444 (https://phabricator.wikimedia.org/T235805) [12:36:36] PROBLEM - 
HTTPS Unified RSA on cp1075 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [12:36:56] (03PS2) 10Ayounsi: Rename cr1-esams to cr3-esams [dns] - 10https://gerrit.wikimedia.org/r/545544 (https://phabricator.wikimedia.org/T235805) [12:37:26] !log Depool mwdebug1002 - T214734 [12:37:28] (03CR) 10Gehel: [C: 04-1] "Adding context to some of the comments" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [12:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:31] T214734: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 [12:37:32] hmmm checking cp1075 [12:38:10] (03CR) 10Ayounsi: [C: 03+2] Rename cr1-esams to cr3-esams (same IP, new box) [puppet] - 10https://gerrit.wikimedia.org/r/545546 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi) [12:39:11] oh right.. 
1 failure being on 3/3 warnings and got reported [12:39:23] cause I've restarted trafficserver-tls on that node [12:39:39] bad timing :) [12:44:28] PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100% [12:45:36] (03PS1) 10Ottomata: Include hadoop client packages and config on dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) [12:46:03] 10Operations, 10serviceops: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10jijiki) [12:46:55] (03CR) 10Ottomata: Include hadoop client packages and config on dumps distribution servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [12:49:43] (03PS3) 10Jbond: apereo_cas: add ability to use groovy script to determine MFA [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937) [12:50:40] (03CR) 10Jbond: "updated with new LDAP parameters" [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond) [12:51:58] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:54:00] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:00:04] liw and brennen: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1300). 
[13:01:29] (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545557 [13:01:31] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545557 (owner: 10Lars Wirzenius) [13:01:59] (03PS1) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 [13:02:26] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545557 (owner: 10Lars Wirzenius) [13:02:33] (03CR) 10jerkins-bot: [V: 04-1] systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (owner: 10Effie Mouzeli) [13:02:52] RECOVERY - Juniper alarms on cr2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:04:24] !log ssastry@deploy1001 Started deploy [parsoid/deploy@451db1e]: Updating Parsoid to 5521ea74; Dummy Parsoid deploy to debug Parsoid/PHP deployment issues [13:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:46] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:05:59] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.3 [13:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:00] !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.3 (duration: 01m 00s) [13:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:08] !log ssastry@deploy1001 Finished deploy [parsoid/deploy@451db1e]: Updating Parsoid to 5521ea74; Dummy Parsoid deploy to debug Parsoid/PHP deployment issues (duration: 08m 44s) [13:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:08] (03CR) 10Ema: [C: 03+2] prometheus: fix 
aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500 (owner: 10Ema) [13:28:02] RECOVERY - Confd template for /var/lib/gdnsd/discovery-kibana.state on multatuli is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [13:34:01] !log disable puppet on mwdebug1002 - T214734 [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:05] T214734: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734 [13:35:03] !log migrate esams mgmt to new mgmt router [13:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:51] liw: I am going to restart the CI Jenkins soonish [13:37:58] that might interfere with the train [13:38:36] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:38:56] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [13:39:28] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:40:08] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:41:42] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:41:58] PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:19] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [13:46:02] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) [13:46:05] hashar, I've deployed to group1, not currently deploying [13:46:07] (03PS2) 10BBlack: geodns: eqiad non-primary for all public users [dns] - 
10https://gerrit.wikimedia.org/r/545385 (https://phabricator.wikimedia.org/T235805) [13:46:13] liw: great :) [13:46:19] (03CR) 10Jcrespo: "+1 for the mysql module process, I have yet to have a look at the bacula one, which Alex should also weigh in." [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [13:48:57] (03PS1) 10Filippo Giunchedi: DNM: adjust logstash index template for ES 7 [puppet] - 10https://gerrit.wikimedia.org/r/545566 (https://phabricator.wikimedia.org/T235891) [13:57:03] (03PS2) 10Andrew Bogott: m5 grants: remove grants for 'labtestwiki' database [puppet] - 10https://gerrit.wikimedia.org/r/543955 (https://phabricator.wikimedia.org/T233236) [13:57:16] <_joe_> !log manually changing the symlinked deployed version of parsoid on wtp1025 T236275 [13:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:20] T236275: Parsoid-php doesn't get updated after a code deploy - https://phabricator.wikimedia.org/T236275 [13:58:59] (03PS1) 10Andrew Bogott: cloud-vps: stub out the (unused-on-VMs) profile::backup::ferm_directors [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239) [14:00:58] !log Restarting CI Jenkins [14:01:00] (03CR) 10Jcrespo: "Is this really part of T229209? Will help anyway if it isn't, but I don't understand the context." [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239) (owner: 10Andrew Bogott) [14:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:47] (03CR) 10Ori.livneh: "Sorry, I don't have access to beta. Alex cherry-picked it for me. Alex, can you remove it from there? 
(And could you add me to the beta pr" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [14:10:45] (03PS2) 10Andrew Bogott: cloud-vps: stub out some unused-on-vms puppetmaster bits [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239) [14:11:22] (03CR) 10Reedy: "Just added you to the beta project as an admin :)" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [14:12:47] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: stub out some unused-on-vms puppetmaster bits [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239) (owner: 10Andrew Bogott) [14:13:19] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [14:14:01] XioNoX: ^ [14:15:39] (03CR) 10Eevans: rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [14:18:41] (03PS1) 10BBlack: Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545570 (https://phabricator.wikimedia.org/T235805) [14:18:48] 10Operations, 10Puppet: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) p:05Triage→03Normal [14:18:57] (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545570 (https://phabricator.wikimedia.org/T235805) (owner: 10BBlack) [14:19:03] (03PS1) 10CDanis: Repool esams [dns] - 10https://gerrit.wikimedia.org/r/545571 (https://phabricator.wikimedia.org/T235805) [14:19:20] !log repooling esams [14:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:33] cdanis: beware of the duplicate patch [14:19:35] (03Abandoned) 10CDanis: Repool esams [dns] - 10https://gerrit.wikimedia.org/r/545571 (https://phabricator.wikimedia.org/T235805) (owner: 10CDanis) [14:19:37] (03PS1) 10Jbond: puppet: manage localcacert 
in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [14:19:39] indeed [14:19:43] ah, you saw it [14:20:39] (03CR) 10jerkins-bot: [V: 04-1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [14:22:47] (03PS2) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [14:24:02] RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 83.83 ms [14:24:52] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms [14:24:58] RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 83.75 ms [14:25:08] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:26:19] alright [14:26:23] finally [14:26:58] \o/ [14:27:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [14:27:23] FYI, OSPF doesn't want to establish if the MTU is not *exactly* the same on both sides [14:27:48] I spent 1h trying to figure out what was wrong with my ospf [14:28:05] MTU, the gift that will always keep on giving :P [14:28:38] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 42, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:28:40] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.75 ms [14:31:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 55.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:35:17] (03PS1) 10Volans: metamonitoring: add sync of Icinga 
contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) [14:35:48] !log ema@cumin1001 START - Cookbook sre.hosts.decommission [14:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:07] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:13] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3007.esams.wmnet` - cp3007.esams.wmnet (**PASS**) - Downtimed host on Icin... [14:36:42] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [14:37:20] (03CR) 10jerkins-bot: [V: 04-1] metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [14:37:30] !log ema@cumin1001 START - Cookbook sre.hosts.decommission [14:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:05] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [14:38:08] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3008.esams.wmnet` - cp3008.esams.wmnet (**FAIL**) - Downtimed host on Icin... 
[14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:46] !log ema@cumin1001 START - Cookbook sre.hosts.decommission [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:04] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 46, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:40:20] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [14:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:24] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3010.esams.wmnet` - cp3010.esams.wmnet (**PASS**) - Downtimed host on Icin... [14:40:29] (03PS1) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [14:41:53] 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ema) >>! In T208585#5599005, @ops-monitoring-bot wrote: > - **Failed to power off, manual intervention required**: Remote IPMI for cp3008.mgmt.esams.wmnet failed (exit=... 
[14:42:15] (03CR) 10Mobrovac: [C: 03+1] rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [14:42:22] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 93.22 ms [14:44:36] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:44:56] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [14:45:14] (03CR) 10BPirkle: [C: 03+1] rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [14:45:30] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 35.94 le 60 Ema Expected eqiad traffic drop due to esams repool https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:46:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:46:16] (03PS3) 10Eevans: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) [14:46:26] (03CR) 10Eevans: [C: 03+1] rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [14:47:18] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Juniper alarm active [14:47:52] PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:48] (03PS1) 10Jhedden: toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579 [14:50:18] (03PS3) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [14:50:36] 
(03PS2) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) [14:50:51] (03CR) 10jerkins-bot: [V: 04-1] toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579 (owner: 10Jhedden) [14:52:06] (03PS2) 10Jhedden: toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579 [14:52:11] (03PS2) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) [14:54:14] (03CR) 10jerkins-bot: [V: 04-1] metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [14:55:32] (03PS3) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) [14:56:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/545579 (owner: 10Jhedden) [14:59:57] (03CR) 10Jhedden: [C: 03+2] toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579 (owner: 10Jhedden) [15:01:46] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms [15:01:48] RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 93.04 ms [15:01:55] cabling issue ^ [15:02:22] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 76.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:02:42] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: fix serviceaccount names [puppet] - 10https://gerrit.wikimedia.org/r/545587 (https://phabricator.wikimedia.org/T236074) [15:02:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:04:08] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.70 ms [15:09:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: ingress: fix serviceaccount names [puppet] - 10https://gerrit.wikimedia.org/r/545587 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez) [15:10:58] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:11:12] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:15:30] (03PS4) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) [15:16:16] 10Operations, 10cloud-services-team (Kanban): puppet ca_server confusion - 
https://phabricator.wikimedia.org/T176437 (10Andrew) 05Open→03Declined Lots of things have changed since I wrote this; closing for now until I'm confused anew in the future. [15:16:55] (03CR) 10Jforrester: [C: 03+1] "Let's do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542658 (owner: 10Krinkle) [15:17:11] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:20:28] (03CR) 10Volans: "Compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [15:23:46] 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Bstorm) a:05Andrew→03None [15:24:08] 10Puppet, 10cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10Bstorm) a:05Andrew→03None [15:30:34] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10Papaul) @robh there is no dns3003 the last server is bast3003 so only dns300[1-2] [15:32:04] (03PS1) 10Papaul: DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 [15:32:10] 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10BBlack) confirming above - @papaul is correct. The total set of new esams Linux boxes AFAIK is: 16x caches, 3x LVS, 2x DNS, 1x Bastion, 3x Ganeti. 
[15:32:11] !log Enable slow query log 1/20 on db1089 (enwiki) T223151 [15:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:16] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [15:32:30] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul) [15:33:25] 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH) [15:33:27] (03PS5) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) [15:33:40] 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10RobH) [15:35:26] (03CR) 10Thcipriani: [C: 03+1] "Looks like a good first step." [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [15:35:59] (03PS1) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) [15:36:03] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 
1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:36:32] (03CR) 10CDanis: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [15:36:54] 10Operations, 10Puppet, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10Andrew) [15:37:17] (03PS2) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [15:37:58] (03CR) 10jerkins-bot: [V: 04-1] parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) (owner: 10Giuseppe Lavagetto) [15:39:17] (03CR) 10jerkins-bot: [V: 04-1] systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli) [15:40:06] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Designate seems very slow to delete records? - https://phabricator.wikimedia.org/T149057 (10JHedden) 05Open→03Resolved a:03JHedden Record deletes are working as expected now, likely resolved from the OpenStack upgrades and service improvem... 
[15:42:23] RECOVERY - Memory correctable errors -EDAC- on mw1252 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1252&var-datasource=eqiad+prometheus/ops [15:42:31] (03PS1) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545603 [15:43:17] (03PS4) 10Giuseppe Lavagetto: LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:43:28] (03PS3) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [15:43:48] (03CR) 10jerkins-bot: [V: 04-1] LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:43:55] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:45:14] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10aborrero) [15:45:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545603 (owner: 10Jbond) [15:45:32] (03CR) 10jerkins-bot: [V: 04-1] systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli) [15:45:45] (03CR) 10Jbond: [C: 03+2] jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545603 (owner: 10Jbond) [15:45:46] 10Operations: reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812 (10Andrew) [15:48:03] (03PS4) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [15:48:36] (03PS5) 10Giuseppe Lavagetto: LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:50:12] (03PS1) 10Jbond: Revert "jenkins: use correct java version on buster" [puppet] - 10https://gerrit.wikimedia.org/r/545605 [15:51:10] (03CR) 10Jbond: [C: 03+2] Revert "jenkins: use correct java version on buster" [puppet] - 10https://gerrit.wikimedia.org/r/545605 (owner: 10Jbond) [15:52:17] (03CR) 10Giuseppe Lavagetto: [C: 03+2] LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:52:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:54:15] RECOVERY - HTTP availability for 
Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:54:17] (03PS5) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [15:54:31] (03CR) 10Effie Mouzeli: "Looks ok https://puppet-compiler.wmflabs.org/compiler1002/19023/mw1222.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli) [15:54:35] (03PS6) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) [15:55:12] <_joe_> !log restarting pybal on lvs2006, then 2003 for picking up parsoid-php [15:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:22] (03CR) 10Paladox: [C: 03+1] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [15:56:32] mhh re: the 503s are still there albeit at a smaller rate, looks like from this dashboard that cp3030 might be in trouble? 
https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend [15:56:37] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:56:40] cc bblack ^ [15:56:47] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:57:55] 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) [[ https://logstash.wikimedia.org/goto/0493475ebf5b04d14b38741e3c75261a | And now it's dropped off for a few hours. ]] [15:58:24] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10Andrew) [15:59:23] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp2018.codfw.wmnet, wtp2019.codfw.wmnet, wtp2015.codfw.wmnet, wtp2001.codfw.wmnet, wtp2020.codfw.wmnet, wtp2006.codfw.wmnet, wtp2009.codfw.wmnet, wtp2016.codfw.wmnet, wtp2008.codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa [16:00:57] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:01:18] <_joe_> known ^^ [16:01:22] (03PS1) 10Giuseppe Lavagetto: lvs: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/545608 [16:01:23] <_joe_> the pybal alert [16:01:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "sigh." 
[puppet] - 10https://gerrit.wikimedia.org/r/545608 (owner: 10Giuseppe Lavagetto) [16:02:03] 10Operations, 10serviceops, 10Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10colewhite) p:05Triage→03Normal [16:02:37] 10Operations, 10Discovery-Search, 10vm-requests: setup/install airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10colewhite) p:05Triage→03Normal [16:03:27] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp2018.codfw.wmnet, wtp2019.codfw.wmnet, wtp2020.codfw.wmnet, wtp2001.codfw.wmnet, wtp2015.codfw.wmnet, wtp2006.codfw.wmnet, wtp2009.codfw.wmnet, wtp2016.codfw.wmnet, wtp2008.codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa [16:04:28] ^ known [16:05:01] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:05:47] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:08:14] <_joe_> ok, good [16:08:25] <_joe_> I will run puppet on the icinga hosts in a few minutes [16:13:01] (03CR) 10Dzahn: [C: 04-1] DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul) [16:14:01] (03CR) 10BPirkle: [C: 03+1] "For completeness, +1 to rebased patch set 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [16:14:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:15:13] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 
91.198.174.245, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:17:36] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10wiki_willy) Hi @jijiki - I think there are a couple things that @Jclark-ctr needs to check and resolve, before @RobH can configure it. After that, the alert should go away. Th... [16:18:15] i got distracted with payments cert stuff [16:18:29] 10Operations, 10Core Platform Team, 10serviceops: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10jijiki) [16:21:08] 10Operations, 10Core Platform Team, 10serviceops: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10jijiki) mw1317 will be reimaged, but not yet. We will keep it around (but off production) until someone can have a closer look [16:22:52] (03PS2) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) [16:24:20] 10Operations, 10MediaWiki-Maintenance-scripts, 10cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (10Andrew) This ought to be fixed now -- please let me know if it is not! [16:24:57] (03PS3) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. 
[puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) [16:26:44] (03PS6) 10Eevans: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) [16:27:51] (03PS2) 10Dzahn: DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul) [16:29:37] (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul) [16:30:54] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) >>! In T191183#5597740, @hashar wrote: > Lowering priority since clearly we have no bandwith to work on addin... [16:31:38] 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA (access to private Phabricator tasks) - https://phabricator.wikimedia.org/T161822 (10Lgruwell-WMF) Not sure the process here, but I approve of Spatton and Jrobell having access to WMF-NDA. [16:33:10] (03PS4) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. 
[puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) [16:33:15] 04Critical Alert for device ps1-23-ulsfo.mgmt.ulsfo.wmnet - Device rebooted [16:34:43] (03CR) 10Dzahn: [C: 03+2] "deployed - all working except cp3057 is not found - i dont see yet why that is despite the fix" [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul) [16:35:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/19026/wtp1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) (owner: 10Giuseppe Lavagetto) [16:35:46] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [16:35:48] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 15 cp3055 : xe-5/0/15 cp3056: xe-5/0/16 cp3057: xe-5/0/17 cp3058: xe-5/0/18 cp3059: xe-5/0... 
[16:37:12] 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Papaul) @BBlack dn3002 racked in rack 15 switch information xe-5/0/14 [16:37:39] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:38:05] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:38:07] 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Papaul) ganeti3002 switch information xe-5/0/13 [16:38:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:38:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:39:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:39:41] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:39:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:39:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:42:37] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH) [16:42:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:42:58] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH) [16:43:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-23-ulsfo.mgmt.ulsfo.wmnet recovered from Device rebooted [16:43:40] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH) [16:44:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:45:03] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3005 switch information xe-5/0/12 [16:45:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:45:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:46:15] (03PS1) 10Giuseppe Lavagetto: parsoid: switch command with sudo rule to check-and-restart-php [puppet] - 10https://gerrit.wikimedia.org/r/545618 [16:46:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:46:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:48:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] parsoid: switch command with sudo rule to check-and-restart-php [puppet] - 10https://gerrit.wikimedia.org/r/545618 (owner: 10Giuseppe Lavagetto) [16:49:39] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) @lexnasser Please create a new SSH key that is not used in cloud and let us know the public part so we can update the production access. 
[16:49:54] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) a:05Dzahn→03lexnasser [16:54:52] PROBLEM - Host re0.cr2-esams is DOWN: PING CRITICAL - Packet loss = 100% [16:55:00] PROBLEM - LVS HTTP IPv4 #page on parsoid.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.28 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:55:09] <_joe_> sigh [16:55:16] <_joe_> I have no idea why it's paging [16:55:16] PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:55:17] 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Juniper alarm active [16:55:19] <_joe_> or better, I know [16:55:27] _joe_: new service or need help? [16:55:31] <_joe_> but please ignore, I need to understand what did I do wrong [16:55:33] <_joe_> new service [16:55:36] ok [16:55:46] XioNoX: esams network stuff above? [16:55:53] ok [16:56:08] <_joe_> volans: can you downtime the other one in codfw please? [16:56:11] <_joe_> it will page too [16:56:13] sure [16:56:18] <_joe_> thanks [16:56:19] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10lexnasser) Here's another public ED25519 key: AAAAC3NzaC1lZDI1NTE5AAAAIOBTDDmL8isvso6xqOJB5qkk3n8xuM0XxFc1Q34ZnZRj Let me know which service is associated with which k... 
[16:56:46] PROBLEM - LVS HTTP IPv4 #page on parsoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.28 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:56:52] ETOOLATE [16:56:54] <_joe_> heh [16:56:55] sorry [16:56:57] <_joe_> sorry [16:57:00] <_joe_> no I noticed late too [16:57:01] was about to [16:57:09] yeah the juniper alarms you can ignore for now [16:57:22] there is some cables shuffling going on [16:57:24] _joe_: it's only the HTTP [16:57:27] the HTTPS is green [16:57:31] heh oh well [16:57:38] does it even listen to http? [16:57:39] <_joe_> yes, I don't get why it defined the http too [16:57:46] ok [16:57:47] <_joe_> volans: it does but the port is filtered [16:58:28] RECOVERY - Juniper alarms on cr2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [16:58:32] <_joe_> I guess there is some magic I forgot about going on here [16:58:40] need help? [16:58:47] <_joe_> akosiaris: might be [16:59:20] parsoid? [16:59:28] <_joe_> jynus: ignore [16:59:44] sorry [16:59:45] read too late [16:59:50] <_joe_> np :) [16:59:58] (03CR) 10Nuria: [C: 04-1] "Let's make sure staff uses staff e-mail though, I think that should be easy to change." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [17:00:02] <_joe_> akosiaris: I don't get why the http check is defined too [17:00:04] RECOVERY - Host re0.cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 88.35 ms [17:00:05] parsoid is listening just fine on 8000 [17:00:25] and the firewall rule is ok as well [17:00:29] e.g. 
on wtp1025 [17:00:35] <_joe_> akosiaris: this is parsoid-php [17:00:42] <_joe_> it listens on 443, but not 80 [17:00:46] sigh, port 80 [17:00:49] never mind [17:00:55] <_joe_> but somehow what I wrote in lvs::configuration activated both [17:01:12] there is some really weird logic in one place about the icinga checks [17:01:14] it uses Lvs::Monitor_service_http_https [17:01:58] <_joe_> volans: does it? [17:02:02] from puppetboard [17:02:03] yes [17:02:10] Lvs::Monitor_service_http_https[parsoid.svc.codfw.wmnet] [17:02:14] <_joe_> yeah [17:02:37] <_joe_> then I don't get how api and api-https can coexist [17:02:44] <_joe_> oh right I see now [17:02:46] <_joe_> gosh [17:02:51] that contains Monitoring::Service[parsoid.svc.codfw.wmnet] [17:02:55] that is the http version [17:02:56] <_joe_> the wizardry lvs::monitor [17:02:56] of the check [17:03:10] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [17:03:31] _joe_: I think there is a quick fix here [17:03:33] <_joe_> volans: yeah so we get some duplicates that end up coinciding in the output, ugh [17:03:40] <_joe_> cdanis: there are a couple, yes [17:03:52] <_joe_> the easiest one is the one I'm going to apply now [17:04:21] if not needed I need to step out [17:04:32] RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms [17:04:41] <_joe_> this is not an emergency [17:04:46] <_joe_> parsoid-php is not in production [17:04:57] <_joe_> it's just ironic I tried to fix that check so that it won't page [17:05:06] (03PS1) 10CDanis: lvs parsoid-php workaround [puppet] - 10https://gerrit.wikimedia.org/r/545619 [17:05:08] <_joe_> and some abstraction we created years ago is biting me [17:05:29] :D [17:05:32] ttyl [17:05:33] <_joe_> cdanis: check_https_lvs [17:05:40] <_joe_> if it exists [17:05:49] lol it does [17:05:51] wrong copy and paste [17:06:46] <_joe_> check_https_lvs_on_port [17:06:52] <_joe_> but doesn't support the hostname [17:06:54] <_joe_> ahah [17:07:16] 
monitor_service_http_https calls check_http_lvs [17:07:34] <_joe_> check_https_url [17:07:47] <_joe_> this is what you probably want [17:07:52] yes [17:07:53] you are right [17:08:13] (03PS2) 10CDanis: lvs parsoid-php workaround [puppet] - 10https://gerrit.wikimedia.org/r/545619 [17:08:20] and relies on $check_command to differentiate between the simple form and adding both http/https [17:09:14] <_joe_> akosiaris: yeah I expected that specifying just the uri would result in a check of the port to call [17:09:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs parsoid-php workaround [puppet] - 10https://gerrit.wikimedia.org/r/545619 (owner: 10CDanis) [17:09:35] 3rd time this is biting us in a couple of months [17:09:55] <_joe_> time to fix that horror and rewrite it in puppet 4+ [17:09:55] It's probably about time we rework the entire structure + corresponding code [17:10:03] https://puppet-compiler.wmflabs.org/compiler1002/19031/icinga1001.wikimedia.org/ [17:10:08] it's been there since 2014? 
[17:10:14] <_joe_> yeah [17:10:19] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19031/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/545619 (owner: 10CDanis) [17:10:20] <_joe_> earlier possibly [17:10:22] full of assumptions to accommodate the status quo stanza [17:10:33] (03CR) 10Cwhite: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [17:11:18] <_joe_> yeah but more in general, I'd like to rethink how we set up a new service from scratch [17:11:27] it is far too complicated right now [17:11:28] <_joe_> ideally I'd like to make 1, 2 puppet commits tops [17:11:31] <_joe_> yes [17:11:41] and even those who understand every part get things wrong sometimes ;) [17:12:08] <_joe_> cdanis: tbh I completely forgot how the tricks we do in ruby in lvs::monitor allow duplicate declarations :D [17:12:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:12:47] <_joe_> I would also love not to have to restart pybal when we add a service, but that's not exactly easy [17:13:20] RECOVERY - LVS HTTP IPv4 #page on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14776 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:13:34] <_joe_> it still adds both, sigh wtf [17:13:44] that isn't what it looked like should happen in pcc [17:13:45] <_joe_> at least it's all green [17:14:12] no it doesn't... 
I only see 2 now, 1 per DC [17:14:27] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=parsoid.svc# [17:14:32] <_joe_> yes it removed the https definitions [17:14:36] <_joe_> and kept the http [17:14:38] yes [17:14:39] <_joe_> who call https [17:14:40] RECOVERY - LVS HTTP IPv4 #page on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14776 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:14:41] <_joe_> ahahahahah [17:14:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:14:45] ;-) [17:14:45] <_joe_> ok whatever [17:14:51] it's a workaround of a workaround of a workaround [17:14:53] what did you expect ;) [17:14:54] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:14:55] <_joe_> I need to go afk at least for a few hours [17:15:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:01] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d 
[17:15:08] <_joe_> can someone look at the availability issue? [17:15:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:19] _joe_: it's being looked at [17:15:20] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:54] some varnish backend again ? [17:15:58] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:16:04] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:16:04] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:16:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:16:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:16:47] <_joe_> I would say restbase [17:16:54] <_joe_> given we have errors coming from ats [17:17:05] <_joe_> (they have referer=envoy) [17:17:35] <_joe_> oh no right now it's commons' api [17:17:55] <_joe_> and cp1077 it seems [17:18:11] my guesses say restbase1081 [17:18:19] <_joe_> that doesn't exist [17:18:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:18:34] cp1081* [17:18:35] dammit [17:18:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:18:43] <_joe_> look at the last few minutes [17:18:50] <_joe_> it was 1081 before [17:18:59] <_joe_> but the last spike is 1077 indubitably [17:19:13] and there's cp1089 in the middle, maxing out on connections to backends [17:19:22] but it's spread amongst servers reasonably well [17:19:30] so it's something about the traffic or appserver behavior [17:19:35] which is bouncing between varnishes [17:19:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:20:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:20:38] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:20:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:20:46] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:20:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:21:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:21:50] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:22:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:22:00] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:22:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:22:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:25:26] !log restart varnish-be on cp1081 as a response to HTTP availability alerts [17:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:11] I did restart -be anyway. It seems to have recovered. I'd say correlation is not causation, but maybe it was in this case? [17:28:11] akosiaris: I don't think so, according to https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now there were lots of other backend instances suffering as well [17:28:40] still others at or close to their parallelism limit for connections to appservers [17:28:56] to me that indicates something pathological about the traffic being handled [17:30:03] cp1087's mailbox lag is through the roof as well [17:30:12] chances are we are going to see a problem again [17:31:40] !log restart varnish-be on cp1089 as a response to HTTP availability alerts. High mailbox lag [17:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:07] let's see now [17:32:23] and restarting wipes caches too, so it's not without detriment to roll through restarting them all either [17:32:48] it's the backends, it shouldn't, right? 
[17:33:08] I mean the cache is on disk (for that weird definition of disk that is now varnish on disk) [17:33:43] the disk cache is ephemeral, it's wiped on every restart of the daemon [17:34:14] sigh, I forgot about that [17:34:15] we can handle it within reason, but there will be a spike of increased misses for a while as you roll through them [17:34:15] akosiaris: you can see this for instance on https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&panelId=8&fullscreen&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now [17:34:56] cp1077 again has lots of inuse connections to appservers and api_appservers [17:35:43] so whatever it is is shifting between varnishes? [17:37:17] yeah the failed fetches are now on to cp1077 [17:37:34] ok no more whack a mole, need to find what's going on [17:37:49] it is possibly just one URL that has gotten very slow and is being hammered [17:37:56] well two URLs right? [17:38:00] api + appservers [17:38:09] they're separate pools, so that part I don't get [17:38:23] mm true [17:39:03] there are elevated inuse connections on most varnishes though, although generally one is much more pronounced at any given time [17:39:25] right [17:39:41] but only to api/appservers, not elevated to other distinct backend services? [17:40:21] more often than not, just api/appservers; sometimes, also restbase -- but it's hard to tease that apart, of course [17:40:33] RB looks pretty elevated in some of that too, yeah [17:40:54] there can of course be systemic effects where they all mix together in varnish, too [17:49:20] looks like a ton of objects are being created e.g. 
on cp1077 currently affected https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?panelId=13&fullscreen&orgId=1&var-server=cp1077&var-datasource=eqiad%20prometheus%2Fops&from=1571831334213&to=1571852934213 [17:49:36] (03PS1) 10Dzahn: admins: temp disable lexnasser's shell account [puppet] - 10https://gerrit.wikimedia.org/r/545624 [17:49:50] (03PS2) 10Dzahn: admins: temp disable lexnasser's shell account [puppet] - 10https://gerrit.wikimedia.org/r/545624 [17:49:50] are those inspectable in varnish ? [17:49:53] (03CR) 10jerkins-bot: [V: 04-1] admins: temp disable lexnasser's shell account [puppet] - 10https://gerrit.wikimedia.org/r/545624 (owner: 10Dzahn) [17:51:45] (03CR) 10jerkins-bot: [V: 04-1] admins: temp disable lexnasser's shell account [puppet] - 10https://gerrit.wikimedia.org/r/545624 (owner: 10Dzahn) [17:52:11] (03PS3) 10Dzahn: admins: temp disable lexnasser's shell account [puppet] - 10https://gerrit.wikimedia.org/r/545624 [17:54:50] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:55:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:55:02] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:55:30] PROBLEM - HTTP availability 
for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:55:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:56:24] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:56:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:56:38] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:57:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:57:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:01:46] Daimona: around? i can deploy for https://phabricator.wikimedia.org/T236286 [18:02:44] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:02:56] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:02:56] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:03:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:03:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:03:44] 10Operations, 10ops-esams, 10netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (10ayounsi) 05Open→03Resolved Done. [18:03:47] 10Operations, 10Traffic, 10netops, 10Wikimedia-Incident: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (10ayounsi) [18:03:50] 10Operations, 10ops-esams, 10netops: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (10ayounsi) [18:04:01] 10Operations, 10ops-esams, 10Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061 (10ayounsi) [18:04:02] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:04:03] 10Operations, 10ops-esams, 10netops: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (10ayounsi) 05Open→03Resolved a:03ayounsi Done. [18:04:18] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:04:22] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:04:30] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:04:30] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:05:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:05:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:06:26] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:06:30] PROBLEM - Host multatuli.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:02] PROBLEM - Host cp3030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:02] PROBLEM - Host cp3038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:11] brennen: So and so [18:07:12] PROBLEM - Host cp3035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:25] I'd say yes for the next 5 minutes or so [18:07:42] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:43] it's probably not an issue, but i may wait until [18:07:44] PROBLEM - Host lvs3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:51] er, may wait until error traffic subsides... 
[18:07:52] PROBLEM - Host lvs3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:52] PROBLEM - Host lvs3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:52] PROBLEM - Host lvs3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:07:52] PROBLEM - Host maerlant.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:08] PROBLEM - Host cp3039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:16] PROBLEM - Host nescio.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:28] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [18:08:32] yes, please hold deploys for now [18:08:36] ack [18:08:52] PROBLEM - Host cp3033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:52] PROBLEM - Host bast3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:52] PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:52] PROBLEM - Host cp3036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:52] PROBLEM - Host cp3040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:52] PROBLEM - Host cp3042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:52] PROBLEM - Host cp3041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:08:53] PROBLEM - Host cp3043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:02] PROBLEM - Host cp3044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:02] PROBLEM - Host cp3046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:04] PROBLEM - Host cp3047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:04] PROBLEM - Host cp3045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:04] PROBLEM - Host cp3049.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:18] PROBLEM - Host cp3032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:09:26] PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:36] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:09:38] the management router crashed [18:09:40] RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 83.88 ms [18:09:40] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.70 ms [18:09:59] I think they were downtimed too [18:10:45] Oh, didn't even notice them... In case I disappear, testing is pretty easy: head to Special:AbuseFilter/new and ensure that the "Actions to take when matched" section is not empty [18:10:46] i don't think they were. unexpected [18:10:59] Let me provide examples [18:11:14] This is how it should look: https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:AbuseFilter/new [18:11:49] Whereas currently it's empty (see e.g. https://phabricator.wikimedia.org/F30876704) [18:12:02] everything should be back to normal [18:12:06] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:12:13] i was about to silence the alerts but then decided it's nicer to see it coming back [18:12:31] XioNoX: thanks! [18:12:44] 10Operations, 10serviceops: php-fpm invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10CCicalese_WMF) It does not look like there is work for #core_platform_team to do on this at this point, but @tstarling may want to take a look. [18:12:47] No 'recovery' alerts? 
[18:13:09] I think it crashed again [18:13:16] what the [18:13:24] PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% [18:14:24] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:14:57] (03PS1) 10CRusnov: librenms: Handle the case where hardware is null [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545628 [18:15:35] (03CR) 10CRusnov: [C: 03+2] librenms: Handle the case where hardware is null [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545628 (owner: 10CRusnov) [18:15:39] (03PS1) 10BBlack: Example defensive timeout config [puppet] - 10https://gerrit.wikimedia.org/r/545629 [18:15:44] (03PS2) 10CRusnov: librenms: Handle the case where hardware is null [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/545628 [18:17:42] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:17:45] alright mr1 is dead [18:17:56] paravoid: ^ [18:22:30] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:22:46] !log mforns@deploy1001 Started deploy [analytics/refinery@1110d59]: deploying refinery up to 1110d59c3983bcff4986bce1baf885f05ee06ba5 [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:59] (03CR) 10Dzahn: [C: 03+2] admins: temp disable lexnasser's shell account [puppet] - 10https://gerrit.wikimedia.org/r/545624 (owner: 10Dzahn) [18:25:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:25:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:25:44] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:25:56] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:26:12] PROBLEM - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556151793(67gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [18:26:24] PROBLEM - HTTP availability for Nginx -SSL terminators- 
at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:26:36] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:26:42] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:26:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:26:48] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:26:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:10] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:27:30] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:27:34] (03PS7) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) [18:28:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:14] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:28:20] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:28:24] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:28:46] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:28:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:29:27] !log mforns@deploy1001 Finished deploy [analytics/refinery@1110d59]: deploying refinery up to 1110d59c3983bcff4986bce1baf885f05ee06ba5 (duration: 06m 40s) [18:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:37:54] (03CR) 10Nuria: [C: 04-1] "I think so, but let's ping user on ticket to make sure he knows it is happening." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [18:39:05] (03CR) 10Andrew Bogott: [C: 03+1] "I'm ready for you to deploy this whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542506 (https://phabricator.wikimedia.org/T223907) (owner: 10BryanDavis) [18:39:21] (03PS1) 10Dzahn: admins: re-enable shell account for lexnasser with new key [puppet] - 10https://gerrit.wikimedia.org/r/545630 (https://phabricator.wikimedia.org/T235688) [18:40:29] brennen or liw, did the train go ok this morning? Can we close https://phabricator.wikimedia.org/T236166? [18:42:32] andrewbogott: yeah, should be good. [18:42:40] great, thanks [18:43:55] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [18:45:06] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:45:48] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:45:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:46:00] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:46:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:46:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:46:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:47:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:47:30] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 
https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:49:02] !log milimetric@deploy1001 Started deploy [analytics/refinery@3aaabf6]: Minor: fix two scripts [18:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:44] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:51:30] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:51:50] (03PS1) 10BBlack: cache_text: raise appservers/api conn limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/545634 [18:52:02] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:13] (03CR) 10CDanis: [C: 03+1] cache_text: raise appservers/api conn limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/545634 (owner: 10BBlack) [18:52:26] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:52:30] (03CR) 10BBlack: [C: 03+2] cache_text: raise appservers/api conn limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/545634 (owner: 10BBlack) [18:52:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:54] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:53:50] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:54:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:55:56] (03PS2) 10Dzahn: admins: re-enable shell account for lexnasser with new key [puppet] - 10https://gerrit.wikimedia.org/r/545630 (https://phabricator.wikimedia.org/T235688) [18:56:08] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [18:56:36] (03PS3) 10Bstorm: monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) [18:56:56] !log milimetric@deploy1001 Finished deploy [analytics/refinery@3aaabf6]: Minor: fix two scripts (duration: 07m 53s) [18:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:13] (03CR) 10Dzahn: [C: 03+2] admins: re-enable shell account for lexnasser with new key [puppet] - 10https://gerrit.wikimedia.org/r/545630 (https://phabricator.wikimedia.org/T235688) (owner: 10Dzahn) [18:59:48] alright, mr1 is now booting from the USB drive [19:00:38] (03CR) 10Bstorm: [C: 03+2] monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm) [19:03:50] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:03:52] RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 90.73 ms [19:03:58] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.69 ms [19:04:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:00] RECOVERY - Host multatuli.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.94 ms [19:05:33] RECOVERY - 
Host cp3030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.50 ms [19:05:33] RECOVERY - Host cp3038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.50 ms [19:05:33] RECOVERY - Host cp3035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms [19:05:43] bblack / XioNoX: clear for deploys at this point? [19:05:52] RECOVERY - Host lvs3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.40 ms [19:05:59] RECOVERY - Host lvs3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.39 ms [19:05:59] RECOVERY - Host lvs3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.89 ms [19:05:59] RECOVERY - Host lvs3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms [19:05:59] RECOVERY - Host maerlant.mgmt is UP: PING OK - Packet loss = 0%, RTA = 89.32 ms [19:06:27] RECOVERY - Host nescio.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms [19:06:27] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 93.18 ms [19:06:29] RECOVERY - Host bast3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.82 ms [19:06:39] RECOVERY - Host cp3033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.92 ms [19:06:39] RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.39 ms [19:06:39] RECOVERY - Host cp3036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 97.72 ms [19:06:39] RECOVERY - Host cp3040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 96.43 ms [19:06:39] RECOVERY - Host cp3039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 96.52 ms [19:06:39] RECOVERY - Host cp3041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.81 ms [19:06:39] RECOVERY - Host cp3042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.21 ms [19:06:40] RECOVERY - Host cp3043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.04 ms [19:06:50] (03PS1) 10Jbond: puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 [19:06:53] RECOVERY - Host cp3044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.62 ms [19:06:53] RECOVERY - Host cp3046.mgmt is UP: PING OK - 
Packet loss = 0%, RTA = 88.33 ms [19:06:57] RECOVERY - Host cp3045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.36 ms [19:06:57] RECOVERY - Host cp3047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.07 ms [19:06:57] RECOVERY - Host cp3049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.96 ms [19:07:09] RECOVERY - Host cp3032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.33 ms [19:07:21] RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms [19:09:03] (03CR) 10jerkins-bot: [V: 04-1] puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond) [19:09:26] brennen: I think for now we need to keep holding a bit, we still don't really understand what's going on with massive request parallelism/timeouts [19:10:24] esams mgmt is back to a good enough state for tonight [19:10:26] bblack: cool, thanks for update. i may not be able to test the patch i've got here anyway, so it can probably wait until J.ames_F is online. [19:10:37] ok [19:10:58] XioNoX: ack, i hope you guys get some rest after long day now. what a timing [19:11:53] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:12:31] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [19:14:18] (03PS2) 10Jbond: puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 [19:19:03] (03CR) 10Alex Monk: [C: 03+1] puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond) [19:19:44] (03PS3) 10Jbond: puppet: clean up unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 [19:23:08] (03CR) 10Cwhite: [C: 03+2] admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron) [19:23:18] 04Critical Alert for device mr1-esams.wikimedia.org - Juniper alarm active [19:25:46] (03PS2) 10Cwhite: admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron) [19:28:30] (03CR) 10Cwhite: [C: 03+2] admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron) [19:51:30] (03PS1) 10Anomie: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) [20:00:05] cscott, arlolra, subbu, halfak, and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T2000). 
[20:13:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:15:41] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:15:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:03] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:16:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:16:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:17:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:17:17] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:17:29] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:17:39] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:17:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:17:51] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:18:23] (03PS1) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) [20:29:31] (03CR) 10Dzahn: [C: 03+2] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [20:29:33] (03PS1) 10MarcoAurelio: Restrict uploads on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545655 (https://phabricator.wikimedia.org/T236307) [20:31:21] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:31:35] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [20:31:37] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:23] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:32:31] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:32:39] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:32:43] PROBLEM - MD RAID on 
stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:32:51] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:34:15] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:35:55] ^^ stat1007 looks pretty busy running an R program [20:36:29] shdubsh: unfortunately that's common. restart nagios-nrpe-server should fix it all [20:36:39] it always gets killed first by OOM killer [20:36:59] and stat1007 often has this issue that user jobs use all the RAM [20:37:11] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:37:17] it's https://phabricator.wikimedia.org/T212824 [20:37:19] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:37:25] odd, memory utilization is really low [20:37:27] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:37:31] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:37:35] it's always the same explanation [20:37:39] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:37:45] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full 
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [20:37:46] !log restart nagios-nrpe-server on stat1007 [20:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:59] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [20:37:59] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:48] (03CR) 10RLazarus: [C: 03+1] "I'm still learning how these files work, but seems legit!" [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [20:38:55] interesting, there was a lot of memory utilization just before [20:39:40] these are things run manually by people.. so who knows [20:39:49] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:40:01] it's often R.. ack [20:40:33] sometimes i sent a message to | wall that it's causing an issue [20:41:09] seems strange that the oom killer taking out R also takes out nrpe [20:41:24] oh, heh [20:41:33] nagios-nrpe-server.service: Failed to fork: Cannot allocate memory [20:41:40] that'll do it [20:42:09] (03PS1) 10BBlack: Basic install for new esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) [20:42:30] unfortunately it is often (always?) 
the first victim of the killer [20:42:36] then turning into that icinga spam [20:46:38] it must have a low priority or something [20:53:25] (03PS1) 10Ayounsi: New esams stuff [homer/public] - 10https://gerrit.wikimedia.org/r/545660 (https://phabricator.wikimedia.org/T235805) [20:54:35] here is the suggestion to put the users into a different slice https://phabricator.wikimedia.org/T212824#4967798 [20:55:35] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [20:57:09] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:02:04] (03CR) 10Dzahn: [C: 04-2] gerrit: change gerrit master_host to gerrit1001, remove duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545342 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [21:05:37] (03PS1) 10BBlack: basic DNS entries for new esams hosts [dns] - 10https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294) [21:06:51] (03PS4) 10Dzahn: webperf: add backups for arclamp application data [puppet] - 10https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425) [21:07:32] it's so nice to see bast3003 being added [21:10:03] bblack: maybe it should be bast3004. 
because technically the decom ticket for bast3003 is open https://phabricator.wikimedia.org/T216199 [21:10:30] there was already much ambiguity hence those ticket comments from back then [21:10:38] lol [21:10:47] thanks for the heads up, agree, should rename it to bast3004 :) [21:11:00] 'k :) [21:11:53] (03PS2) 10BBlack: Basic install for new esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) [21:12:46] fixed [21:12:52] (03PS2) 10BBlack: basic DNS entries for new esams hosts [dns] - 10https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294) [21:12:53] gotta run :) [21:21:53] (03CR) 10Dzahn: [C: 03+2] webperf: add backups for arclamp application data [puppet] - 10https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425) (owner: 10Dzahn) [21:26:17] PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:45] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:27] ^ that would be me adding bacula service.. looking [21:32:31] RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:39] RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:44] !log webperf1002/2002 - starting bacula-fd service that is failed after initial puppet run turning them into backup::hosts [21:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:56] up to bast3003 already? 
[21:33:02] Krenair: 3004 :p [21:33:07] that seems quick. doesn't feel like hooft.esams was that long ago [21:33:32] Krenair: "bast3002 was broken and to be replaced with another server, bast3003, which was formerly amslvs4." :p [21:33:39] they kept breaking [21:34:03] I'm assuming that stuff would all be way out of warranty by now :P [21:34:22] yea, which is why we had to find _something_ to use as bastion [21:34:26] heh [21:34:41] but now finally new hardware, yay [21:34:51] nice [21:44:51] Krenair: we have a new server bast3003 [21:45:59] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:47:23] papaul: please rename to bast3004 because https://phabricator.wikimedia.org/T216199 [21:47:46] (b.black did in upcoming DNS changes.. but labels) [21:48:11] also, have some rest :) [21:50:18] mutante: can't sleep don't know why [21:50:37] papaul: jetlag :) [21:50:47] but will have to clarify that tomorrow since on the new order we have a new server called bast3003 too [21:51:27] yes please, that sounds like it would cause more confusion [21:51:48] and the already confusing old ticket [21:51:58] for final decom of bast3003 [21:52:13] mutante: understood [22:00:21] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) [22:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:41] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 00m 21s) [22:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:31] (03CR) 10Mathew.onipe: wdqs: add data-reload cookbook (0312 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [22:14:30] !log 
twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) [22:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:40] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 01m 10s) [22:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:14] (03PS10) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [22:16:16] (03PS1) 10Mathew.onipe: fix unused format [cookbooks] - 10https://gerrit.wikimedia.org/r/545672 [22:16:18] (03PS1) 10Mathew.onipe: Better query to host check [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 [22:16:20] Disconnecting authenticating user phab-deploy ....: Too many authentication failures [preauth] [22:16:29] wth [22:19:59] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) [22:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:03] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 00m 05s) [22:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:15] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) [22:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:36] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 00m 21s) [22:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:41] PROBLEM - Check the Netbox report librenms for fail 
status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:41:39] PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 674.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:42:25] PROBLEM - MariaDB Slave Lag: m3 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 704.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:42:51] (03CR) 10Dzahn: [C: 03+1] basic DNS entries for new esams hosts [dns] - 10https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [22:44:24] James_F: think https://phabricator.wikimedia.org/T236286 should be on mwdebug1001; mind testing? [22:47:31] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:47:39] (03CR) 10Dzahn: "dns3001 missing in DHCP?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [22:51:14] brennen: Sorry, testing now. [22:51:54] brennen: Yeah, LGTM. [22:52:10] James_F: rad, thank you. [22:55:53] !log brennen@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/AbuseFilter: SWAT: [[gerrit:545620|Unbreak filter edit form (T236286)]] (duration: 01m 05s) [22:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:57] T236286: Uncaught Error: Widget not found when editing filters - https://phabricator.wikimedia.org/T236286 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:08:57] (03PS2) 10Mathew.onipe: Better query-to-host check [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 [23:08:59] (03PS11) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [23:09:38] (03CR) 10Mathew.onipe: query_service: prepare query_service for reusbility (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [23:11:11] (03PS20) 10Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) [23:11:13] (03PS27) 10Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) [23:11:15] (03PS25) 10Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) [23:11:17] (03PS23) 10Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) [23:11:20] (03PS24) 10Mathew.onipe: query_service: properly adapt query_service profile [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) [23:11:21] (03PS24) 10Mathew.onipe: query_service: properly adapt hiera configs [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) [23:24:08] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1001/19032/" [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [23:26:24] (03CR) 10Mathew.onipe: "PCC is good: https://puppet-compiler.wmflabs.org/compiler1002/19033/" [puppet] - 10https://gerrit.wikimedia.org/r/539998 
(https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [23:26:35] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:26:35] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:27:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 29.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:27:47] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 42.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:30:21] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 105 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:30:57] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 80.15 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:31:01] (03PS1) 10Alex Monk: Swap toolforge proxies to use acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/545679 [23:31:34] (03PS2) 10Alex Monk: Swap toolforge proxies to use acme-chief certificates [puppet] - 10https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252) [23:43:26] PROBLEM - Check the Netbox report puppetdb for fail status. 
on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports