[00:11:09] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "looks good, the only nitpick i see is he is staff but not using staff email. but i don't think that's a requirement. it does match LDAP. g" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[01:07:01] <wikibugs>	 (03PS1) 10Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[01:10:49] <wikibugs>	 (03PS2) 10Jeena Huneidi: Modify Restrouter chart to allow for minikube development [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910)
[01:12:10] <wikibugs>	 (03CR) 10Jeena Huneidi: "It looks like we will need a dev image published for now because the production one doesn't include sqlite" [deployment-charts] - 10https://gerrit.wikimedia.org/r/545421 (https://phabricator.wikimedia.org/T228910) (owner: 10Jeena Huneidi)
[02:21:16] <wikibugs>	 (03PS1) 10DannyS712: Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212)
[02:21:52] <wikibugs>	 (03PS2) 10DannyS712: Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212)
[02:31:52] <wikibugs>	 (03PS7) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887)
[02:37:41] <wikibugs>	 (03PS8) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887)
[02:56:28] <wikibugs>	 (03PS9) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887)
[02:58:21] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18981384 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:00:55] <wikibugs>	 (03PS10) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887)
[03:01:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 9288 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:08:39] <wikibugs>	 (03PS11) 10Vgutierrez: ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887)
[03:26:45] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: Adjust timeouts in ats-tls and ats-backend instances [puppet] - 10https://gerrit.wikimedia.org/r/541524 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez)
[03:51:18] <vgutierrez>	 !log depool cp5007 - T234887
[03:51:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:23] <stashbot>	 T234887: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887
[04:36:42] <MaxSem>	 !log Fixed a page title via namespaceDupes.php on pswiki
[04:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9441 and previous config saved to /var/cache/conftool/dbconfig/20191023-044833-marostegui.json
[04:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:19] <vgutierrez>	 !log repool cp5007 - T234887
[04:49:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:23] <stashbot>	 T234887: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887
[04:57:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9442 and previous config saved to /var/cache/conftool/dbconfig/20191023-045722-marostegui.json
[04:57:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:06] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (10Vgutierrez) I've depooled cp5007 to conduct some experiments, I've captured the varnish-fe traffic with the following tcpdu...
[05:08:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9443 and previous config saved to /var/cache/conftool/dbconfig/20191023-050812-marostegui.json
[05:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:25] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove puppet references for dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/545429 (https://phabricator.wikimedia.org/T220002)
[05:10:49] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Remove production DNS entries for dbstore2002 [dns] - 10https://gerrit.wikimedia.org/r/545430 (https://phabricator.wikimedia.org/T220002)
[05:29:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1096:3315 after maintenance maintenance', diff saved to https://phabricator.wikimedia.org/P9444 and previous config saved to /var/cache/conftool/dbconfig/20191023-052940-marostegui.json
[05:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:52] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kibana,name=codfw
[05:29:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:46] <wikibugs>	 (03PS2) 10Ema: kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545287 (https://phabricator.wikimedia.org/T227432)
[05:31:31] <wikibugs>	 (03CR) 10Ema: [C: 03+2] kibana: add discovery record [dns] - 10https://gerrit.wikimedia.org/r/545287 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema)
[05:49:33] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: puppet: disable hiera autolookup [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto)
[05:50:16] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: environments: add environment for removing hiera autolookups [puppet] - 10https://gerrit.wikimedia.org/r/395545 (https://phabricator.wikimedia.org/T181971) (owner: 10Giuseppe Lavagetto)
[05:50:24] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: profile::mediawiki::nutcracker: explicitly set log verbosity [puppet] - 10https://gerrit.wikimedia.org/r/395717 (owner: 10Giuseppe Lavagetto)
[05:50:36] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: standard: assume standard profile structure [puppet] - 10https://gerrit.wikimedia.org/r/395546 (https://phabricator.wikimedia.org/T181971) (owner: 10Giuseppe Lavagetto)
[05:51:13] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 (owner: 10Giuseppe Lavagetto)
[05:51:51] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: service::node: restrict readability of configurations. [puppet] - 10https://gerrit.wikimedia.org/r/309522 (owner: 10Giuseppe Lavagetto)
[05:52:22] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: hiera: first step of simplification [puppet] - 10https://gerrit.wikimedia.org/r/402347 (owner: 10Giuseppe Lavagetto)
[05:52:50] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: Create flake8 rules that make sense in our context [debs/pybal] - 10https://gerrit.wikimedia.org/r/355784 (owner: 10Giuseppe Lavagetto)
[05:53:11] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: site.pp: merge videoscalers into the jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/437776 (owner: 10Giuseppe Lavagetto)
[05:53:41] <wikibugs>	 (03Abandoned) 10Giuseppe Lavagetto: jobrunner/videoscaler: factor out "base" roles to use in beta [puppet] - 10https://gerrit.wikimedia.org/r/437406 (owner: 10Giuseppe Lavagetto)
[06:38:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1097:3315 for compression T235599', diff saved to https://phabricator.wikimedia.org/P9445 and previous config saved to /var/cache/conftool/dbconfig/20191023-063800-marostegui.json
[06:38:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:06] <stashbot>	 T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599
[06:38:31] <marostegui>	 !log Compress tables on db1097:3315 T235599
[06:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:46:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Remove puppet references for dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/545429 (https://phabricator.wikimedia.org/T220002) (owner: 10Marostegui)
[06:47:40] <wikibugs>	 (03PS2) 10Marostegui: wmnet: Remove production DNS entries for dbstore2002 [dns] - 10https://gerrit.wikimedia.org/r/545430 (https://phabricator.wikimedia.org/T220002)
[06:48:08] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for dbstore2002 [dns] - 10https://gerrit.wikimedia.org/r/545430 (https://phabricator.wikimedia.org/T220002) (owner: 10Marostegui)
[06:50:23] <wikibugs>	 10Operations, 10Discovery-Search, 10vm-requests: setup/install airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10elukey) a:05RobH→03None
[06:50:25] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore200.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Marostegui) a:05RobH→03Papaul
[06:50:35] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore200.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Marostegui) These two hosts are ready for switch disablement and on-site steps
[06:51:44] <wikibugs>	 10Operations, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Marostegui)
[06:52:17] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission dbstore2001.codfw.wmnet and dbstore2002.codfw.wmnet - https://phabricator.wikimedia.org/T220002 (10Marostegui)
[06:53:11] <wikibugs>	 (03PS1) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805)
[06:54:01] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi)
[06:54:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:19] <wikibugs>	 10Operations, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `dbstore1001.eqiad.wmnet` -  dbstore1001.eqiad.wmnet (**PASS**)   - Downtimed host on Ic...
[06:54:39] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Remove puppet references from dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/545441 (https://phabricator.wikimedia.org/T236227)
[06:55:12] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Remove production DNS for dbstore1001 [dns] - 10https://gerrit.wikimedia.org/r/545442 (https://phabricator.wikimedia.org/T236227)
[06:55:13] <icinga-wm>	 PROBLEM - PHP7 rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1311 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[06:55:23] <effie>	 ^ looking 
[06:55:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Remove puppet references from dbstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/545441 (https://phabricator.wikimedia.org/T236227) (owner: 10Marostegui)
[06:55:50] <wikibugs>	 (03PS2) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805)
[06:56:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS for dbstore1001 [dns] - 10https://gerrit.wikimedia.org/r/545442 (https://phabricator.wikimedia.org/T236227) (owner: 10Marostegui)
[06:57:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Marostegui) a:03Jclark-ctr
[06:57:41] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi)
[06:57:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Marostegui) Host ready for #dc-ops steps
[06:57:49] <wikibugs>	 (03PS3) 10Ayounsi: Depool esams for onsite work [dns] - 10https://gerrit.wikimedia.org/r/545440 (https://phabricator.wikimedia.org/T235805)
[06:57:55] <effie>	 !log Depooling mw1317
[06:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:12] <XioNoX>	 !log depool esams - T235805
[06:59:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:16] <stashbot>	 T235805: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805
[07:02:36] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime
[07:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:46] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:52] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10ops-monitoring-bot) Icinga downtime for 5:00:00 set by ayounsi@cumin1001 on 28 host(s) and their services with reason: Onsite work ` bast3002.wikimedia.org,cp[3007-30...
[07:04:02] <icinga-wm>	 ACKNOWLEDGEMENT - PHP7 rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.064 second response time Effie Mouzeli Host has been depooled, checking https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:04:47] <marostegui>	 !log Enable slow query log 1/10 on db1089 (enwiki) T223151
[07:04:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:51] <stashbot>	 T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[07:05:37] <XioNoX>	 !log redirect ns2 to eqiad - T235805
[07:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:41] <stashbot>	 T235805: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805
[07:09:51] <wikibugs>	 (03CR) 10Muehlenhoff: puppetdb: enable multiple service urls and command_broadcast (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond)
[07:10:37] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 45.41 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:12:02] <wikibugs>	 (03PS1) 10Ayounsi: Add mgmt IPs for esams scs and asw2 [dns] - 10https://gerrit.wikimedia.org/r/545444 (https://phabricator.wikimedia.org/T235805)
[07:16:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10jijiki) `ps1-a6-eqiad` is shown as down in icinga, I believe that is expected?
[07:27:58] <wikibugs>	 (03CR) 10Muehlenhoff: CI rspec: update puppet version used in spec tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond)
[07:28:24] <hashar>	 !log logstash: refreshing index fields for logstash-* indices (via https://logstash.wikimedia.org/app/kibana#/management/kibana/indices/logstash-* ) # T234564
[07:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:29] <stashbot>	 T234564: Logstash discards messages from MediaWiki if they contain uncommon keys in the $context array - https://phabricator.wikimedia.org/T234564
[07:30:41] <XioNoX>	 !log powering down cr2-esams for relocation
[07:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s6 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9446 and previous config saved to /var/cache/conftool/dbconfig/20191023-073556-marostegui.json
[07:36:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:01] <stashbot>	 T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018
[07:37:04] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[07:38:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s6 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9447 and previous config saved to /var/cache/conftool/dbconfig/20191023-073831-marostegui.json
[07:38:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:23] <wikibugs>	 (03PS1) 10Ema: ATS: use TLS and DNS discovery to connect to kibana [puppet] - 10https://gerrit.wikimedia.org/r/545445 (https://phabricator.wikimedia.org/T210411)
[07:39:39] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10MoritzMuehlenhoff) 05Resolved→03Open Reopening, currently the same key is used in Cloud VPS and production, which is a security risk.
[07:46:50] <XioNoX>	 !log powering down cr2-esams for relocation (for real this time)
[07:46:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s7 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9448 and previous config saved to /var/cache/conftool/dbconfig/20191023-074828-marostegui.json
[07:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:34] <stashbot>	 T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018
[07:49:20] <librenms-wmf>	 04Critical Alert for device cr2-esams.wikimedia.org - Emergency syslog message
[07:51:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s7 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9449 and previous config saved to /var/cache/conftool/dbconfig/20191023-075106-marostegui.json
[07:51:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:49] <wikibugs>	 (03PS1) 10Ema: Add graphite.discovery.wmnet pointing to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/545470 (https://phabricator.wikimedia.org/T210411)
[07:54:20] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Emergency syslog message
[07:55:08] <godog>	 !log kafka-logging delete unused topic syslog-notice
[07:55:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:53] <ema>	 XioNoX: this is expected, right?
[07:56:06] <XioNoX>	 yep
[07:56:13] <ema>	 alright, thanks
[07:56:24] <XioNoX>	 esams is depooled
[07:56:36] <XioNoX>	 everything amsterdam related is expected at this point :)
[07:56:52] <ema>	 yeah, I've been following the traffic rampup on the eqiad LVSs :)
[07:56:53] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:57:15] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:03:34] <wikibugs>	 (03PS3) 10Muehlenhoff: Extend wmf-userschema for additional MFA options [puppet] - 10https://gerrit.wikimedia.org/r/543402
[08:04:41] <wikibugs>	 (03PS1) 10Ema: graphite: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545494 (https://phabricator.wikimedia.org/T210411)
[08:06:43] <wikibugs>	 (03PS1) 10Ema: secret: dummy key for graphite [labs/private] - 10https://gerrit.wikimedia.org/r/545495 (https://phabricator.wikimedia.org/T210411)
[08:07:18] <librenms-wmf>	 04Critical Alert for device cr2-esams.wikimedia.org - Juniper alarm active
[08:09:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s8 codfw - T231018', diff saved to https://phabricator.wikimedia.org/P9450 and previous config saved to /var/cache/conftool/dbconfig/20191023-080857-marostegui.json
[08:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:04] <stashbot>	 T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018
[08:10:42] <wikibugs>	 (03PS1) 10Ema: graphite: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411)
[08:10:50] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbmonitor: Deploy git repo as mwdeploy, otherwise no write permission [puppet] - 10https://gerrit.wikimedia.org/r/545282 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[08:11:01] <wikibugs>	 (03PS2) 10Jcrespo: dbmonitor: Deploy git repo as mwdeploy, otherwise no write permission [puppet] - 10https://gerrit.wikimedia.org/r/545282 (https://phabricator.wikimedia.org/T224589)
[08:11:39] <godog>	 !log kafka-logging eqiad set 12 partitions for ^mwlog- ^logback- and eqiad.client.error topics
[08:11:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:27] <wikibugs>	 (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for graphite [labs/private] - 10https://gerrit.wikimedia.org/r/545495 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[08:14:47] <wikibugs>	 (03CR) 10Ema: "pcc looks fine: https://puppet-compiler.wmflabs.org/compiler1002/19010/" [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[08:19:49] <wikibugs>	 (03PS1) 10Vgutierrez: ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887)
[08:20:46] <wikibugs>	 (03Abandoned) 10Jbond: puppet-merge: switch to GitPython [puppet] - 10https://gerrit.wikimedia.org/r/544922 (owner: 10Jbond)
[08:21:01] <wikibugs>	 (03PS2) 10Jcrespo: dbmonitor: Install the right apache modules for buster [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589)
[08:22:32] <wikibugs>	 (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/19011/" [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez)
[08:22:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights to x100 on s8 eqiad - T231018', diff saved to https://phabricator.wikimedia.org/P9451 and previous config saved to /var/cache/conftool/dbconfig/20191023-082246-marostegui.json
[08:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:53] <stashbot>	 T231018: specify group (api/vslow/etc) weights in terms of 0..100 instead of 0..1 - https://phabricator.wikimedia.org/T231018
[08:23:01] <godog>	 !log roll restart logstash in codfw/eqiad to pick up new kafka partitions
[08:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:25] <wikibugs>	 (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1002/19012/" [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[08:23:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbmonitor: Install the right apache modules for buster [puppet] - 10https://gerrit.wikimedia.org/r/545286 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[08:24:51] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:26:20] <wikibugs>	 (03PS4) 10Jbond: CI rspec: update puppet version used in spec tests [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070)
[08:28:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2091:3312 after table compression', diff saved to https://phabricator.wikimedia.org/P9452 and previous config saved to /var/cache/conftool/dbconfig/20191023-082826-marostegui.json
[08:28:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:29] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond)
[08:31:35] <wikibugs>	 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) 🤔 ` Notice: /Stage[main]/Httpd/Httpd::Conf[defaults]/File[/etc/apache2/conf-enabled/00-defaults.conf]/ensure: created Info: /Stage[main]/Httpd/Httpd::Conf[defaults]/File[/etc/apac...
[08:31:41] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "minor comment inline, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe)
[08:32:44] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe)
[08:34:50] <wikibugs>	 (03PS2) 10Ema: graphite: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411)
[08:34:53] <wikibugs>	 (03PS1) 10Ema: prometheus: aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500
[08:39:39] <hashar>	 grr
[08:40:10] <wikibugs>	 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10MoritzMuehlenhoff) I think I know what happened: Initially the Puppet code was wrong and libapache-mod-php didn't get installed (which needs mpm_prefork). But "apache" still got installed...
[08:42:26] <godog>	 !log roll restart rsyslog on cirrus and wdqs hosts to pick up changes to logback topic partitions
[08:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:29] <wikibugs>	 (03PS2) 10Vgutierrez: ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887)
[08:45:57] <wikibugs>	 (03CR) 10Ema: [C: 03+1] ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez)
[08:46:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: Send "100 continue" responses on the ats-tls instance [puppet] - 10https://gerrit.wikimedia.org/r/545499 (https://phabricator.wikimedia.org/T234887) (owner: 10Vgutierrez)
[08:46:46] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "almost good!" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe)
[08:49:06] <wikibugs>	 (03CR) 10Jbond: puppetdb: enable multiple service urls and command_broadcast (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond)
[08:50:47] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[08:50:48] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:50:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:16] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: use TLS and DNS discovery to connect to kibana [puppet] - 10https://gerrit.wikimedia.org/r/545445 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[08:53:32] <wikibugs>	 (03PS2) 10Ema: ATS: use TLS and DNS discovery to connect to kibana [puppet] - 10https://gerrit.wikimedia.org/r/545445 (https://phabricator.wikimedia.org/T210411)
[08:54:07] <moritzm>	 !log installing systemd bugfix update on mw canaries
[08:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:25] <icinga-wm>	 RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 77259 bytes in 0.211 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[08:56:11] <icinga-wm>	 PROBLEM - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:58:50] <wikibugs>	 (03PS9) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655)
[08:59:19] <wikibugs>	 (03CR) 10Ema: [C: 03+2] graphite: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545494 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[08:59:41] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[08:59:42] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:26] <moritzm>	 !log rebooting logstash2021 for some firmware tests
[09:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:39] <wikibugs>	 (03CR) 10Ema: [C: 03+2] graphite: add TLS termination with envoy [puppet] - 10https://gerrit.wikimedia.org/r/545496 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[09:01:56] <wikibugs>	 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) That was more or less what I tried before, but it installs event version rather than prefork. Just to be sure, I tried your exact purges again, and I got the same error:   ` Notic...
[09:02:29] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Add graphite.discovery.wmnet pointing to graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/545470 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[09:02:31] <wikibugs>	 (03CR) 10Volans: wdqs: add data-reload cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe)
[09:04:34] <wikibugs>	 10Operations, 10Patch-For-Review: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) I ran manually `a2dismod mpm_event` and now it worked. I will check if this happens again on a clean install of dbmonitor1001 and add code to handle it.
[09:05:25] <wikibugs>	 (03PS1) 10Ema: ATS: use TLS and DNS discovery to connect to graphite [puppet] - 10https://gerrit.wikimedia.org/r/545504 (https://phabricator.wikimedia.org/T210411)
[09:06:39] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[09:07:11] <icinga-wm>	 PROBLEM - HTTP-dbtree on dbmonitor2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 212 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:07:56] <vgutierrez>	 jynus: ^^
[09:09:03] <jynus>	 yes, sorry, it got enabled and then the ack was gone
[09:09:12] <jynus>	 you can ignore it, I will downtime it again, sorry
[09:19:59] <icinga-wm>	 PROBLEM - Host checker.tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (checker.tools.wmflabs.org)
[09:21:37] <marostegui>	 arturo: ^
[09:21:58] <arturo>	 known, thanks marostegui 
[09:22:07] <marostegui>	 thanks :)
[09:23:07] <icinga-wm>	 RECOVERY - Host checker.tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms
[09:24:43] <wikibugs>	 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) Now we "only" need to fix the php, which I would prefer not to, not because it would be difficult, but because it would be a waste of time, and I would prefer to create a simple flash + d3 microsite, sp...
[09:26:29] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "And due to -XX:G1NewSizePercent=15  , the Eden space would grow from 3G to 4,8G  which is probably fine :]" [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn)
[09:27:36] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 9: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond)
[09:27:39] <wikibugs>	 (03PS5) 10Vgutierrez: ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803)
[09:27:59] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1317 is CRITICAL: Host mw1317 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:28:14] <effie>	 ^ that is ok 
[09:28:16] <effie>	 I will ack 
[09:28:38] <effie>	 jouncebot next 
[09:28:38] <jouncebot>	 In 1 hour(s) and 31 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1100)
[09:28:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond)
[09:28:58] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10hashar) p:05Normal→03Low Adding the few project tags we are using nowadays.  Lowering priority since clearly w...
[09:30:00] <wikibugs>	 (03PS10) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/543845 (https://phabricator.wikimedia.org/T235655)
[09:31:01] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: Deploy acme-chief version of the unified certificate globally [puppet] - 10https://gerrit.wikimedia.org/r/545208 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez)
[09:31:41] <godog>	 !log bump rsyslog-notice topic to 6 partitions
[09:31:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:37] <icinga-wm>	 RECOVERY - HTTP-dbtree on dbmonitor2001 is OK: HTTP OK: HTTP/1.1 200 OK - 2391 bytes in 1.029 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:36:09] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto)
[09:36:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Just for fun, let's do one fleet PCC run. How bad could it be? :P" [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto)
[09:40:42] <wikibugs>	 (03PS1) 10Jcrespo: mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589)
[09:40:42] <godog>	 !log roll restart logstash to pick up new rsyslog-notice partitions
[09:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:25] <godog>	 !log bounce burrow-logging-eqiad.service on kafkamon1001
[09:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:00] <wikibugs>	 (03PS2) 10Jcrespo: mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589)
[09:46:13] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: use TLS and DNS discovery to connect to graphite [puppet] - 10https://gerrit.wikimedia.org/r/545504 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[09:46:49] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Provide an easy way of picking the traffic serving TLS certificate used by ATS - https://phabricator.wikimedia.org/T234803 (10Vgutierrez) >>! In T234803#5583888, @BBlack wrote: > Notes from IRC, etc: >  > The current patch (merging shortly: https://gerrit.wikimedi...
[09:49:23] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm)
[09:51:41] <wikibugs>	 (03CR) 10Volans: "Some comments inline, mostly minor or suggestions" (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe)
[09:52:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1" [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto)
[09:53:47] <wikibugs>	 (03PS1) 10Jbond: puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/545507 (https://phabricator.wikimedia.org/T235655)
[09:54:36] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Gilles) @elitre it should be its own task, since it's a PDF failing to render and thi...
[09:54:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545507 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond)
[09:55:14] <wikibugs>	 (03PS4) 10Muehlenhoff: Extend wmf-userschema for additional MFA options [puppet] - 10https://gerrit.wikimedia.org/r/543402
[09:56:16] <wikibugs>	 (03PS3) 10Jcrespo: mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589)
[09:57:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend wmf-userschema for additional MFA options [puppet] - 10https://gerrit.wikimedia.org/r/543402 (owner: 10Muehlenhoff)
[09:58:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "wow!" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[10:02:28] <wikibugs>	 (03PS1) 10Ema: ATS: pass --enable-reload to tslua as the last argument [puppet] - 10https://gerrit.wikimedia.org/r/545508
[10:02:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> and this would de facto make the install candidate the most recent version in the series at any given time." [puppet] - 10https://gerrit.wikimedia.org/r/544964 (https://phabricator.wikimedia.org/T235675) (owner: 10Alexandros Kosiaris)
[10:03:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[10:04:52] <ema>	 !log cp1075: ats-backend-restart to test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545508/
[10:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:39] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[10:09:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: enable multiple service urls and command_broadcast [puppet] - 10https://gerrit.wikimedia.org/r/545507 (https://phabricator.wikimedia.org/T235655) (owner: 10Jbond)
[10:10:07] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mysql: Migrate away from long-deprecated mysql module to mysqli [software/dbtree] - 10https://gerrit.wikimedia.org/r/545505 (https://phabricator.wikimedia.org/T224589) (owner: 10Jcrespo)
[10:11:03] <jynus>	 !log deploying new version of dbtree T224589
[10:11:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:07] <stashbot>	 T224589: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589
[10:11:44] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: pass --enable-reload to tslua as the last argument [puppet] - 10https://gerrit.wikimedia.org/r/545508 (owner: 10Ema)
[10:11:54] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: pass --enable-reload to tslua as the last argument [puppet] - 10https://gerrit.wikimedia.org/r/545508 (owner: 10Ema)
[10:13:47] <jynus>	 !log reverting dbtree revision to HEAD~1 T224589
[10:13:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:28] <wikibugs>	 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Legacy (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yair_rand) >>! In T211881#5509001, @dr0ptp4kt wrote: > Hello all. We're going to turn this into a client-s...
[10:19:55] <wikibugs>	 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) `lines=10 [Wed Oct 23 10:17:48.055752 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33 [W...
[10:21:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Also use wmf-user LDAP schema on "labs" LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/545512
[10:21:49] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Elitre) OK, will file separately then, TY.
[10:24:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac)
[10:25:03] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on mw1317 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli Server was misbehaving, TBD what we'll do https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:25:03] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1317 is CRITICAL: Host mw1317 is not in mediawiki-installation dsh group Effie Mouzeli Server was misbehaving, TBD what we'll do https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[10:25:26] <wikibugs>	 10Operations, 10Gerrit: Editing in Gerrit isn't saved after the update/migration to gerrit1001 - https://phabricator.wikimedia.org/T236143 (10MoritzMuehlenhoff) 05Open→03Invalid I can't reproduce this any longer, maybe it got resolved with the subsequent Gerrit restart to bump the Java size or similar. Clo...
[10:29:36] <wikibugs>	 (03Abandoned) 10Jbond: ulogd: filter out etcd broadcast messages [puppet] - 10https://gerrit.wikimedia.org/r/543149 (owner: 10Jbond)
[10:30:30] <wikibugs>	 (03PS2) 10Muehlenhoff: Also use wmf-user LDAP schema on "labs" LDAP servers [puppet] - 10https://gerrit.wikimedia.org/r/545512
[10:30:34] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/19015/" [puppet] - 10https://gerrit.wikimedia.org/r/545512 (owner: 10Muehlenhoff)
[10:33:23] <wikibugs>	 (03PS1) 10Ema: ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274)
[10:33:59] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema)
[10:34:12] <wikibugs>	 (03PS2) 10Ema: ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274)
[10:34:35] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) 05Resolved→03Open
[10:35:27] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: do not pass enable-reload to tslua [puppet] - 10https://gerrit.wikimedia.org/r/545522 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema)
[10:35:33] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274 (10Vgutierrez) The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug on ATS. after a config reload the lua script loses the argtb
[10:39:57] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: lvs::configuration: do not assume port 80 by default [puppet] - 10https://gerrit.wikimedia.org/r/545525
[10:39:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::discovery::client: fix services file [puppet] - 10https://gerrit.wikimedia.org/r/545526
[10:42:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::discovery::client: fix services file [puppet] - 10https://gerrit.wikimedia.org/r/545526 (owner: 10Giuseppe Lavagetto)
[10:44:09] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Elitre)
[10:46:04] <ema>	 !log cp-ats: rolling ATS backend restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545522/ T233274
[10:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:08] <stashbot>	 T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274
[10:46:18] <wikibugs>	 10Operations, 10DBA, 10Traffic, 10WMF-Legal, and 3 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499 (10jcrespo)
[10:46:59] <wikibugs>	 10Operations: Migrate dbmonitor hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) So normally the fix for the above would be trivial, but the design decisions of making sql class a singleton are in my opinion not worth fixing, because it would force to either a deeper refactoring o...
[10:49:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/545512 (owner: 10Muehlenhoff)
[10:51:02] <wikibugs>	 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10Volans) There are also local modifications in the private repo fwiw.
[10:51:46] <wikibugs>	 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10Volans) p:05Triage→03High
[10:51:52] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mysql: Migrate away from long-deprecated mysql module to mysqli" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545531
[10:52:08] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "mysql: Migrate away from long-deprecated mysql module to mysqli" [software/dbtree] - 10https://gerrit.wikimedia.org/r/545531 (owner: 10Jcrespo)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1100).
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:00:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: don't use haproxy base module [puppet] - 10https://gerrit.wikimedia.org/r/545532 (https://phabricator.wikimedia.org/T236074)
[11:01:01] <Urbanecm>	 I'll deploy a patch
[11:01:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212) (owner: 10DannyS712)
[11:02:03] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgArticleCountMethod to 'any' for frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545424 (https://phabricator.wikimedia.org/T236212) (owner: 10DannyS712)
[11:02:53] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[11:03:25] <hauskater>	 Urbanecm: what about https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/544882/ ?
[11:03:45] <hauskater>	 never did InterwikiSortOrders patches fwiw so I'm not able to CR
[11:04:05] <Urbanecm>	 hauskater: can do that as well :)
[11:04:30] <wikibugs>	 (03PS3) 10Urbanecm: Add custom Minerva wordmark for Hebrew wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: 10Ammarpad)
[11:04:38] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: 10Ammarpad)
[11:05:43] <wikibugs>	 (03Merged) 10jenkins-bot: Add custom Minerva wordmark for Hebrew wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542660 (https://phabricator.wikimedia.org/T234278) (owner: 10Ammarpad)
[11:06:00] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: logstash: support both mediawiki and parsoid-php types [puppet] - 10https://gerrit.wikimedia.org/r/545534
[11:06:13] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: cf8e2f1: Set $wgArticleCountMethod to any for frwikiquote (T236212) (duration: 01m 12s)
[11:06:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:17] <stashbot>	 T236212: Set $wgArticleCountMethod to 'any' for frwikiquote - https://phabricator.wikimedia.org/T236212
[11:06:53] <Urbanecm>	 git fetch
[11:07:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add Balinese to interwiki sort orders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544882 (https://phabricator.wikimedia.org/T234768) (owner: 10Jon Harald Søby)
[11:07:25] <hauskater>	 wrong window :)
[11:07:47] <wikibugs>	 (03Merged) 10jenkins-bot: Add Balinese to interwiki sort orders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544882 (https://phabricator.wikimedia.org/T234768) (owner: 10Jon Harald Søby)
[11:09:47] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright: SWAT: 0889da0: Add custom Minerva wordmark for Hebrew wikivoyage (1/2; T234278) (duration: 01m 01s)
[11:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:52] <stashbot>	 T234278: Add localized Wikivoyage wordmark to the Hebrew mobile frontend - https://phabricator.wikimedia.org/T234278
[11:11:05] <Urbanecm>	 hauskater: thanks, fixed :D
[11:12:23] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:13:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:14:07] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:14:09] <icinga-wm>	 PROBLEM - Host 91.198.174.122 is DOWN: CRITICAL - Time to live exceeded (91.198.174.122)
[11:15:03] <icinga-wm>	 RECOVERY - Host 91.198.174.122 is UP: PING WARNING - Packet loss = 64%, RTA = 81.78 ms
[11:15:25] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[11:15:40] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on maerlant is CRITICAL: connect to address 91.198.174.122 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T236244 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[11:15:44] <wikibugs>	 10Operations, 10ops-esams: Degraded RAID on maerlant - https://phabricator.wikimedia.org/T236244 (10ops-monitoring-bot)
[11:15:52] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:16:02] <icinga-wm>	 PROBLEM - Host 91.198.174.122 is DOWN: CRITICAL - Time to live exceeded (91.198.174.122)
[11:16:02] <icinga-wm>	 PROBLEM - Host 91.198.174.106 is DOWN: CRITICAL - Time to live exceeded (91.198.174.106)
[11:16:16] <Urbanecm>	 Not sure if that's related, but I was just kicked from the deployment host when connected via bast3002
[11:16:34] <icinga-wm>	 RECOVERY - Host 91.198.174.122 is UP: PING OK - Packet loss = 0%, RTA = 83.45 ms
[11:16:34] <Urbanecm>	 when connecting via bast1002, everything works correctly
[11:16:36] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 0889da0: Add custom Minerva wordmark for Hebrew wikivoyage (2/2; T234278) (duration: 01m 01s)
[11:16:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:40] <stashbot>	 T234278: Add localized Wikivoyage wordmark to the Hebrew mobile frontend - https://phabricator.wikimedia.org/T234278
[11:16:43] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: don't use haproxy base module [puppet] - 10https://gerrit.wikimedia.org/r/545532 (https://phabricator.wikimedia.org/T236074)
[11:17:11] <hauskater>	 Urbanecm: I've got a report that there's trouble restoring a file on commons
[11:17:12] <icinga-wm>	 RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:17:20] <icinga-wm>	 RECOVERY - Host 91.198.174.106 is UP: PING OK - Packet loss = 0%, RTA = 83.63 ms
[11:17:31] <hauskater>	 "Error restaurando archivo: El archivo «mwstore://local-multiwrite/local-public/8/89/Premios_L´Oreal_(012).jpg» se encuentra en un estado incoherente dentro de los sistemas de almacenamiento interno"
[11:18:02] <hauskater>	 Error restoring file. The file <<>> is in an incoherent state in our internal storage system
[11:18:07] <Urbanecm>	 that's interesting
[11:18:08] <hauskater>	 (rough translation)
[11:18:10] <wikibugs>	 (03PS2) 10Ema: prometheus: fix aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500
[11:18:29] <hauskater>	 anything in Logstash? Maybe we can port the error via Phatality
[11:18:53] <Urbanecm>	 !log mwscript updateArticleCount.php --wiki=frwikiquote --update (T236212)
[11:18:55] <Urbanecm>	 hauskater: looking
[11:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:57] <stashbot>	 T236212: Set $wgArticleCountMethod to 'any' for frwikiquote - https://phabricator.wikimedia.org/T236212
[11:18:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: fix aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500 (owner: 10Ema)
[11:19:11] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: haproxy: don't use haproxy base module [puppet] - 10https://gerrit.wikimedia.org/r/545532 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez)
[11:19:46] <Urbanecm>	 hauskater: for which timeframe should I look?
[11:19:55] <hauskater>	 last hour
[11:20:01] <hauskater>	 file name would help?
[11:20:12] <hauskater>	 https://commons.wikimedia.org/wiki/User_talk:MarcoAurelio#Error_al_restaurar
[11:20:18] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[11:20:18] <Urbanecm>	 thx hauskater 
[11:20:34] <hauskater>	 File:Premios_L´Oreal_(012).jpg
[11:20:37] <hauskater>	 that's the file name
[11:20:38] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 83.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:20:40] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:21:44] <Urbanecm>	 thx hauskater 
[11:22:26] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InterwikiSortOrders.php: SWAT: e21054e: Add Balinese to interwiki sort orders (T234768) (duration: 01m 01s)
[11:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:41] <stashbot>	 T234768: Create Balinese Wikipedia - https://phabricator.wikimedia.org/T234768
[11:23:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545534 (owner: 10Giuseppe Lavagetto)
[11:24:13] <Urbanecm>	 !log EU SWAT done
[11:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:03] <XioNoX>	 !log powering down cr1-esams
[11:26:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:55] <Urbanecm>	 hauskater: still no match
[11:27:06] <hauskater>	 I'll file a task
[11:28:19] <Urbanecm>	 thanks
[11:29:58] <hauskater>	 Done
[11:30:15] <hauskater>	 Urbanecm: maybe I can try and you can take a look at the logs to see if anything pops up, to help you?
[11:30:39] <Urbanecm>	 hauskater: that would be a solution 
[11:30:44] <hauskater>	 ok
[11:31:38] <hauskater>	 error
[11:31:40] <hauskater>	 Error undeleting file: The file "mwstore://local-multiwrite/local-public/8/89/Premios_L´Oreal_(012).jpg" is in an inconsistent state within the internal storage backends
[11:34:03] <hauskater>	 Urbanecm: ^ & T236246
[11:34:05] <stashbot>	 T236246: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246
[11:34:08] <Urbanecm>	 ack
[11:35:58] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 (10MarcoAurelio)
[11:41:19] <Urbanecm>	 hauskater: it says `FileBackendMultiWrite::doOperationsInternal: failed sync check: ["mwstore://local-multiwrite/local-deleted/g/a/y/gayjvm1vy8agetj61ckj0suucb097z3.jpg","mwstore://local-multiwrite/local-public/8/89/Premios_L\u00b4Oreal_(012).jpg"]`
[11:41:46] <hauskater>	 mwlog1001?
[11:42:15] <hauskater>	 sounds pretty similar to the error I got via the UI
[11:43:59] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 (10Urbanecm) Logstash: https://logstash.wikimedia.org/goto/fb4822e96e27da8bfc3bb8273f4e6132  `  2019-10-2...
[11:44:01] <Urbanecm>	 logstash
[11:44:02] <Urbanecm>	 https://logstash.wikimedia.org/goto/fb4822e96e27da8bfc3bb8273f4e6132
[11:51:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: haproxy: tell the service to load all config files [puppet] - 10https://gerrit.wikimedia.org/r/545541 (https://phabricator.wikimedia.org/T236074)
[11:51:18] <hauskater>	 thanks Urbanecm 
[11:51:25] <Urbanecm>	 yw hauskater 
[11:51:30] <hauskater>	 hopefully someone will be able to take a look and unbreak
[11:51:51] <hauskater>	 but looking at the workboard... makes me pessimistic ;)
[11:53:01] <hauskater>	 Curiously I can see the deleted file via the special:undelete UI
[11:53:09] <hauskater>	 so... it's not like it's corrupted or missing
[11:53:35] <Urbanecm>	 https://www.irccloud.com/pastebin/Pgr0H0Uk/
[11:53:39] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: haproxy: tell the service to load all config files [puppet] - 10https://gerrit.wikimedia.org/r/545541 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez)
[11:53:58] <Urbanecm>	 I'd say this code shows the error we're seeing
[11:54:09] <Urbanecm>	 in includes\libs\filebackend\FileBackendMultiWrite.php
[11:54:39] <hauskater>	 Sounds like it might be
[11:54:45] * hauskater lunch
[11:54:48] <Urbanecm>	 k
[11:57:34] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul)
[11:58:10] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error restoring file on Wikimedia Commons: "File:Premios L´Oreal (012).jpg" - https://phabricator.wikimedia.org/T236246 (10Ezarate) Thank you Marco and another volunteers, the OTRS ticket 2019101810005971 is licensing the fil...
[12:00:05] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1200)
[12:02:22] <icinga-wm>	 PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:02:30] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:02:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:02:40] <icinga-wm>	 PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:02:54] <XioNoX>	 all of those are known and no big deal  ^ the downtime expired
[12:02:54] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:03:03] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Reedy)
[12:03:08] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[12:03:34] <icinga-wm>	 PROBLEM - PyBal BGP sessions are established on lvs3003 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams+prometheus/ops
[12:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[12:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP Ayounsi Onsite work https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:08:41] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul)
[12:09:37] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul)
[12:10:30] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul)
[12:19:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "INFO: [change] Nodes: 14 FAIL 366 ERROR" [puppet] - 10https://gerrit.wikimedia.org/r/380304 (owner: 10Giuseppe Lavagetto)
[12:19:54] <wikibugs>	 (03PS1) 10Ayounsi: Rename cr1-esams to cr3-esams [dns] - 10https://gerrit.wikimedia.org/r/545544 (https://phabricator.wikimedia.org/T235805)
[12:26:38] <icinga-wm>	 RECOVERY - BGP status on cr1-esams is OK: BGP OK - up: 0, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:26:48] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[12:27:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1130 from the special slaves group on s5 and leave it back with its original pooling options  T223151', diff saved to https://phabricator.wikimedia.org/P9454 and previous config saved to /var/cache/conftool/dbconfig/20191023-122708-marostegui.json
[12:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:14] <stashbot>	 T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[12:31:22] <vgutierrez>	 !log restarting ats-tls on cache text nodes - T233274
[12:31:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:26] <stashbot>	 T233274: ATS lua script reload doesn't work as expected - https://phabricator.wikimedia.org/T233274
[12:32:32] <icinga-wm>	 RECOVERY - PyBal BGP sessions are established on lvs3003 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=esams+prometheus/ops
[12:33:23] <wikibugs>	 (03PS1) 10Ayounsi: Rename cr1-esams to cr3-esams (same IP, new box) [puppet] - 10https://gerrit.wikimedia.org/r/545546 (https://phabricator.wikimedia.org/T235805)
[12:34:08] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Rename cr1-esams to cr3-esams [dns] - 10https://gerrit.wikimedia.org/r/545544 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi)
[12:34:37] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add mgmt IPs for esams scs and asw2 [dns] - 10https://gerrit.wikimedia.org/r/545444 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi)
[12:34:51] <wikibugs>	 (03PS2) 10Ayounsi: Add mgmt IPs for esams scs and asw2 [dns] - 10https://gerrit.wikimedia.org/r/545444 (https://phabricator.wikimedia.org/T235805)
[12:36:36] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp1075 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS
[12:36:56] <wikibugs>	 (03PS2) 10Ayounsi: Rename cr1-esams to cr3-esams [dns] - 10https://gerrit.wikimedia.org/r/545544 (https://phabricator.wikimedia.org/T235805)
[12:37:26] <effie>	 !log Depool mwdebug1002 - T214734
[12:37:28] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Adding context to some of the comments" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe)
[12:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:31] <stashbot>	 T214734: PHP Fatal error:  The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734
[12:37:32] <vgutierrez>	 hmmm checking cp1075
[12:38:10] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Rename cr1-esams to cr3-esams (same IP, new box) [puppet] - 10https://gerrit.wikimedia.org/r/545546 (https://phabricator.wikimedia.org/T235805) (owner: 10Ayounsi)
[12:39:11] <vgutierrez>	 oh right.. 1 failure being on 3/3 warnings and got reported
[12:39:23] <vgutierrez>	 cause I've restarted trafficserver-tls on that node
[12:39:39] <vgutierrez>	 bad timing :)
[12:44:28] <icinga-wm>	 PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100%
[12:45:36] <wikibugs>	 (03PS1) 10Ottomata: Include hadoop client packages and config on dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229)
[12:46:03] <wikibugs>	 10Operations, 10serviceops: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10jijiki)
[12:46:55] <wikibugs>	 (03CR) 10Ottomata: Include hadoop client packages and config on dumps distribution servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata)
[12:49:43] <wikibugs>	 (03PS3) 10Jbond: apereo_cas: add ability to use groovy script to determine MFA [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937)
[12:50:40] <wikibugs>	 (03CR) 10Jbond: "updated with new LDAP paramaters" [puppet] - 10https://gerrit.wikimedia.org/r/539336 (https://phabricator.wikimedia.org/T233937) (owner: 10Jbond)
[12:51:58] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:54:00] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:00:04] <jouncebot>	 liw and brennen: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - European Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1300).
[13:01:29] <wikibugs>	 (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545557
[13:01:31] <wikibugs>	 (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545557 (owner: 10Lars Wirzenius)
[13:01:59] <wikibugs>	 (03PS1) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558
[13:02:26] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545557 (owner: 10Lars Wirzenius)
[13:02:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (owner: 10Effie Mouzeli)
[13:02:52] <icinga-wm>	 RECOVERY - Juniper alarms on cr2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[13:04:24] <logmsgbot>	 !log ssastry@deploy1001 Started deploy [parsoid/deploy@451db1e]: Updating Parsoid to 5521ea74; Dummy Parsoid deploy to debug Parsoid/PHP deployment issues
[13:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:46] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:05:59] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.3
[13:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:07:00] <logmsgbot>	 !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.3 (duration: 01m 00s)
[13:07:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:08] <logmsgbot>	 !log ssastry@deploy1001 Finished deploy [parsoid/deploy@451db1e]: Updating Parsoid to 5521ea74; Dummy Parsoid deploy to debug Parsoid/PHP deployment issues (duration: 08m 44s)
[13:13:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:08] <wikibugs>	 (03CR) 10Ema: [C: 03+2] prometheus: fix aggregation rule for ats-be availability [puppet] - 10https://gerrit.wikimedia.org/r/545500 (owner: 10Ema)
[13:28:02] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-kibana.state on multatuli is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd
[13:34:01] <effie>	 !log disable puppet on mwdebug1002 - T214734
[13:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:05] <stashbot>	 T214734: PHP Fatal error:  The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp on mwdebug1002) - https://phabricator.wikimedia.org/T214734
[13:35:03] <XioNoX>	 !log migrate esams mgmt to new mgmt router
[13:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:51] <hashar>	 liw:  I am going to restart the CI Jenkins soonish
[13:37:58] <hashar>	 that might interfere with the train
[13:38:36] <icinga-wm>	 PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:56] <icinga-wm>	 PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:28] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:40:08] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:41:42] <icinga-wm>	 PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:41:58] <icinga-wm>	 PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[13:43:19] <librenms-wmf>	 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80%
[13:46:02] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi)
[13:46:05] <liw>	 hashar, I've deployed to group1, not currently deploying
[13:46:07] <wikibugs>	 (03PS2) 10BBlack: geodns: eqiad non-primary for all public users [dns] - 10https://gerrit.wikimedia.org/r/545385 (https://phabricator.wikimedia.org/T235805)
[13:46:13] <hashar>	 liw: great :)
[13:46:19] <wikibugs>	 (03CR) 10Jcrespo: "+1 for the mysql module process, I have yet to have a look at the bacula one, which Alex should also weight in." [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond)
[13:48:57] <wikibugs>	 (03PS1) 10Filippo Giunchedi: DNM: adjust logstash index template for ES 7 [puppet] - 10https://gerrit.wikimedia.org/r/545566 (https://phabricator.wikimedia.org/T235891)
[13:57:03] <wikibugs>	 (03PS2) 10Andrew Bogott: m5 grants: remove grants for 'labtestwiki' database [puppet] - 10https://gerrit.wikimedia.org/r/543955 (https://phabricator.wikimedia.org/T233236)
[13:57:16] <_joe_>	 !log manually changing the symlinked deployed version of parsoid on wtp1025 T236275
[13:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:20] <stashbot>	 T236275: Parsoid-php doesn't get updated after a code deploy - https://phabricator.wikimedia.org/T236275
[13:58:59] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps: stub out the (unused-on-VMs) profile::backup::ferm_directors [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239)
[14:00:58] <hashar>	 !log Restarting CI Jenkins
[14:01:00] <wikibugs>	 (03CR) 10Jcrespo: "Is this really part of T229209? Will help anyway if it isn't, but I don't understand the context." [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239) (owner: 10Andrew Bogott)
[14:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:47] <wikibugs>	 (03CR) 10Ori.livneh: "Sorry, I don't have access to beta. Alex cherry-picked it for me. Alex, can you remove it from there? (And could you add me to the beta pr" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh)
[14:10:45] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-vps: stub out some unused-on-vms puppetmaster bits [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239)
[14:11:22] <wikibugs>	 (03CR) 10Reedy: "Just added you to the beta project as an admin :)" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh)
[14:12:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: stub out some unused-on-vms puppetmaster bits [puppet] - 10https://gerrit.wikimedia.org/r/545567 (https://phabricator.wikimedia.org/T236239) (owner: 10Andrew Bogott)
[14:13:19] <librenms-wmf>	 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80%
[14:14:01] <ema>	 XioNoX: ^
[14:15:39] <wikibugs>	 (03CR) 10Eevans: rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[14:18:41] <wikibugs>	 (03PS1) 10BBlack: Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545570 (https://phabricator.wikimedia.org/T235805)
[14:18:48] <wikibugs>	 10Operations, 10Puppet: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (10jbond) p:05Triage→03Normal
[14:18:57] <wikibugs>	 (03CR) 10BBlack: [V: 03+2 C: 03+2] Revert "Depool esams for onsite work" [dns] - 10https://gerrit.wikimedia.org/r/545570 (https://phabricator.wikimedia.org/T235805) (owner: 10BBlack)
[14:19:03] <wikibugs>	 (03PS1) 10CDanis: Repool esams [dns] - 10https://gerrit.wikimedia.org/r/545571 (https://phabricator.wikimedia.org/T235805)
[14:19:20] <bblack>	 !log repooling esams
[14:19:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:33] <jynus>	 cdanis: beware of the duplicate patch
[14:19:35] <wikibugs>	 (03Abandoned) 10CDanis: Repool esams [dns] - 10https://gerrit.wikimedia.org/r/545571 (https://phabricator.wikimedia.org/T235805) (owner: 10CDanis)
[14:19:37] <wikibugs>	 (03PS1) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277)
[14:19:39] <cdanis>	 indeed
[14:19:43] <jynus>	 ah, you saw it
[14:20:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond)
[14:22:47] <wikibugs>	 (03PS2) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277)
[14:24:02] <icinga-wm>	 RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 83.83 ms
[14:24:52] <icinga-wm>	 RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms
[14:24:58] <icinga-wm>	 RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 83.75 ms
[14:25:08] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:26:19] <XioNoX>	 alright
[14:26:23] <XioNoX>	 finally
[14:26:58] <cdanis>	 \o/
[14:27:20] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%
[14:27:23] <XioNoX>	 FYI, OSPF doesn't want to establish if the MTU is not *exactly* the same on both sides
[14:27:48] <XioNoX>	 I spent 1h trying to figure out what was wrong with my ospf
[14:28:05] <bblack>	 MTU, the gift that will always keep on giving :P
[14:28:38] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 42, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:28:40] <icinga-wm>	 RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.75 ms
[14:31:12] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 55.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:35:17] <wikibugs>	 (03PS1) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[14:35:48] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.decommission
[14:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:07] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:13] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3007.esams.wmnet` -  cp3007.esams.wmnet (**PASS**)   - Downtimed host on Icin...
[14:36:42] <icinga-wm>	 PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[14:37:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:37:30] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.decommission
[14:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:05] <logmsgbot>	 !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[14:38:08] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3008.esams.wmnet` -  cp3008.esams.wmnet (**FAIL**)   - Downtimed host on Icin...
[14:38:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:46] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.decommission
[14:39:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:04] <icinga-wm>	 RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 46, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:40:20] <logmsgbot>	 !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[14:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:24] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ema@cumin1001 for hosts: `cp3010.esams.wmnet` -  cp3010.esams.wmnet (**PASS**)   - Downtimed host on Icin...
[14:40:29] <wikibugs>	 (03PS1) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332)
[14:41:53] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10decommission: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (10ema) >>! In T208585#5599005, @ops-monitoring-bot wrote: >   - **Failed to power off, manual intervention required**: Remote IPMI for cp3008.mgmt.esams.wmnet failed (exit=...
[14:42:15] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+1] rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[14:42:22] <icinga-wm>	 RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 93.22 ms
[14:44:36] <icinga-wm>	 PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:56] <icinga-wm>	 PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:14] <wikibugs>	 (03CR) 10BPirkle: [C: 03+1] rename service definition (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[14:45:30] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 35.94 le 60 Ema Expected eqiad traffic drop due to esams repool https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:46:04] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:46:16] <wikibugs>	 (03PS3) 10Eevans: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851)
[14:46:26] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[14:47:18] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Juniper alarm active
[14:47:52] <icinga-wm>	 PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:48] <wikibugs>	 (03PS1) 10Jhedden: toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579
[14:50:18] <wikibugs>	 (03PS3) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277)
[14:50:36] <wikibugs>	 (03PS2) 10Jbond: puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332)
[14:50:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579 (owner: 10Jhedden)
[14:52:06] <wikibugs>	 (03PS2) 10Jhedden: toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579
[14:52:11] <wikibugs>	 (03PS2) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[14:54:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:55:32] <wikibugs>	 (03PS3) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[14:56:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/545579 (owner: 10Jhedden)
[14:59:57] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] toolforge: add package and service deps to etcd profile [puppet] - 10https://gerrit.wikimedia.org/r/545579 (owner: 10Jhedden)
[15:01:46] <icinga-wm>	 RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.72 ms
[15:01:48] <icinga-wm>	 RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 93.04 ms
[15:01:55] <XioNoX>	 cabling issue ^
[15:02:22] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 76.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:02:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: ingress: fix serviceaccount names [puppet] - 10https://gerrit.wikimedia.org/r/545587 (https://phabricator.wikimedia.org/T236074)
[15:02:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:04:08] <icinga-wm>	 RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.70 ms
[15:09:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: ingress: fix serviceaccount names [puppet] - 10https://gerrit.wikimedia.org/r/545587 (https://phabricator.wikimedia.org/T236074) (owner: 10Arturo Borrero Gonzalez)
[15:10:58] <icinga-wm>	 RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:11:12] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:15:30] <wikibugs>	 (03PS4) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[15:16:16] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437 (10Andrew) 05Open→03Declined Lots of things have changed since I wrote this; closing for now until I'm confused anew in the future.
[15:16:55] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Let's do it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542658 (owner: 10Krinkle)
[15:17:11] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:20:28] <wikibugs>	 (03CR) 10Volans: "Compiler results here:" [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[15:23:46] <wikibugs>	 10Puppet, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10Bstorm) a:05Andrew→03None
[15:24:08] <wikibugs>	 10Puppet, 10cloud-services-team (Kanban): Reduce the effects of puppet breakage on VPS - https://phabricator.wikimedia.org/T226270 (10Bstorm) a:05Andrew→03None
[15:30:34] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10Papaul) @robh there is no dns3003 the last server is bast3003 so  only dns300[1-2]
[15:32:04] <wikibugs>	 (03PS1) 10Papaul: DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599
[15:32:10] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10BBlack) confirming above - @papaul is correct.  The total set of new esams Linux boxes AFAIK is: 16x caches, 3x LVS, 2x DNS, 1x Bastion, 3x Ganeti.
[15:32:11] <marostegui>	 !log Enable slow query log 1/20 on db1089 (enwiki) T223151
[15:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:16] <stashbot>	 T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[15:32:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul)
[15:33:25] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[123] - https://phabricator.wikimedia.org/T236217 (10RobH)
[15:33:27] <wikibugs>	 (03PS5) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[15:33:40] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10RobH)
[15:35:26] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+1] "Looks like a good first step." [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn)
[15:35:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275)
[15:36:03] <icinga-wm>	 PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:36:32] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond)
[15:36:54] <wikibugs>	 10Operations, 10Puppet, 10Proposal: Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10Andrew)
[15:37:17] <wikibugs>	 (03PS2) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253)
[15:37:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) (owner: 10Giuseppe Lavagetto)
[15:39:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli)
[15:40:06] <wikibugs>	 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Designate seems very slow to delete records? - https://phabricator.wikimedia.org/T149057 (10JHedden) 05Open→03Resolved a:03JHedden Record deletes are working as expected now, likely resolved from the OpenStack upgrades and service improvem...
[15:42:23] <icinga-wm>	 RECOVERY - Memory correctable errors -EDAC- on mw1252 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1252&var-datasource=eqiad+prometheus/ops
[15:42:31] <wikibugs>	 (03PS1) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545603
[15:43:17] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[15:43:28] <wikibugs>	 (03PS3) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253)
[15:43:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[15:43:55] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[15:45:14] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10aborrero)
[15:45:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545603 (owner: 10Jbond)
[15:45:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli)
[15:45:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545603 (owner: 10Jbond)
[15:45:46] <wikibugs>	 10Operations: reprepro: automate incoming processing - https://phabricator.wikimedia.org/T215812 (10Andrew)
[15:48:03] <wikibugs>	 (03PS4) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253)
[15:48:36] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[15:50:12] <wikibugs>	 (03PS1) 10Jbond: Revert "jenkins: use correct java version on buster" [puppet] - 10https://gerrit.wikimedia.org/r/545605
[15:51:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "jenkins: use correct java version on buster" [puppet] - 10https://gerrit.wikimedia.org/r/545605 (owner: 10Jbond)
[15:52:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] LVS: add config for parsoid-php service [puppet] - 10https://gerrit.wikimedia.org/r/543243 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn)
[15:52:39] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:54:15] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[15:54:17] <wikibugs>	 (03PS5) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253)
[15:54:31] <wikibugs>	 (03CR) 10Effie Mouzeli: "Looks ok https://puppet-compiler.wmflabs.org/compiler1002/19023/mw1222.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli)
[15:54:35] <wikibugs>	 (03PS6) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166)
[15:55:12] <_joe_>	 !log restarting pybal on lvs2006, then 2003 for picking up parsoid-php
[15:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:22] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn)
[15:56:32] <godog>	 mhh re: the 503s are still there albeit at a smaller rate, looks like from this dashboard that cp3030 might be in trouble? https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-3h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[15:56:37] <icinga-wm>	 RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:56:40] <godog>	 cc bblack ^
[15:56:47] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:57:55] <wikibugs>	 10Operations, 10Traffic: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (10colewhite) [[ https://logstash.wikimedia.org/goto/0493475ebf5b04d14b38741e3c75261a | And now it's dropped off for a few hours. ]]
[15:58:24] <wikibugs>	 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10Andrew)
[15:59:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp2018.codfw.wmnet, wtp2019.codfw.wmnet, wtp2015.codfw.wmnet, wtp2001.codfw.wmnet, wtp2020.codfw.wmnet, wtp2006.codfw.wmnet, wtp2009.codfw.wmnet, wtp2016.codfw.wmnet, wtp2008.codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa
[16:00:57] <icinga-wm>	 RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:01:18] <_joe_>	 known ^^
[16:01:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: lvs: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/545608
[16:01:23] <_joe_>	 the pybal alert
[16:01:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "sigh." [puppet] - 10https://gerrit.wikimedia.org/r/545608 (owner: 10Giuseppe Lavagetto)
[16:02:03] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10colewhite) p:05Triage→03Normal
[16:02:37] <wikibugs>	 10Operations, 10Discovery-Search, 10vm-requests: setup/install airflow1001.eqiad.wmnet on ganeti - https://phabricator.wikimedia.org/T236181 (10colewhite) p:05Triage→03Normal
[16:03:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - CRITICAL - parsoid-php_443: Servers wtp2018.codfw.wmnet, wtp2019.codfw.wmnet, wtp2020.codfw.wmnet, wtp2001.codfw.wmnet, wtp2015.codfw.wmnet, wtp2006.codfw.wmnet, wtp2009.codfw.wmnet, wtp2016.codfw.wmnet, wtp2008.codfw.wmnet, wtp2005.codfw.wmnet, wtp2011.codfw.wmnet, wtp2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa
[16:04:28] <mutante>	 ^ known 
[16:05:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:05:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:08:14] <_joe_>	 ok, good
[16:08:25] <_joe_>	 I will run puppet on the icinga hosts in a few minutes
[16:13:01] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul)
[16:14:01] <wikibugs>	 (03CR) 10BPirkle: [C: 03+1] "For completeness, +1 to rebased patch set 3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[16:14:19] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:15:13] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:17:36] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10wiki_willy) Hi @jijiki - I think there are a couple things that @Jclark-ctr needs to check and resolve, before @RobH can configure it.  After that, the alert should go away.   Th...
[16:18:15] <robh>	 i got distracted with payments cert stuff
[16:18:29] <wikibugs>	 10Operations, 10Core Platform Team, 10serviceops: php-fpm  invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10jijiki)
[16:21:08] <wikibugs>	 10Operations, 10Core Platform Team, 10serviceops: php-fpm  invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (10jijiki) mw1317 will be reimaged, but not yet. We will keep it around (but off production) until someone can have a closer look
[16:22:52] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275)
[16:24:20] <wikibugs>	 10Operations, 10MediaWiki-Maintenance-scripts, 10cloud-services-team (Kanban): processEchoEmailBatch.php failing for labtestwiki - https://phabricator.wikimedia.org/T236145 (10Andrew) This ought to be fixed now -- please let me know if it is not!
[16:24:57] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275)
[16:26:44] <wikibugs>	 (03PS6) 10Eevans: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803)
[16:27:51] <wikibugs>	 (03PS2) 10Dzahn: DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul)
[16:29:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DNS: Add mgmt DNS for lvs3006 ganeti3002 dns3002 cp305[5-9] and cp3060 [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul)
[16:30:54] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) >>! In T191183#5597740, @hashar wrote: > Lowering priority since clearly we have no bandwidth to work on addin...
[16:31:38] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10WMF-NDA-Requests: Please add @jrobell and @spatton to WMF-NDA (access to private Phabricator tasks) - https://phabricator.wikimedia.org/T161822 (10Lgruwell-WMF) Not sure the process here, but I approve of Spatton and Jrobell having access to WMF-NDA.
[16:33:10] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: parsoid: allow restarting safely php-fpm during deployments. [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275)
[16:33:15] <librenms-wmf>	 04Critical Alert for device ps1-23-ulsfo.mgmt.ulsfo.wmnet - Device rebooted
[16:34:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "deployed - all working except cp3057 is not found - i dont see yet why that is despite the fix" [dns] - 10https://gerrit.wikimedia.org/r/545599 (owner: 10Papaul)
[16:35:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/19026/wtp1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/545600 (https://phabricator.wikimedia.org/T236275) (owner: 10Giuseppe Lavagetto)
[16:35:46] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH)
[16:35:48] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 15   cp3055 : xe-5/0/15 cp3056: xe-5/0/16 cp3057: xe-5/0/17 cp3058: xe-5/0/18 cp3059: xe-5/0...
[16:37:12] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Papaul) @BBlack dns3002 racked in rack 15 switch information   xe-5/0/14
[16:37:39] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[16:38:05] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[16:38:07] <wikibugs>	 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Papaul) ganeti3002 switch information   xe-5/0/13
[16:38:13] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:38:17] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:39:13] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[16:39:41] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[16:39:49] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:39:53] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:42:37] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:42:39] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:42:58] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:43:15] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-23-ulsfo.mgmt.ulsfo.wmnet recovered from Device rebooted
[16:43:40] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10RobH)
[16:44:41] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:45:03] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3005 switch information   xe-5/0/12
[16:45:17] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:45:53] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:46:15] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: parsoid: switch command with sudo rule to check-and-restart-php [puppet] - 10https://gerrit.wikimedia.org/r/545618
[16:46:19] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:46:53] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[16:48:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] parsoid: switch command with sudo rule to check-and-restart-php [puppet] - 10https://gerrit.wikimedia.org/r/545618 (owner: 10Giuseppe Lavagetto)
[16:49:39] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) @lexnasser Please create a new SSH key that is not used in cloud and let us know the public part so we can update the production access.
[16:49:54] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Dzahn) a:05Dzahn→03lexnasser
[16:54:52] <icinga-wm>	 PROBLEM - Host re0.cr2-esams is DOWN: PING CRITICAL - Packet loss = 100%
[16:55:00] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on parsoid.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.28 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:55:09] <_joe_>	 sigh
[16:55:16] <_joe_>	 I have no idea why it's paging
[16:55:16] <icinga-wm>	 PROBLEM - Juniper alarms on cr2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:55:17] <librenms-wmf>	 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Juniper alarm active
[16:55:19] <_joe_>	 or better, I know
[16:55:27] <volans>	 _joe_: new service or need help?
[16:55:31] <_joe_>	 but please ignore, I need to understand what I did wrong
[16:55:33] <_joe_>	 new service
[16:55:36] <volans>	 ok
[16:55:46] <bblack>	 XioNoX: esams network stuff above?
[16:55:53] <apergos>	 ok
[16:56:08] <_joe_>	 volans: can you downtime the other one in codfw please?
[16:56:11] <_joe_>	 it will page too
[16:56:13] <volans>	 sure
[16:56:18] <_joe_>	 thanks
[16:56:19] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10lexnasser) Here's another public ED25519 key: AAAAC3NzaC1lZDI1NTE5AAAAIOBTDDmL8isvso6xqOJB5qkk3n8xuM0XxFc1Q34ZnZRj  Let me know which service is associated with which k...
[16:56:46] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on parsoid.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.28 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[16:56:52] <volans>	 ETOOLATE
[16:56:54] <_joe_>	 heh
[16:56:55] <volans>	 sorry
[16:56:57] <_joe_>	 sorry
[16:57:00] <_joe_>	 no I noticed late too
[16:57:01] <volans>	 was about to
[16:57:09] <XioNoX>	 yeah the juniper alarms you can ignore for now
[16:57:22] <XioNoX>	 there is some cable shuffling going on
[16:57:24] <volans>	 _joe_: it's only the HTTP
[16:57:27] <volans>	 the HTTPS is green
[16:57:31] <apergos>	 heh oh well
[16:57:38] <volans>	 does it even listen to http?
[16:57:39] <_joe_>	 yes, I don't get why it defined the http too
[16:57:46] <volans>	 ok
[16:57:47] <_joe_>	 volans: it does but the port is filtered
[16:58:28] <icinga-wm>	 RECOVERY - Juniper alarms on cr2-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:58:32] <_joe_>	 I guess there is some magic I forgot about going on here
[16:58:40] <akosiaris>	 need help?
[16:58:47] <_joe_>	 akosiaris: might be
[16:59:20] <jynus>	 parsoid?
[16:59:28] <_joe_>	 jynus: ignore
[16:59:44] <jynus>	 sorry
[16:59:45] <jynus>	 read too late
[16:59:50] <_joe_>	 np :)
[16:59:58] <wikibugs>	 (03CR) 10Nuria: [C: 04-1] "Let's make sure staff uses staff e-mail though, I think that should be easy to change." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[17:00:02] <_joe_>	 akosiaris: I don't get why the http check is defined too
[17:00:04] <icinga-wm>	 RECOVERY - Host re0.cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 88.35 ms
[17:00:05] <akosiaris>	 parsoid is listening just fine on 8000
[17:00:25] <akosiaris>	 and the firewall rule is ok as well
[17:00:29] <akosiaris>	 e.g. on wtp1025
[17:00:35] <_joe_>	 akosiaris: this is parsoid-php
[17:00:42] <_joe_>	 it listens on 443, but not 80
[17:00:46] <akosiaris>	 sigh, port 80
[17:00:49] <akosiaris>	 never mind
[17:00:55] <_joe_>	 but somehow what I wrote in lvs::configuration activated both
[17:01:12] <akosiaris>	 there is some really weird logic in one place about the icinga checks
[17:01:14] <volans>	 it uses  Lvs::Monitor_service_http_https
[17:01:58] <_joe_>	 volans: does it?
[17:02:02] <volans>	 from puppetboard
[17:02:03] <volans>	 yes
[17:02:10] <volans>	 Lvs::Monitor_service_http_https[parsoid.svc.codfw.wmnet]
[17:02:14] <_joe_>	 yeah
[17:02:37] <_joe_>	 then I don't get how api and api-https can coexist
[17:02:44] <_joe_>	 oh right I see now
[17:02:46] <_joe_>	 gosh
[17:02:51] <volans>	 that contains    Monitoring::Service[parsoid.svc.codfw.wmnet]
[17:02:55] <volans>	 that is the http version
[17:02:56] <_joe_>	 the wizardry lvs::monitor
[17:02:56] <volans>	 of the check
[17:03:10] <icinga-wm>	 PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100%
[17:03:31] <cdanis>	 _joe_: I think there is a quick fix here
[17:03:33] <_joe_>	 volans: yeah so we get some duplicates that end up coinciding in the output, ugh
[17:03:40] <_joe_>	 cdanis: there are a couple, yes
[17:03:52] <_joe_>	 the easiest one is the one I'm going to apply now
[17:04:21] <volans>	 if not needed I need to step out
[17:04:32] <icinga-wm>	 RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms
[17:04:41] <_joe_>	 this is not an emergency
[17:04:46] <_joe_>	 parsoid-php is not in production
[17:04:57] <_joe_>	 it's just ironic I tried to fix that check so that it won't page
[17:05:06] <wikibugs>	 (03PS1) 10CDanis: lvs parsoid-php workaround [puppet] - 10https://gerrit.wikimedia.org/r/545619
[17:05:08] <_joe_>	 and some abstraction we created years ago is biting me 
[17:05:29] <volans>	 :D
[17:05:32] <volans>	 ttyl
[17:05:33] <_joe_>	 cdanis: check_https_lvs
[17:05:40] <_joe_>	 if it exists
[17:05:49] <cdanis>	 lol it does
[17:05:51] <cdanis>	 wrong copy and paste
[17:06:46] <_joe_>	 check_https_lvs_on_port
[17:06:52] <_joe_>	 but doesn't support the hostname
[17:06:54] <_joe_>	 ahah
[17:07:16] <cdanis>	 monitor_service_http_https calls check_http_lvs
[17:07:34] <_joe_>	 check_https_url
[17:07:47] <_joe_>	 this is what you probably want
[17:07:52] <cdanis>	 yes
[17:07:53] <cdanis>	 you are right
[17:08:13] <wikibugs>	 (03PS2) 10CDanis: lvs parsoid-php workaround [puppet] - 10https://gerrit.wikimedia.org/r/545619
[17:08:20] <akosiaris>	 and relies on $check_command to differentiate between the simple form and adding both http/https
[17:09:14] <_joe_>	 akosiaris: yeah I expected that specifying just the uri would result in a check of the port to call
[17:09:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] lvs parsoid-php workaround [puppet] - 10https://gerrit.wikimedia.org/r/545619 (owner: 10CDanis)
[17:09:35] <akosiaris>	 3rd time this is biting us in a couple of months
[17:09:55] <_joe_>	 time to fix that horror and rewrite it in puppet 4+
[17:09:55] <akosiaris>	 It's probably about time we rework the entire structure + corresponding code
[17:10:03] <cdanis>	 https://puppet-compiler.wmflabs.org/compiler1002/19031/icinga1001.wikimedia.org/
[17:10:08] <akosiaris>	 it's been there since 2014?
[17:10:14] <_joe_>	 yeah 
[17:10:19] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19031/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/545619 (owner: 10CDanis)
[17:10:20] <_joe_>	 earlier possibly
[17:10:22] <akosiaris>	 full of assumptions to accommodate the status quo stanza
[17:10:33] <wikibugs>	 (03CR) 10Cwhite: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[17:11:18] <_joe_>	 yeah but more in general, I'd like to rethink how we set up a new service from scratch
[17:11:27] <cdanis>	 it is far too complicated right now
[17:11:28] <_joe_>	 ideally I'd like to make 1, 2 puppet commits tops
[17:11:31] <_joe_>	 yes
[17:11:41] <cdanis>	 and even those who understand every part get things wrong sometimes ;)
[17:12:08] <_joe_>	 cdanis: tbh I completely forgot how the tricks we do in ruby in lvs::monitor allow duplicate declarations :D
[17:12:44] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:12:47] <_joe_>	 I would also love not to have to restart pybal when we add a service, but that's not exactly easy
[17:13:20] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on parsoid.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14776 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:13:34] <_joe_>	 it still adds both, sigh wtf
[17:13:44] <cdanis>	 that isn't what it looked like should happen in pcc
[17:13:45] <_joe_>	 at least it's all green
[17:14:12] <akosiaris>	 no it doesn't... I only see 2 now, 1 per DC
[17:14:27] <akosiaris>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=parsoid.svc#
[17:14:32] <_joe_>	 yes it removed the https definitions
[17:14:36] <_joe_>	 and kept the http
[17:14:38] <cdanis>	 yes
[17:14:39] <_joe_>	 which call https
[17:14:40] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on parsoid.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 14776 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:14:41] <_joe_>	 ahahahahah
[17:14:44] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:14:45] <akosiaris>	 ;-)
[17:14:45] <_joe_>	 ok whatever
[17:14:51] <cdanis>	 it's a workaround of a workaround of a workaround
[17:14:53] <cdanis>	 what did you expect ;)
[17:14:54] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:14:55] <_joe_>	 I need to go afk at least for a few hours
[17:15:00] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:15:01] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:15:08] <_joe_>	 can someone look at the availability issue?
[17:15:08] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:15:19] <cdanis>	 _joe_: it's being looked at
[17:15:20] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:15:24] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:15:48] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:15:54] <akosiaris>	 some varnish backend again ?
[17:15:58] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:16:04] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:16:04] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:16:24] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:16:26] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:16:47] <_joe_>	 I would say restbase
[17:16:54] <_joe_>	 given we have errors coming from ats
[17:17:05] <_joe_>	 (they have referer=envoy)
[17:17:35] <_joe_>	 oh no right now it's commons' api
[17:17:55] <_joe_>	 and cp1077 it seems
[17:18:11] <akosiaris>	 my guesses say restbase1081
[17:18:19] <_joe_>	 that doesn't exist 
[17:18:28] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:18:34] <akosiaris>	 cp1081*
[17:18:35] <akosiaris>	 dammit
[17:18:40] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:18:43] <_joe_>	 look at the last few minutes
[17:18:50] <_joe_>	 it was 1081 before
[17:18:59] <_joe_>	 but the last spike is 1077 indubitably
[17:19:13] <cdanis>	 and there's cp1089 in the middle, maxing out on connections to backends
[17:19:22] <cdanis>	 but it's spread amongst servers reasonably well
[17:19:30] <cdanis>	 so it's something about the traffic or appserver behavior
[17:19:35] <cdanis>	 which is bouncing between varnishes
[17:19:56] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:20:26] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:20:38] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:20:46] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:20:46] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:20:54] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:21:40] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:21:50] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:22:00] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:22:00] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:22:21] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:22:24] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:25:26] <akosiaris>	 !log restart varnish-be on cp1081 as a response to HTTP availability alerts
[17:25:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:11] <akosiaris>	 I did restart -be anyway. It seems to have recovered. I'd say correlation is not causation, but maybe it was in this case?
[17:28:11] <cdanis>	 akosiaris: I don't think so, according to https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now there were lots of other backend instances suffering as well
[17:28:40] <cdanis>	 still others at or close to their parallelism limit for connections to appservers
[17:28:56] <cdanis>	 to me that indicates something pathological about the traffic being handled
[17:30:03] <akosiaris>	 cp1087's mailbox lag is through the roof as well
[17:30:12] <akosiaris>	 chances are we are going to see a problem again
[17:31:40] <akosiaris>	 !log restart varnish-be on cp1089 as a response to HTTP availability alerts. High mailbox lag
[17:31:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:07] <akosiaris>	 let's see now
[17:32:23] <bblack>	 and restarting wipes caches too, so it's not without detriment to roll through restarting them all either
[17:32:48] <akosiaris>	 it's the backends, it shouldn't, right?
[17:33:08] <akosiaris>	 I mean the cache is on disk (for that weird definition of disk that is now varnish on disk)
[17:33:43] <bblack>	 the disk cache is ephemeral, it's wiped on every restart of the daemon
[17:34:14] <akosiaris>	 sigh, I forgot about that
[17:34:15] <bblack>	 we can handle it within reason, but there will be a spike of increased misses for a while as you roll through them
[17:34:15] <cdanis>	 akosiaris: you can see this for instance on https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&panelId=8&fullscreen&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now
[17:34:56] <cdanis>	 cp1077 again has lots of inuse connections to appservers and api_appservers
[17:35:43] <akosiaris>	 so whatever it is is shifting between varnishes?
[17:37:17] <akosiaris>	 yeah the failed fetches are now on to cp1077
[17:37:34] <akosiaris>	 ok no more whack a mole, need to find what's going on
[17:37:49] <cdanis>	 it is possibly just one URL that has gotten very slow and is being hammered
[17:37:56] <bblack>	 well two URLs right?
[17:38:00] <bblack>	 api + appservers
[17:38:09] <bblack>	 they're separate pools, so that part I don't get
[17:38:23] <cdanis>	 mm true
[17:39:03] <cdanis>	 there are elevated inuse connections on most varnishes though, although generally one is much more pronounced at any given time
[17:39:25] <bblack>	 right
[17:39:41] <bblack>	 but only to api/appservers, not elevated to other distinct backend services?
[17:40:21] <cdanis>	 more often than not, just api/appservers; sometimes, also restbase -- but it's hard to tease that apart, of course
[17:40:33] <bblack>	 RB looks pretty elevated in some of that too, yeah
[17:40:54] <bblack>	 there can of course be systemic effects where they all mix together in varnish, too
[17:49:20] <godog>	 looks like a ton of objects are being created e.g. on cp1077 currently affected https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?panelId=13&fullscreen&orgId=1&var-server=cp1077&var-datasource=eqiad%20prometheus%2Fops&from=1571831334213&to=1571852934213
[17:49:36] <wikibugs>	 (PS1) Dzahn: admins: temp disable lexnasser's shell account [puppet] - https://gerrit.wikimedia.org/r/545624
[17:49:50] <wikibugs>	 (PS2) Dzahn: admins: temp disable lexnasser's shell account [puppet] - https://gerrit.wikimedia.org/r/545624
[17:49:50] <godog>	 are those inspectable in varnish ?
[17:49:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] admins: temp disable lexnasser's shell account [puppet] - https://gerrit.wikimedia.org/r/545624 (owner: Dzahn)
[17:51:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] admins: temp disable lexnasser's shell account [puppet] - https://gerrit.wikimedia.org/r/545624 (owner: Dzahn)
[17:52:11] <wikibugs>	 (PS3) Dzahn: admins: temp disable lexnasser's shell account [puppet] - https://gerrit.wikimedia.org/r/545624
[17:54:50] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:55:02] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:55:02] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:55:30] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:55:30] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:56:24] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:56:38] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:56:38] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[17:57:06] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[17:57:06] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T1800).
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:01:46] <brennen>	 Daimona: around?  i can deploy for https://phabricator.wikimedia.org/T236286
[18:02:44] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:02:56] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:02:56] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:03:24] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:03:26] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:03:44] <wikibugs>	 Operations, ops-esams, netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616 (ayounsi) Open→Resolved Done.
[18:03:47] <wikibugs>	 Operations, Traffic, netops, Wikimedia-Incident: Configure interface damping on primary links - https://phabricator.wikimedia.org/T196432 (ayounsi)
[18:03:50] <wikibugs>	 Operations, ops-esams, netops: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (ayounsi)
[18:04:01] <wikibugs>	 Operations, ops-esams, Epic: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) - https://phabricator.wikimedia.org/T184061 (ayounsi)
[18:04:02] <icinga-wm>	 PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100%
[18:04:03] <wikibugs>	 Operations, ops-esams, netops: Complete router migration from cr1-esams to cr3-esams - https://phabricator.wikimedia.org/T184067 (ayounsi) Open→Resolved a:ayounsi Done.
[18:04:18] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:04:22] <icinga-wm>	 PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[18:04:30] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:04:30] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:05:00] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:05:00] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:06:26] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:06:30] <icinga-wm>	 PROBLEM - Host multatuli.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:02] <icinga-wm>	 PROBLEM - Host cp3030.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:02] <icinga-wm>	 PROBLEM - Host cp3038.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:11] <Daimona>	 brennen: So and so
[18:07:12] <icinga-wm>	 PROBLEM - Host cp3035.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:25] <Daimona>	 I'd say yes for the next 5 minutes or so
[18:07:42] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:07:43] <brennen>	 it's probably not an issue, but i may wait until 
[18:07:44] <icinga-wm>	 PROBLEM - Host lvs3004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:51] <brennen>	 er, may wait until error traffic subsides...
[18:07:52] <icinga-wm>	 PROBLEM - Host lvs3003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:52] <icinga-wm>	 PROBLEM - Host lvs3001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:52] <icinga-wm>	 PROBLEM - Host lvs3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:07:52] <icinga-wm>	 PROBLEM - Host maerlant.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:08] <icinga-wm>	 PROBLEM - Host cp3039.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:16] <icinga-wm>	 PROBLEM - Host nescio.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:28] <icinga-wm>	 PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:32] <bblack>	 yes, please hold deploys for now
[18:08:36] <brennen>	 ack
[18:08:52] <icinga-wm>	 PROBLEM - Host cp3033.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:52] <icinga-wm>	 PROBLEM - Host bast3002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:52] <icinga-wm>	 PROBLEM - Host cp3034.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:52] <icinga-wm>	 PROBLEM - Host cp3036.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:52] <icinga-wm>	 PROBLEM - Host cp3040.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:52] <icinga-wm>	 PROBLEM - Host cp3042.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:52] <icinga-wm>	 PROBLEM - Host cp3041.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:08:53] <icinga-wm>	 PROBLEM - Host cp3043.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:02] <icinga-wm>	 PROBLEM - Host cp3044.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:02] <icinga-wm>	 PROBLEM - Host cp3046.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:04] <icinga-wm>	 PROBLEM - Host cp3047.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:04] <icinga-wm>	 PROBLEM - Host cp3045.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:04] <icinga-wm>	 PROBLEM - Host cp3049.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:18] <icinga-wm>	 PROBLEM - Host cp3032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:26] <icinga-wm>	 PROBLEM - Host mr1-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:09:36] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:09:38] <mutante>	 the management router crashed
[18:09:40] <icinga-wm>	 RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 83.88 ms
[18:09:40] <icinga-wm>	 RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.70 ms
[18:09:59] <godog>	 I think they were downtimed too
[18:10:45] <Daimona>	 Oh, didn't even notice them... In case I disappear, testing is pretty easy: head to Special:AbuseFilter/new and ensure that the "Actions to take when matched" section is not empty
[18:10:46] <mutante>	 i don't think they were. unexpected
[18:10:59] <Daimona>	 Let me provide examples
[18:11:14] <Daimona>	 This is how it should look: https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:AbuseFilter/new
[18:11:49] <Daimona>	 Whereas currently it's empty (see e.g. https://phabricator.wikimedia.org/F30876704)
[18:12:02] <XioNoX>	 everything should be back to normal
[18:12:06] <icinga-wm>	 PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[18:12:13] <mutante>	 i was about to silence the alerts but then decided it's nicer to see it coming back
[18:12:31] <mutante>	 XioNoX: thanks!
[18:12:44] <wikibugs>	 Operations, serviceops: php-fpm  invalid opcode on mw1317 - https://phabricator.wikimedia.org/T236292 (CCicalese_WMF) It does not look like there is work for #core_platform_team to do on this at this point, but @tstarling may want to take a look.
[18:12:47] <hauskater>	 No 'recovery' alerts?
[18:13:09] <XioNoX>	 I think it crashed again
[18:13:16] <XioNoX>	 what the
[18:13:24] <icinga-wm>	 PROBLEM - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100%
[18:14:24] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:14:57] <wikibugs>	 (PS1) CRusnov: librenms: Handle the case where hardware is null [software/netbox-reports] - https://gerrit.wikimedia.org/r/545628
[18:15:35] <wikibugs>	 (CR) CRusnov: [C: +2] librenms: Handle the case where hardware is null [software/netbox-reports] - https://gerrit.wikimedia.org/r/545628 (owner: CRusnov)
[18:15:39] <wikibugs>	 (PS1) BBlack: Example defensive timeout config [puppet] - https://gerrit.wikimedia.org/r/545629
[18:15:44] <wikibugs>	 (PS2) CRusnov: librenms: Handle the case where hardware is null [software/netbox-reports] - https://gerrit.wikimedia.org/r/545628
[18:17:42] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[18:17:45] <XioNoX>	 alright mr1 is dead
[18:17:56] <XioNoX>	 paravoid: ^
[18:22:30] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[18:22:46] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@1110d59]: deploying refinery up to 1110d59c3983bcff4986bce1baf885f05ee06ba5
[18:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:59] <wikibugs>	 (CR) Dzahn: [C: +2] admins: temp disable lexnasser's shell account [puppet] - https://gerrit.wikimedia.org/r/545624 (owner: Dzahn)
[18:25:24] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:25:42] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:25:44] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:25:56] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:26:12] <icinga-wm>	 PROBLEM - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - commonswiki_content_1556151793(67gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[18:26:24] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:26:36] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:26:42] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:26:48] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:26:48] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:26:58] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:27:10] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:27:30] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:27:34] <wikibugs>	 (03PS7) 10Dzahn: gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166)
[18:28:00] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:28:14] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:28:20] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:28:24] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:28:46] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:28:54] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:28:54] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:29:27] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@1110d59]: deploying refinery up to 1110d59c3983bcff4986bce1baf885f05ee06ba5 (duration: 06m 40s)
[18:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:00] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:37:54] <wikibugs>	 (03CR) 10Nuria: [C: 04-1] "I think so, but let's ping user on ticket to make sure he knows it is happening." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[18:39:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "I'm ready for you to deploy this whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542506 (https://phabricator.wikimedia.org/T223907) (owner: 10BryanDavis)
[18:39:21] <wikibugs>	 (03PS1) 10Dzahn: admins: re-enable shell account for lexnasser with new key [puppet] - 10https://gerrit.wikimedia.org/r/545630 (https://phabricator.wikimedia.org/T235688)
[18:40:29] <andrewbogott>	 brennen or liw, did the train go ok this morning?  Can we close https://phabricator.wikimedia.org/T236166?
[18:42:32] <brennen>	 andrewbogott: yeah, should be good.
[18:42:40] <andrewbogott>	 great, thanks
[18:43:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[18:45:06] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:45:48] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:45:58] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:46:00] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:46:10] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:46:28] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:46:28] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:47:12] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:47:30] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:49:02] <logmsgbot>	 !log milimetric@deploy1001 Started deploy [analytics/refinery@3aaabf6]: Minor: fix two scripts
[18:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:44] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:51:30] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:51:50] <wikibugs>	 (03PS1) 10BBlack: cache_text: raise appservers/api conn limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/545634
[18:52:02] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:52:13] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] cache_text: raise appservers/api conn limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/545634 (owner: 10BBlack)
[18:52:26] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:52:30] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] cache_text: raise appservers/api conn limits to 10K [puppet] - 10https://gerrit.wikimedia.org/r/545634 (owner: 10BBlack)
[18:52:36] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:52:54] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:52:54] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:53:50] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:54:00] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:55:56] <wikibugs>	 (03PS2) 10Dzahn: admins: re-enable shell account for lexnasser with new key [puppet] - 10https://gerrit.wikimedia.org/r/545630 (https://phabricator.wikimedia.org/T235688)
[18:56:08] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[18:56:36] <wikibugs>	 (03PS3) 10Bstorm: monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458)
[18:56:56] <logmsgbot>	 !log milimetric@deploy1001 Finished deploy [analytics/refinery@3aaabf6]: Minor: fix two scripts (duration: 07m 53s)
[18:56:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] admins: re-enable shell account for lexnasser with new key [puppet] - 10https://gerrit.wikimedia.org/r/545630 (https://phabricator.wikimedia.org/T235688) (owner: 10Dzahn)
[18:59:48] <XioNoX>	 alright, mr1 is now booting from the USB drive
[19:00:38] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] monitoring: set wmcs servers to email when mgmt interfaces fail [puppet] - 10https://gerrit.wikimedia.org/r/545386 (https://phabricator.wikimedia.org/T223458) (owner: 10Bstorm)
[19:03:50] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:03:52] <icinga-wm>	 RECOVERY - Host asw-esams is UP: PING OK - Packet loss = 0%, RTA = 90.73 ms
[19:03:58] <icinga-wm>	 RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 83.69 ms
[19:04:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:00] <icinga-wm>	 RECOVERY - Host multatuli.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.94 ms
[19:05:33] <icinga-wm>	 RECOVERY - Host cp3030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.50 ms
[19:05:33] <icinga-wm>	 RECOVERY - Host cp3038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.50 ms
[19:05:33] <icinga-wm>	 RECOVERY - Host cp3035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms
[19:05:43] <brennen>	 bblack / XioNoX: clear for deploys at this point?
[19:05:52] <icinga-wm>	 RECOVERY - Host lvs3004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.40 ms
[19:05:59] <icinga-wm>	 RECOVERY - Host lvs3001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.39 ms
[19:05:59] <icinga-wm>	 RECOVERY - Host lvs3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.89 ms
[19:05:59] <icinga-wm>	 RECOVERY - Host lvs3003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.61 ms
[19:05:59] <icinga-wm>	 RECOVERY - Host maerlant.mgmt is UP: PING OK - Packet loss = 0%, RTA = 89.32 ms
[19:06:27] <icinga-wm>	 RECOVERY - Host nescio.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.10 ms
[19:06:27] <icinga-wm>	 RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 93.18 ms
[19:06:29] <icinga-wm>	 RECOVERY - Host bast3002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.82 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.92 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.39 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3036.mgmt is UP: PING OK - Packet loss = 0%, RTA = 97.72 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3040.mgmt is UP: PING OK - Packet loss = 0%, RTA = 96.43 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3039.mgmt is UP: PING OK - Packet loss = 0%, RTA = 96.52 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3041.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.81 ms
[19:06:39] <icinga-wm>	 RECOVERY - Host cp3042.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.21 ms
[19:06:40] <icinga-wm>	 RECOVERY - Host cp3043.mgmt is UP: PING OK - Packet loss = 0%, RTA = 98.04 ms
[19:06:50] <wikibugs>	 (03PS1) 10Jbond: puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640
[19:06:53] <icinga-wm>	 RECOVERY - Host cp3044.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.62 ms
[19:06:53] <icinga-wm>	 RECOVERY - Host cp3046.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.33 ms
[19:06:57] <icinga-wm>	 RECOVERY - Host cp3045.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.36 ms
[19:06:57] <icinga-wm>	 RECOVERY - Host cp3047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.07 ms
[19:06:57] <icinga-wm>	 RECOVERY - Host cp3049.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.96 ms
[19:07:09] <icinga-wm>	 RECOVERY - Host cp3032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 88.33 ms
[19:07:21] <icinga-wm>	 RECOVERY - Host mr1-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms
[19:09:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond)
[19:09:26] <bblack>	 brennen: I think for now we need to keep holding a bit, we still don't really understand what's going on with massive request parallelism/timeouts
[19:10:24] <XioNoX>	 esams mgmt is back to a good enough state for tonight
[19:10:26] <brennen>	 bblack: cool, thanks for update.  i may not be able to test the patch i've got here anyway, so it can probably wait until J.ames_F is online.
[19:10:37] <bblack>	 ok
[19:10:58] <mutante>	 XioNoX: ack, i hope you guys get some rest now after a long day. what timing
[19:11:53] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:12:31] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:14:18] <wikibugs>	 (03PS2) 10Jbond: puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640
[19:19:03] <wikibugs>	 (03CR) 10Alex Monk: [C: 03+1] puppet: clean up unsed parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond)
[19:19:44] <wikibugs>	 (03PS3) 10Jbond: puppet: clean up unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640
[19:23:08] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron)
[19:23:18] <librenms-wmf>	 04Critical Alert for device mr1-esams.wikimedia.org - Juniper alarm active
[19:25:46] <wikibugs>	 (03PS2) 10Cwhite: admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron)
[19:28:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] admin: add gsingers to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/544197 (https://phabricator.wikimedia.org/T235260) (owner: 10Herron)
[19:51:30] <wikibugs>	 (03PS1) 10Anomie: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188)
[20:00:05] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T2000).
[20:13:53] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:15:41] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[20:15:53] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:16:03] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[20:16:11] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:16:15] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:17:05] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:17:17] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[20:17:29] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:17:39] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[20:17:47] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:17:51] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:18:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792)
[20:29:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: increase heap_size from 20G to 32G [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn)
[20:29:33] <wikibugs>	 (03PS1) 10MarcoAurelio: Restrict uploads on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545655 (https://phabricator.wikimedia.org/T236307)
[20:31:21] <icinga-wm>	 PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:31:35] <icinga-wm>	 PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[20:31:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:32:23] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:32:31] <icinga-wm>	 PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[20:32:39] <icinga-wm>	 PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[20:32:43] <icinga-wm>	 PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:32:51] <icinga-wm>	 PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:34:15] <icinga-wm>	 PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:35:55] <shdubsh>	 ^^ stat1007 looks pretty busy running an R program
[20:36:29] <mutante>	 shdubsh: unfortunately that's common.  restart nagios-nrpe-server should fix it all
[20:36:39] <mutante>	 it always gets killed first by OOM killer
[20:36:59] <mutante>	 and stat1007 often has this issue that user jobs use all the RAM
[20:37:11] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:37:17] <mutante>	 it's https://phabricator.wikimedia.org/T212824
[20:37:19] <icinga-wm>	 RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[20:37:25] <shdubsh>	 odd,  memory utilization is really low
[20:37:27] <icinga-wm>	 RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[20:37:31] <icinga-wm>	 RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:37:35] <chaomodus>	 it's always the same explanation
[20:37:39] <icinga-wm>	 RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:37:45] <icinga-wm>	 RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:37:46] <shdubsh>	 !log restart nagios-nrpe-server on stat1007
[20:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:59] <icinga-wm>	 RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops
[20:37:59] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:48] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "I'm still learning how these files work, but seems legit!" [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli)
[20:38:55] <shdubsh>	 interesting, there was a lot of memory utilization just before
[20:39:40] <mutante>	 these are things run manually by people.. so who knows
[20:39:49] <icinga-wm>	 RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:40:01] <mutante>	 it's often R.. ack
[20:40:33] <mutante>	 sometimes i send a message to | wall that it's causing an issue
[20:41:09] <shdubsh>	 seems strange that the oom killer taking out R also takes out nrpe
[20:41:24] <shdubsh>	 oh, heh
[20:41:33] <shdubsh>	 nagios-nrpe-server.service: Failed to fork: Cannot allocate memory
[20:41:40] <shdubsh>	 that'll do it
[20:42:09] <wikibugs>	 (03PS1) 10BBlack: Basic install for new esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294)
[20:42:30] <mutante>	 unfortunately it is often (always?) the first victim of the killer
[20:42:36] <mutante>	 then turning into that icinga spam
[20:46:38] <chaomodus>	 it must have a low priority or something
[20:53:25] <wikibugs>	 (03PS1) 10Ayounsi: New esams stuff [homer/public] - 10https://gerrit.wikimedia.org/r/545660 (https://phabricator.wikimedia.org/T235805)
[20:54:35] <mutante>	 here is the suggestion to put the users into a different slice  https://phabricator.wikimedia.org/T212824#4967798
[20:55:35] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[20:57:09] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:02:04] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] gerrit: change gerrit master_host to gerrit1001, remove duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545342 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn)
[21:05:37] <wikibugs>	 (03PS1) 10BBlack: basic DNS entries for new esams hosts [dns] - 10https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294)
[21:06:51] <wikibugs>	 (03PS4) 10Dzahn: webperf: add backups for arclamp application data [puppet] - 10https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425)
[21:07:32] <mutante>	 it's so nice to see bast3003 being added
[21:10:03] <mutante>	 bblack: maybe it should be bast3004. because technically the decom ticket for bast3003 is open https://phabricator.wikimedia.org/T216199
[21:10:30] <mutante>	 there was already much ambiguity hence those ticket comments from back then
[21:10:38] <bblack>	 lol
[21:10:47] <bblack>	 thanks for the heads up, agree, should rename it to bast3004 :)
[21:11:00] <mutante>	 'k :)
[21:11:53] <wikibugs>	 (PS2) BBlack: Basic install for new esams hosts [puppet] - https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294)
[21:12:46] <bblack>	 fixed
[21:12:52] <wikibugs>	 (PS2) BBlack: basic DNS entries for new esams hosts [dns] - https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294)
[21:12:53] <bblack>	 gotta run :)
[21:21:53] <wikibugs>	 (CR) Dzahn: [C: +2] webperf: add backups for arclamp application data [puppet] - https://gerrit.wikimedia.org/r/543005 (https://phabricator.wikimedia.org/T235425) (owner: Dzahn)
[21:26:17] <icinga-wm>	 PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:45] <icinga-wm>	 PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:27] <mutante>	 ^ that would be me adding bacula service.. looking
[21:32:31] <icinga-wm>	 RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:39] <icinga-wm>	 RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:44] <mutante>	 !log webperf1002/2002 - starting bacula-fd service that is failed after initial puppet run turning them into backup::hosts
[21:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:56] <Krenair>	 up to bast3003 already?
[21:33:02] <mutante>	 Krenair: 3004 :p
[21:33:07] <Krenair>	 that seems quick. doesn't feel like hooft.esams was that long ago
[21:33:32] <mutante>	 Krenair: "bast3002 was broken and to be replaced with another server, bast3003, which was formerly amslvs4." :p
[21:33:39] <mutante>	 they kept breaking
[21:34:03] <Krenair>	 I'm assuming that stuff would all be way out of warranty by now :P
[21:34:22] <mutante>	 yea, which is why we had to find _something_ to use as bastion
[21:34:26] <Krenair>	 heh
[21:34:41] <mutante>	 but now finally new hardware, yay
[21:34:51] <Krenair>	 nice
[21:44:51] <papaul>	 Krenair: we have a new server bast3003
[21:45:59] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:47:23] <mutante>	 papaul: please rename to bast3004 because https://phabricator.wikimedia.org/T216199
[21:47:46] <mutante>	 (b.black did in upcoming DNS changes.. but labels)
[21:48:11] <mutante>	 also, have some rest :)
[21:50:18] <papaul>	 mutante: can't sleep don't know why
[21:50:37] <mutante>	 papaul: jetlag :)
[21:50:47] <papaul>	 but will have to clarify that tomorrow since on the new order we have a new server called bast3003 too
[21:51:27] <mutante>	 yes please, that sounds like it would cause more confusion 
[21:51:48] <mutante>	 and the already confusing old ticket
[21:51:58] <mutante>	 for final decom of bast3003 
[21:52:13] <papaul>	 mutante: understood 
[22:00:21] <logmsgbot>	 !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server)
[22:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:41] <logmsgbot>	 !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 00m 21s)
[22:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:31] <wikibugs>	 (CR) Mathew.onipe: wdqs: add data-reload cookbook (12 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[22:14:30] <logmsgbot>	 !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server)
[22:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:40] <logmsgbot>	 !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 01m 10s)
[22:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:14] <wikibugs>	 (PS10) Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[22:16:16] <wikibugs>	 (PS1) Mathew.onipe: fix unused format [cookbooks] - https://gerrit.wikimedia.org/r/545672
[22:16:18] <wikibugs>	 (PS1) Mathew.onipe: Better query to host check [cookbooks] - https://gerrit.wikimedia.org/r/545673
[22:16:20] <twentyafterfour>	 Disconnecting authenticating user phab-deploy ....: Too many authentication failures [preauth]
[22:16:29] <twentyafterfour>	 wth
[22:19:59] <logmsgbot>	 !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server)
[22:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:03] <logmsgbot>	 !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 00m 05s)
[22:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:15] <logmsgbot>	 !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server)
[22:20:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:36] <logmsgbot>	 !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@e4e2b22]: deploy to phab1001 (currently a warm spare server) (duration: 00m 21s)
[22:20:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:41] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:41:39] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 674.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:42:25] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 704.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[22:42:51] <wikibugs>	 (CR) Dzahn: [C: +1] basic DNS entries for new esams hosts [dns] - https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[22:44:24] <brennen>	 James_F: think https://phabricator.wikimedia.org/T236286 should be on mwdebug1001; mind testing?
[22:47:31] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[22:47:39] <wikibugs>	 (CR) Dzahn: "dns3001 missing in DHCP?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[22:51:14] <James_F>	 brennen: Sorry, testing now.
[22:51:54] <James_F>	 brennen: Yeah, LGTM.
[22:52:10] <brennen>	 James_F: rad, thank you.
[22:55:53] <logmsgbot>	 !log brennen@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/AbuseFilter: SWAT: [[gerrit:545620|Unbreak filter edit form (T236286)]] (duration: 01m 05s)
[22:55:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:57] <stashbot>	 T236286: Uncaught Error: Widget not found when editing filters - https://phabricator.wikimedia.org/T236286
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191023T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:08:57] <wikibugs>	 (PS2) Mathew.onipe: Better query-to-host check [cookbooks] - https://gerrit.wikimedia.org/r/545673
[23:08:59] <wikibugs>	 (PS11) Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[23:09:38] <wikibugs>	 (CR) Mathew.onipe: query_service: prepare query_service for reusbility (2 comments) [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[23:11:11] <wikibugs>	 (PS20) Mathew.onipe: query_service: rename wdqs module to query_service [puppet] - https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297)
[23:11:13] <wikibugs>	 (PS27) Mathew.onipe: query_service: prepare query_service for reusbility [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297)
[23:11:15] <wikibugs>	 (PS25) Mathew.onipe: query_service: rename profile/wdqs to profile/query_service [puppet] - https://gerrit.wikimedia.org/r/538849 (https://phabricator.wikimedia.org/T232297)
[23:11:17] <wikibugs>	 (PS23) Mathew.onipe: query_service: separate categories from main blazegraph profile [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297)
[23:11:20] <wikibugs>	 (PS24) Mathew.onipe: query_service: properly adapt query_service profile [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297)
[23:11:21] <wikibugs>	 (PS24) Mathew.onipe: query_service: properly adapt hiera configs [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297)
[23:24:08] <wikibugs>	 (CR) Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1001/19032/" [puppet] - https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[23:26:24] <wikibugs>	 (CR) Mathew.onipe: "PCC is good: https://puppet-compiler.wmflabs.org/compiler1002/19033/" [puppet] - https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[23:26:35] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:26:35] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:27:09] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 29.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:27:47] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 42.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:30:21] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 105 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:30:57] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 80.15 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:31:01] <wikibugs>	 (PS1) Alex Monk: Swap toolforge proxies to use acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/545679
[23:31:34] <wikibugs>	 (PS2) Alex Monk: Swap toolforge proxies to use acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252)
[23:43:26] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports