[00:01:24] (03CR) 10Dzahn: "https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#sendemail.connectTimeout" [puppet] - 10https://gerrit.wikimedia.org/r/507853 (owner: 10Paladox)
[00:06:32] !log restarting gerrit to pick up config changes for 2 mail threads and lower timeout (gerrit:507852, gerrit: 507853)
[00:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:42] paladox: had a little scare because i saw errors but unrelated
[00:08:48] oh
[00:08:59] the external id exception thing
[00:09:04] ah
[00:09:19] you mean with a similar error to https://phabricator.wikimedia.org/T222336 ?
[00:09:21] ConfigInvalidException: Invalid external ID config for note
[00:09:32] ConfigInvalid sounded bad ..after changing config.. at first
[00:09:36] (03CR) 10Paladox: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/507853 (owner: 10Paladox)
[00:09:39] but that's different
[00:10:05] !log remove static route to 208.80.155.128/25 on cr1/2-eqiad - T193496
[00:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:09] T193496: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496
[00:10:17] ah ok
[00:11:02] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[00:11:18] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org]
[00:11:34] ah
[00:11:36] looking
[00:11:52] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[00:13:01] too
[00:13:10] i thought it wasn't even used yet?
[00:13:20] the private key that is
[00:13:41] oh..nevermind :) this is of course just from the gerrit restart
[00:13:50] you can safely ignore or just run puppet once
[00:14:01] ok!
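For context, the failed Exec[git_pull_*] resources above clear on the next Puppet run, per the "just run puppet once" advice; a one-off manual run on an affected host looks like this (standard Puppet agent CLI, nothing WMF-specific assumed):

    sudo puppet agent --test    # single foreground run; re-syncs the failed Exec[git_pull_*] resources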
[00:15:32] PROBLEM - puppet last run on schema2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[00:16:22] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:17:06] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/16301/netmon1002.wikimedia.org/change.netmon1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[00:22:39] (03PS4) 10Ayounsi: LibreNMS, file files permission, add app key, add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706)
[00:27:33] (03PS1) 10Ayounsi: LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906
[00:27:54] (03PS2) 10Ayounsi: LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906
[00:28:16] (03CR) 10Dzahn: [C: 03+2] LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906 (owner: 10Ayounsi)
[00:29:12] (03CR) 10Dzahn: [V: 03+2 C: 03+2] LibreNMS, add fake db_user and db_pass [labs/private] - 10https://gerrit.wikimedia.org/r/507906 (owner: 10Ayounsi)
[00:29:43] (03PS5) 10Dzahn: LibreNMS, file files permission, add app key, add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[00:29:54] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[00:30:28] PROBLEM - Host elastic2038 is DOWN: PING CRITICAL - Packet loss = 100%
[00:31:52] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9443): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:32:03] !log powercycling elastic2038
[00:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:02] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, number_of_nodes: 14, active_shards: 3158, unassigned_shards: 225, active_primary_shards: 1128, initializing_shards: 0, delayed_unassigned_shards: 225, status: yellow, number_of_data_nodes: 14, number_of_pending_tasks: 0, relocating_shards: 0
[00:33:02] ight_fetch: 0, active_shards_percent_as_number: 93.34909843334319, task_max_waiting_in_queue_millis: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
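The health check above simply polls the Elasticsearch cluster health endpoint; a manual spot-check can be done the same way (a sketch, assuming access to the internal service endpoint and jq installed):

    curl -s 'https://search.svc.codfw.wmnet:9443/_cluster/health' | jq '.status, .unassigned_shards'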
lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [00:34:33] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16303/" [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [00:34:38] RECOVERY - Host elastic2038 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [00:34:47] (03CR) 10Ayounsi: [C: 03+2] LibreNMS, file files permission, add app key, add logrotate [puppet] - 10https://gerrit.wikimedia.org/r/507716 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [00:37:50] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [00:38:46] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Dzahn) @Gehel It crashed today. Just went down and had no console output. Then i powercycled it.. Then it was back. Note the service came back before the host was back..... [00:42:04] RECOVERY - puppet last run on schema2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:43:42] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [00:44:47] 10Operations: cronspam from smart-data-dump due to facter bug - https://phabricator.wikimedia.org/T222326 (10Dzahn) p:05Triage→03Normal Thanks! That stopped the spam, so priority isn't that high now. [00:45:01] 10Operations, 10observability, 10Patch-For-Review: LibreNMS upgrade to 1.51 - https://phabricator.wikimedia.org/T207706 (10ayounsi) 05Open→03Resolved a:03ayounsi Everything is done. Future upgrades should be much smoother. [00:47:40] 04Critical Testing transport from LibreNMS [00:51:32] XioNoX: ^ this is a good thing?:) [00:51:47] yeah, I pushed the "test" button [00:51:51] :) [00:52:06] nice that the upgrade is done now :) [00:53:32] 10Operations, 10Cloud-Services, 10netops, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10ayounsi) 05Open→03Resolved 208.80.155.128/25 has been cleaned up from Netbox and DNS (see above). [01:12:31] 08̶W̶a̶r̶n̶i̶n̶g Device cr3-ulsfo.wikimedia.org recovered from Juniper environment status [01:12:37] 08̶W̶a̶r̶n̶i̶n̶g Device cr4-ulsfo.wikimedia.org recovered from Juniper environment status [01:12:47] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-eqord.wikimedia.org recovered from Juniper environment status [01:15:21] god that yellow is obnoxious [01:16:42] should be green :) [01:16:58] it's not a warning. it's "no more warning" [01:17:07] More it's not readable on the grey background [01:17:22] well, that's black for me [01:17:33] https://imgur.com/a/cNyxlu6 [01:17:39] It's a little better zoomed in... [01:21:50] looks nice on black background :) [01:22:14] needs skin support and per-user settings :p [01:22:34] (03PS1) 10Alex Monk: network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) [01:23:07] 08,01 ̶W̶a̶r̶n̶i̶n̶g [01:23:12] Reedy: Looks fine to me. 
[01:23:12] Reedy: Looks fine to me. ;-) https://usercontent.irccloud-cdn.com/file/468mrdhG/image.png
[01:24:46] I think I have the worst
[01:25:00] https://i.imgur.com/kz2Q4rM.png
[01:25:51] yours looks much better Krenair
[01:26:10] Can at least read yours
[01:26:44] mine looks like: https://phabricator.wikimedia.org/F28909362
[01:26:59] about as bad as mine :P
[01:27:06] lol
[01:27:40] yeah true my shade of yellow is better, but mine can't even handle the first character
[01:29:45] software sucks
[01:31:32] Character encoding and rendering sucks.
[01:31:40] If only we'd invented computers before writing. ;-)
[01:31:59] heh
[01:35:53] Reedy: https://gist.github.com/TehNut/8f58c78170be56abd8b4
[01:36:21] DarkenQuassel?:)
[01:36:50] https://i.imgur.com/amMKaJz.png
[02:21:42] 10Operations, 10Wikimedia-Mailing-lists: Close the engineering mailing list - https://phabricator.wikimedia.org/T222308 (10Tgr) Not sure what counts as consensus for something like this, but +1 from me: 100% of the traffic is CC-d from the wikitech or ops list; it's too underspecified to be useful to the peopl...
[03:13:50] PROBLEM - puppet last run on analytics1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:25:08] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:40:26] RECOVERY - puppet last run on analytics1058 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:45:48] PROBLEM - puppet last run on restbase1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:51:46] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:12:22] RECOVERY - puppet last run on restbase1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:25:59] (03PS1) 10Ayounsi: LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706)
[04:27:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:28:29] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/16304/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi)
[04:34:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[04:40:23] weird, I was just getting served pages without styling
[04:46:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:54:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[04:55:09] <_joe_> !log progressively depooling parsoid servers in codfw to assess load tolerance
[04:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:19] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=parsoid,dc=codfw,name=wtp20(1[7-9]|20).*
[04:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:36] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:03:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:03:42] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:03:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:04:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:04:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:04:52] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:05:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:05:02] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:05:04] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:05:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:05:18] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:05:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=parsoid,dc=codfw,name=wtp201[5-6].*
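The depools logged above go through conftool; a minimal sketch of the confctl invocation behind such a logged action (selector copied from the log line, sudo and exact CLI form assumed from conftool's documented usage, so treat it as illustrative):

    sudo confctl select 'cluster=parsoid,dc=codfw,name=wtp201[5-6].*' set/pooled=no
    # and the corresponding repool once done:
    sudo confctl select 'cluster=parsoid,dc=codfw,name=wtp201[5-6].*' set/pooled=yes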
[05:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:22] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:06:38] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:06:50] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:06:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:07:16] FYI - getting back 503 errors when loading user pages.
[05:08:59] PROBLEM - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2844 bytes in 1.420 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:09:08] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1527 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[05:09:46] Meta is giving me 503s ... probably what icinga-wm is going on about
[05:09:55] twinkleoptions.js (probably others) not loading either
[05:10:08] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[05:10:28] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2403 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[05:11:25] PROBLEM - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2775 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:11:28] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[05:11:38] PROBLEM - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2818 bytes in 1.658 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:14:04] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2825 bytes in 0.455 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:14:06] RECOVERY - LVS HTTPS IPv6 on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16151 bytes in 0.302 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:14:31] Packet loss and other errors are decreasing
[05:14:52] * Oshwah is watching the graphs
[05:15:06] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:15:10] JJMC89: Try now
[05:15:20] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2817 bytes in 0.461 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:15:31] Might still get errors, but they seem to be recovering...
[05:16:18] Oshwah: It's getting better. Still some 503s when loading some js.
[05:16:19] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16137 bytes in 0.529 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:16:27] JJMC89: Same with me
[05:16:28] the stylesheets aren't loading for me at https://en.wikipedia.org/wiki/User:MusikAnimal
[05:16:42] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16189 bytes in 0.484 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:17:03] My scripts aren't loading or pulling any data (my live scripts are getting errors with the API)
[05:17:06] I was seeing some problems with stylesheets earlier
[05:17:10] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.01207 https://grafana.wikimedia.org/dashboard/db/logstash
[05:17:12] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04798 https://grafana.wikimedia.org/dashboard/db/logstash
[05:17:13] but it appears okay now?
[05:17:28] Krenair: Page loading on https looks better
[05:17:34] Yea, styles weren't loading a little bit ago for me too.
[05:17:47] .js / .css still having some issues
[05:18:21] RECOVERY - LVS HTTPS IPv6 on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16202 bytes in 1.300 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:19:17] strangely, the CSS loads in incognito or another browser, but clearing the cache in my main browser window doesn't work
[05:19:26] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16199 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:19:45] <_joe_> uhm
[05:19:53] musikanimal: Interesting
[05:19:56] <_joe_> can I ask you in which continent are you located?
[05:20:08] Everything now looks A-OK; keeping watch...
[05:20:19] proudly and openly a New Yorker
[05:20:53] I'm in the U.S.
[05:20:55] aha, I think it was the DNS cache. Clearing that worked
[05:21:00] RECOVERY - LVS HTTPS IPv4 on text-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16190 bytes in 1.264 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:21:11] or it was just coincidence
[05:21:26] not sure there's been any big DNS changes overnight...
[05:22:08] nope, nothing in dns.git
[05:22:10] I saw problems earlier and I'm in Europe
[05:22:33] coincidence then. Everything looks good now
[05:22:53] so only user pages were affected here? That was my observation
[05:23:12] <_joe_> musikanimal: indeed, coincidence
[05:23:36] musikanimal: I was getting issues all over meta and enwiki.
[05:25:00] musikanimal, I encountered it browsing enwiki mainspace
[05:26:16] I had it just for userpages, and only certain userpages. I kept hitting "Random article" and all of those loaded fine
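Flushing the local DNS cache, as musikanimal tried above, varies by client; two common variants (illustrative only, and unrelated to any server-side fix):

    sudo systemd-resolve --flush-caches                            # Linux with systemd-resolved
    sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder  # macOS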
[05:26:25] <_joe_> the problem was one varnish server with issues
[05:26:29] <_joe_> more or less known
[05:26:34] <_joe_> in our main active dc
[05:26:38] <_joe_> https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now
[05:26:42] <_joe_> see cp1077
[05:27:30] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:34] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:27:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:27:36] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:36] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:38] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:27:54] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:28:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:29:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:29:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:29:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:10] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:30:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:18] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:30:26] I am also running into issues. My UTC clock disappeared on enwiki
[05:30:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:30:34] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:30:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:31:20] _joe_, looks like maybe the problems aren't over?
[05:31:31] <_joe_> sigh
[05:31:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:31:36] <_joe_> indeed
[05:31:46] maybe time to depool that varnish?
[05:32:09] <_joe_> Krenair: no it needs a restart
[05:32:33] my UTC clock is now back, for now at least
[05:32:42] _joe_: I'm around, can I help with anything?
[05:32:47] <_joe_> !log restarting varnish backend on cp1077
[05:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:05] <_joe_> XioNoX: this ^^ should solve things in theory
[05:33:28] ok, don't hesitate to ping me if needed
[05:33:58] I'm still seeing problems
[05:34:04] <_joe_> I'm not sure that's the only issue btw
[05:34:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:20] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:34:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
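"Restarting varnish backend on cp1077" means restarting the backend Varnish instance on that cache host; later in this log the same step appears wrapped by a helper ("cp1089: varnish-backend-restart") that also handles depooling. The underlying action is roughly this (unit name and wrapper behavior assumed, so treat it as a sketch rather than the exact WMF procedure):

    sudo systemctl restart varnish.service    # backend varnishd instance; the wrapper depools/repools around this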
[05:34:36] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:34:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:47] "Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes."
[05:34:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:34:58] Still getting 503 error returns when looking at page history
[05:34:58] when I tried loading enwiki spi @_joe_
[05:35:05] I'm refreshing https://en.wikipedia.org/wiki/Emily_Watson and getting 503s
[05:35:14] (while logged in)
[05:35:24] 503 logged in as well
[05:35:26] intermittently
[05:35:29] better way to describe my error
[05:35:30] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:35:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:35:38] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:35:43] <_joe_> Krenair: I can load that just fine
[05:35:44] same as Krenair
[05:35:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:35:52] cp1085
[05:35:56] <_joe_> but can you post me your X-cache header?
[05:35:56] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:35:58] <_joe_> ok
[05:36:00] <_joe_> as I thought
[05:36:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:36:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:36:12] <_joe_> I can't really restart that as well now
[05:36:22] my x-cache header... x-cache
[05:36:22] cp1085 int, cp3032 miss, cp3030 pass
[05:36:44] Graphs are starting to curve back into the direction of when the original issue was present just earlier...
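The X-cache header _joe_ asked for lists the cache hosts that handled the request and what each did; "cp1085 int, cp3032 miss, cp3030 pass" above points at cp1085 generating an internal response. It is easy to capture with a plain HEAD request (a sketch; header name as shown in the paste above):

    curl -sI 'https://en.wikipedia.org/wiki/Emily_Watson' | grep -i '^x-cache'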
[05:36:54] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:36:56] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:37:02] But not all of them (yet?)
[05:38:03] * Oshwah is keeping watch
[05:38:14] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:39:45] it seems to be working okay for me now
[05:40:06] ^
[05:40:34] Graphs are recovering. A much smaller blip than the earlier 503 errors...
[05:40:52] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:41:13] <_joe_> we are on it, I also called other people to help
[05:41:18] <_joe_> we will keep an eye
[05:41:26] <_joe_> I don't think we're out of the woods, still
[05:41:54] * Oshwah panics and starts screaming
[05:41:55] kk np
[05:42:04] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:42:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:42:10] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:12] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:28] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[05:42:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:42:42] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:43:06] Oshwah: https://www.youtube.com/watch?v=VnT7pT6zCcA
[05:43:26] <_joe_> Izhidez: oh that's something I lived better not seeing :DD
[05:43:35] lol
[05:43:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:43:44] lolol
[05:43:44] Oh, beeker...
[05:43:54] Beaker*
[05:46:24] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[05:47:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:47:26] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[05:47:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:47:54] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[05:49:18] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:51:05] * TheSandDoctor has never seen a wild panicking @Oshwah before :D :P
[05:55:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:56:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:57:12] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[05:58:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:03:46] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[06:05:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:20:58] (03CR) 10Smalyshev: wdqs: add WDQS restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe)
[06:28:54] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:46:25] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) Thanks Chris! Once @robh has added the production DNS entries I can take over and install them myself :-)
[07:06:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:06:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:07:46] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:07:46] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[07:07:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:07:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:08:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:08:10] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:10:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:11:42] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:11:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:11:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:11:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:12:06] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:12:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:12:20] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:16:28] !log cp1089: varnish-backend-restart
[07:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:54] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[07:17:10] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[07:17:26] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[07:18:05] 10Operations, 10netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (10elukey)
[07:21:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:22:14] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:39:53] (03Abandoned) 10Vgutierrez: authdns: Avoid caching dns-01 challenges [puppet] - 10https://gerrit.wikimedia.org/r/503935 (https://phabricator.wikimedia.org/T219414) (owner: 10Vgutierrez)
[07:43:40] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez)
[07:43:57] (03PS1) 10Ema: cache: reimage cp4024 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507915 (https://phabricator.wikimedia.org/T219967)
[07:45:52] !log depool cp4024 and reimage as upload_ats T219967
[07:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:57] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[07:47:07] (03CR) 10Ema: [C: 03+2] cache: reimage cp4024 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507915 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema)
[07:50:34] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:50:47] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4024.ulsfo.wmnet'] ` The log can be...
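The "Check systemd state ... degraded" alerts here (ms-be2013, and netmon2001 earlier) fire whenever any unit on the host is in a failed state; identifying the culprit is plain systemd (no WMF-specific tooling assumed):

    systemctl is-system-running   # prints "degraded" if any unit has failed
    systemctl --failed            # lists the failed unit(s)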
[07:50:56] (03CR) 10Vgutierrez: "basically a NOOP for ATS instances with plain text inbound: https://puppet-compiler.wmflabs.org/compiler1002/16305/" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[07:55:48] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational
[07:59:15] (03CR) 10Vgutierrez: "cosmetic changes in records.config for existing instances with caching enabled caused by regrouping of cache-related settings in the templ" [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez)
[08:00:40] (03CR) 10Vgutierrez: [C: 03+2] config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:00:59] (03CR) 10Vgutierrez: [C: 03+2] dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 (owner: 10Vgutierrez)
[08:01:04] (03CR) 10Vgutierrez: [C: 03+2] CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez)
[08:01:07] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:01:11] (03CR) 10Vgutierrez: [C: 03+2] Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:02:12] (03Merged) 10jenkins-bot: config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:02:33] (03Merged) 10jenkins-bot: dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 (owner: 10Vgutierrez)
[08:03:45] (03CR) 10jenkins-bot: config: Move ACMEChiefConfig to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507801 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:03:47] (03Merged) 10jenkins-bot: CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez)
[08:03:49] (03Merged) 10jenkins-bot: acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:03:51] (03Merged) 10jenkins-bot: Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:04:07] (03CR) 10jenkins-bot: dns: Move DNS operations to its own module [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507802 (owner: 10Vgutierrez)
[08:06:08] (03CR) 10jenkins-bot: acme_chief: Prevalidate CN/SNI list [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507804 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:06:10] (03CR) 10jenkins-bot: CI: Run tests with minimum and latest dependencies [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507803 (https://phabricator.wikimedia.org/T213820) (owner: 10Vgutierrez)
[08:06:13] (03CR) 10jenkins-bot: Release 0.17 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/507805 (https://phabricator.wikimedia.org/T220518) (owner: 10Vgutierrez)
[08:06:59] 10Operations, 10Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (10Vgutierrez)
[08:09:04] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[08:13:02] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:27:24] !log starting table recompression on new backup source hosts on eqiad and codfw (stop replication) T220572
[08:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:29] T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host - https://phabricator.wikimedia.org/T220572
[08:33:05] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4024.ulsfo.wmnet'] ` and were **ALL** successful.
[08:38:57] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[08:44:04] (03Abandoned) 10Awight: [DNM] Enable Extension:JADE in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440124 (https://phabricator.wikimedia.org/T183381) (owner: 10Awight)
[08:47:40] !log pool cp4024 w/ ATS backend T219967
[08:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:44] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967
[08:49:37] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572)
[08:58:23] RECOVERY - EDAC syslog messages on db1107 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1107&var-datasource=eqiad+prometheus/ops
[09:03:41] (03PS17) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[09:05:25] (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[09:12:20] (03CR) 10ArielGlenn: [C: 03+2] Update DCAT-AP link [dumps/dcat] - 10https://gerrit.wikimedia.org/r/497435 (owner: 10Awight)
[09:17:48] (03PS18) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[09:20:59] !log ban elastic2038 from elastic clusters pending memory issue investigation - T217398
[09:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:03] T217398: elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398
[09:24:46] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Gehel) It looks like we need to investigate this a bit more >>! In T217398#4999270, @Papaul wrote: > The next step will be to monitor the system and see if we have the s...
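Banning a node as logged above is conventionally done by telling Elasticsearch not to allocate shards to it, so data drains off while the node stays reachable; a sketch using the stock cluster-settings API (endpoint reused from the earlier health check; the exact WMF tooling may wrap this differently):

    curl -s -XPUT 'https://search.svc.codfw.wmnet:9443/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"transient": {"cluster.routing.allocation.exclude._name": "elastic2038*"}}'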
[09:25:04] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Gehel) 05Resolved→03Open
[09:26:13] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 26.77 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:27:08] 10Operations, 10Discovery-Search (Current work): Decrease shard alert threshold for omega and psi elasticsearch clusters - https://phabricator.wikimedia.org/T222432 (10Gehel)
[09:30:29] (03PS19) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932)
[09:30:33] (03PS1) 10Ammarpad: Add localized project logo for sahwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507931 (https://phabricator.wikimedia.org/T222065)
[09:31:07] (03CR) 10Mathew.onipe: "PCC is Ok: https://puppet-compiler.wmflabs.org/compiler1002/16308/" [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe)
[09:35:04] (03PS1) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932
[09:35:36] 10Operations, 10Discovery-Search (Current work): Decrease shard alert threshold for omega and psi elasticsearch clusters - https://phabricator.wikimedia.org/T222432 (10Gehel) 05Open→03Invalid Actually, the check timed out. Which make sense if it was routed to the problematic server, before it was marked in...
[09:35:55] (03CR) 10jerkins-bot: [V: 04-1] Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (owner: 10星耀晨曦)
[09:37:06] the alert above is due to an incoming traffic spike we got in eqsin-text: https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&from=1556869777118&to=1556876197358&var-site=eqsin&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4
[09:39:23] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 99.46 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:42:47] (03PS2) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933)
[09:43:14] (03PS3) 10Jbond: prometheus: add timeout paramter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561
[09:43:48] (03PS4) 10Jbond: prometheus: add timeout paramter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561
[09:44:25] (03CR) 10Jbond: "> Patch Set 1:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond)
[09:45:00] (03PS5) 10Jbond: Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561
[09:45:49] (03PS1) 10Ema: cache: reimage cp4025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507935 (https://phabricator.wikimedia.org/T219967)
[09:49:24] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=parsoid,dc=codfw,name=wtp201[3-4].*
[09:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:27] !log depool cp4025 and reimage as upload_ats T219967
[09:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [09:49:55] (03CR) 10jerkins-bot: [V: 04-1] Prometheus: add timeout parameter to query method [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [09:51:06] (03CR) 10Ema: [C: 03+2] cache: reimage cp4025 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/507935 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [09:52:12] heads-up: I’m about to deploy a backport for T222347 (UBN!, as far as I know those are okay to fix on Fridays) [09:52:13] T222347: wbsearchentities now returns an error with type=lexeme - https://phabricator.wikimedia.org/T222347 [09:55:16] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4025.ulsfo.wmnet'] ` The log can be... [09:57:31] backport is on mwdebug1002, currently testing [10:00:19] uploading files on Commons still works [10:00:22] and so does editing captions [10:00:35] so I assume the gate-and-submit failure on that backport did not indicate a real problem [10:00:36] syncing [10:01:36] great :) [10:02:20] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/WikibaseLexemeCirrusSearch/: [[gerrit:507847|Fix reference to classes that moved (T222347)]] (duration: 00m 55s) [10:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:24] T222347: wbsearchentities now returns an error with type=lexeme - https://phabricator.wikimedia.org/T222347 [10:03:27] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [10:05:19] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 330 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:07:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) ` => ctrl slot=0 pd all show status physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 146 GB): OK physicaldrive 1I:1:2 (port 1I:box 1:bay 2,... [10:10:01] (03PS2) 10Giuseppe Lavagetto: Allow proxyfetch to check more than one url at a time [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 [10:10:48] (03CR) 10Giuseppe Lavagetto: Allow proxyfetch to check more than one url at a time (035 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/507740 (owner: 10Giuseppe Lavagetto) [10:11:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [10:13:31] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:13:52] checking mw1275 [10:17:33] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:17:44] (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [10:19:51] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2086526 https://wikitech.wikimedia.org/wiki/Varnish [10:25:30] !log Depool mw1275 [10:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [10:33:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [10:33:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [10:34:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... [10:34:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['cloudvirt1007.eqiad.wmnet'] ` [10:35:21] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:37:39] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4025.ulsfo.wmnet'] ` and were **ALL** successful. [10:40:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1007.eqiad.wmnet ` The log can be... 
[10:41:28] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [10:42:14] !log T220380 upload zuul_2.5.1-wmf7 to jessie-wikimedia [10:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:19] T220380: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 [10:43:28] (03PS7) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [10:43:30] (03PS7) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [10:43:49] !log T220380 remove zuul_2.5.0-8-gcbc7f62-wmf4jessie1 from jessie-wikimedia/thirdparty [10:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:31] PROBLEM - Varnish traffic logger - varnishstatsd on cp4025 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [10:44:33] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4025 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [10:44:45] PROBLEM - IPsec on cp4025 is CRITICAL: NRPE: Command check_IPsec not defined [10:44:51] ignore, that's me ^ [10:44:53] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10jbond) 05Open→03Resolved a:03jbond I have uploaded the new package and removed the thirdparty package; let me know if you see any issues ` contint2001 ~... [10:44:59] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1982 bytes in 0.840 second response time https://wikitech.wikimedia.org/wiki/Varnish [10:45:21] will be fixed in a minute when puppet runs on icinga1001 [10:46:21] (icinga still thinks cp4025 runs varnish as the backend software, ha) [10:46:36] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf7 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10hashar) Looks all good. Thank you very much :) [10:47:49] !log pool cp4025 w/ ATS backend T219967 [10:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:53] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [10:50:06] (03PS1) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [10:51:32] (03PS24) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [10:56:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:57:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) >>! In T196066#5153205, @Ottomata wrote: > Hm but also, whatever we replace varnishkafka with...
[10:57:27] (03PS17) 10Vgutierrez: trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) [10:57:29] (03PS4) 10Vgutierrez: prometheus: Support several instances of the trafficserver exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) [10:57:31] (03PS3) 10Vgutierrez: nagios_common: Provide check_https_hostheader_port_url check [puppet] - 10https://gerrit.wikimedia.org/r/507006 (https://phabricator.wikimedia.org/T221594) [10:57:33] (03PS6) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [10:57:35] (03PS29) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:57:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:58:27] looks like a spike, already recovered [10:59:06] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1139 and db1140 [puppet] - 10https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572) [10:59:08] (03PS1) 10Jcrespo: mariadb: Add db1139 and db1140 mysql instances to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/507942 (https://phabricator.wikimedia.org/T220572) [10:59:26] (03PS30) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:59:50] (03PS1) 10Matthias Mullie: SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) [11:00:52] (03CR) 10Jcrespo: [C: 03+2] mariadb: Add db1139 and db1140 mysql instances to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/507942 (https://phabricator.wikimedia.org/T220572) (owner: 10Jcrespo) [11:01:01] (03PS2) 10Jcrespo: mariadb: Add db1139 and db1140 mysql instances to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/507942 (https://phabricator.wikimedia.org/T220572) [11:01:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:06:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:08:15] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) > I wouldn't spend much time on jessie problems but rather focusing moving to stretch or buster at this point. That kind of contr... 
[11:10:29] <_joe_> !log purging opcache on mw1275 [11:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:07] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 81126 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/Application_servers [11:11:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) The trick was to manually set PXE boot for the debian installer and while the installer is working manually switch to disk boot. ` hpiLO... [11:16:00] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) >>! In T222112#5153856, @CDanis wrote: > I think you should just be able to remove the "custom all value" in the dashboard settings and hav... [11:18:29] (03PS1) 10Jcrespo: backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - 10https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) [11:25:57] (03CR) 10Cparle: [C: 03+1] SDC: Enable feature flag for depicts in UW on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507943 (https://phabricator.wikimedia.org/T217024) (owner: 10Matthias Mullie) [11:36:52] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1007.eqiad.wmnet'] ` and were **ALL** successful. [11:48:56] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) 05Open→03Resolved a:03hashar So my primary concern was having a freshly created WMCS instance to be broken on creation. I th... [11:50:05] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:50:43] 10Operations: cron-spam: /usr/local/sbin/check-cumin-aliases - https://phabricator.wikimedia.org/T222443 (10jbond) p:05Triage→03Normal [11:52:13] (03PS1) 10Jbond: cumin: update list of cumin_masters [puppet] - 10https://gerrit.wikimedia.org/r/507948 (https://phabricator.wikimedia.org/T222443) [11:53:59] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:06:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: repool cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507949 (https://phabricator.wikimedia.org/T221047) [12:07:19] (03CR) 10Arturo Borrero Gonzalez: "Please andrew +1 (or even +2 and merge :-P)" [puppet] - 10https://gerrit.wikimedia.org/r/507949 (https://phabricator.wikimedia.org/T221047) (owner: 10Arturo Borrero Gonzalez) [12:08:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) [12:08:46] (03PS20) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:08:48] (03PS1) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [12:09:02] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10aborrero) [12:09:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10aborrero) 05Open→03Resolved [12:15:15] (03PS2) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [12:17:32] (03PS3) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [12:17:34] (03PS21) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:21:23] !log cp3038: varnish-backend-restart [12:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:35] RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [12:26:07] !log replaying 30 minutes of eqiad search traffic on codfw - T221121 [12:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:11] T221121: Capacity planning for elastic search - https://phabricator.wikimedia.org/T221121 [12:28:34] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10fgiunchedi) No worries at all @hashar ! This upgrade problem was unexpected and rather annoying for sure, my expectation is that subseque... 
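The expiry mailbox lag cleared by the cp3038 backend restart above is the backlog between Varnish's worker threads and its expiry thread. A minimal sketch of a probe along those lines, assuming the lag is simply the difference between the MAIN.exp_mailed and MAIN.exp_received counters reported by `varnishstat -j` (the counter names and the threshold here are assumptions for illustration, not taken from the production check):
```python
#!/usr/bin/env python3
"""Sketch of an expiry-mailbox-lag probe for a Varnish backend."""
import json
import subprocess
import sys

CRITICAL = 2_000_000  # hypothetical threshold; cp3038 alerted above at 2086526


def mailbox_lag() -> int:
    # `varnishstat -j` prints all counters once, as a JSON object keyed by name
    stats = json.loads(subprocess.check_output(["varnishstat", "-j"]))
    mailed = stats["MAIN.exp_mailed"]["value"]
    received = stats["MAIN.exp_received"]["value"]
    return mailed - received


if __name__ == "__main__":
    lag = mailbox_lag()
    if lag >= CRITICAL:
        print(f"CRITICAL: expiry mailbox lag is {lag}")
        sys.exit(2)
    print(f"OK: expiry mailbox lag is {lag}")
    sys.exit(0)
```
A restart resets both counters, which is consistent with the lag reading 0 immediately after the `varnish-backend-restart` above.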
[12:30:41] (03PS22) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:36:53] (03CR) 10Mathew.onipe: "PCC is ok: https://puppet-compiler.wmflabs.org/compiler1002/16313/" [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:40:30] (03PS1) 10Ema: varnish: retry requests upon 502 errors [puppet] - 10https://gerrit.wikimedia.org/r/507953 (https://phabricator.wikimedia.org/T219967) [12:44:02] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) @akosiaris and @WMDE-leszek I think the needed changes to proceed with the helm chart are now done. Feel free to poke m... [12:48:03] (03PS2) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [12:48:57] (03PS3) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [12:49:48] (03PS4) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [12:54:26] (03PS5) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:01:28] (03PS6) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:10:17] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [13:13:44] (03CR) 10Mathew.onipe: wdqs: add WDQS restart cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [13:13:57] (03PS2) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [13:22:48] (03PS2) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [13:33:15] PROBLEM - proton endpoints health on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [13:33:21] PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:33:49] PROBLEM - Check size of conntrack table on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:34:01] PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:34:03] PROBLEM - Disk space on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:34:05] PROBLEM - configured eth on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:34:13] PROBLEM - dhclient process on proton1001 is CRITICAL: 
connect to address 10.64.0.20 port 5666: Connection refused [13:34:33] PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:37:01] PROBLEM - puppet last run on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:39:18] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10jijiki) [13:39:29] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10jijiki) p:05Triage→03Normal [13:45:23] PROBLEM - Check the NTP synchronisation status of timesyncd on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [13:46:52] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) Note to self: remember that doing the above breaks all the kafka graphs [13:49:08] !log Restart nrpe on proton1001 [13:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:37] RECOVERY - Check size of conntrack table on proton1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [13:49:51] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational [13:49:53] RECOVERY - Disk space on proton1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:49:55] RECOVERY - configured eth on proton1001 is OK: OK - interfaces up [13:50:03] RECOVERY - dhclient process on proton1001 is OK: PROCS OK: 0 processes with command name dhclient [13:50:25] RECOVERY - DPKG on proton1001 is OK: All packages OK [13:50:27] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [13:50:31] RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:50:31] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:52:08] (03PS7) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:52:59] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures [13:55:47] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [14:07:25] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) It seems that the following happens when using the default all value (using the current cpu usage query since it ends up in the same proble... [14:11:00] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) This works with the default all value: ` avg by (instance) (irate(node_cpu{cluster="$cluster",mode!="idle",instance=~"($kafka_broker).*"}[...
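All of the proton1001 alerts above share one symptom: Icinga could not reach the NRPE agent on TCP port 5666, so every NRPE-backed check went CRITICAL at once, and a single nrpe restart cleared them together. A minimal sketch of that kind of reachability probe (the host and port come from the alert text; everything else is illustrative, not the actual monitoring plugin):
```python
#!/usr/bin/env python3
"""Sketch: bare TCP reachability probe for an NRPE agent.
When this fails, every NRPE check on the host goes CRITICAL together,
which is the pattern seen for proton1001 above."""
import socket
import sys

HOST = "10.64.0.20"  # proton1001, from the alert text
PORT = 5666          # standard NRPE port


def nrpe_reachable(host: str, port: int, timeout: float = 10.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:  # covers "Connection refused" as well as timeouts
        print(f"connect to address {host} port {port}: {exc.strerror or exc}")
        return False


if __name__ == "__main__":
    sys.exit(0 if nrpe_reachable(HOST, PORT) else 2)
```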
[14:13:03] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Reedy) [14:15:41] RECOVERY - Check the NTP synchronisation status of timesyncd on proton1001 is OK: OK: synced at Fri 2019-05-03 14:15:40 UTC. [14:17:47] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Lucas_Werkmeister_WMDE) That deployment (fixing T222347) was a backport ([I3e4bf4b12d](https://gerrit.wikimedia.org/r/507847)), which I force-submitted because... [14:19:54] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10jijiki) @Lucas_Werkmeister_WMDE We will look into it, it only happened on a single server so we believe, for now, that it could not be related to the change pe... [14:26:15] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10elukey) Swapped all the occurrences of `instance=~"$kafka_broker"` with `instance=~"($kafka_broker).*"`, and the dashboard seems loading faster now... [14:32:28] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) I don't think Magnus would build it into librdkafka itself. But, a 3rd party C library could b... [14:42:34] (03CR) 10Ottomata: "Nit: I think this repo might better belong in operations/software, rather than operations/debs. I think operations/debs exists for packa" [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [14:43:10] 10Operations, 10observability, 10Wikimedia-Incident: figure out why Kafka dashboard hammers Prometheus, and fix it - https://phabricator.wikimedia.org/T222112 (10CDanis) 05Open→03Resolved a:03CDanis It does seem //much// faster now, thanks @elukey ! Impact of loading 30 days on Prometheus is also mini... 
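The dashboard fix described above anchors the instance matcher as `instance=~"($kafka_broker).*"` so the template variable no longer needs a custom all value that expands into an expensive unanchored regex. A rough harness for comparing how long two PromQL variants take, written as a sketch against the standard Prometheus HTTP API (`/api/v1/query`); the server URL and the example queries are placeholders, not the actual dashboard expressions:
```python
#!/usr/bin/env python3
"""Sketch: time PromQL variants via the Prometheus HTTP API to compare
the cost of different instance label matchers."""
import time
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder URL

QUERIES = [
    # hypothetical expansions of two matcher styles for one broker
    'avg by (instance) (irate(node_cpu{mode!="idle",instance=~"kafka1001.*"}[5m]))',
    'avg by (instance) (irate(node_cpu{mode!="idle",instance=~"(kafka1001).*"}[5m]))',
]


def timed_query(query: str) -> float:
    start = time.monotonic()
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    resp.raise_for_status()  # fail loudly on HTTP or query errors
    return time.monotonic() - start


if __name__ == "__main__":
    for q in QUERIES:
        print(f"{timed_query(q):6.3f}s  {q}")
```
Running each variant a few times against a test Prometheus is a quick way to confirm a dashboard change like the one above before editing every panel.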
[14:45:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:45:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:53:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:55:05] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:55:43] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP Fatal Errors on mw1275 after deployment - https://phabricator.wikimedia.org/T222452 (10Joe) some things from my very initial analysis: - I tried to purge first the directory that the deployment had invalidated, the error didn't go away - I tried p... [14:56:14] <_joe_> !log repool mw1275 [14:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:05] <_joe_> !log repooling the wtp* servers depooled in codfw for load testing [15:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:56] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=parsoid,dc=codfw [15:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:51] any ops member who would be able to help me with https://phabricator.wikimedia.org/T222033 ? [15:29:04] (03PS1) 10ArielGlenn: make page ranges for stubs be ints [dumps] - 10https://gerrit.wikimedia.org/r/507976 [15:29:06] (03PS1) 10ArielGlenn: convert exception error strings to utf8, thanks a lot python3 [dumps] - 10https://gerrit.wikimedia.org/r/507977 [15:31:18] _joe_: Any interest in an apache change for revi? :P [15:31:35] well, realized not right now because of some OS work going on [15:33:06] <_joe_> Reedy: on friday? [15:33:17] If you're feeling brave [15:33:22] <_joe_> you were trolling right [15:33:24] If not, scheduling to do it next week :) [15:33:28] <_joe_> no that would be feeling stupid [15:33:47] <_joe_> apache on friday, never [15:33:49] <_joe_> did once [15:33:53] <_joe_> ruined my weekend [15:34:21] at least gimme some timeframe :P [15:48:25] _joe_: can we at least agree that we can (definitely) do it next week? :P [15:48:46] <_joe_> revi: what change?
[15:48:51] <_joe_> and sure next week [15:48:52] https://phabricator.wikimedia.org/T222033 [15:49:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/506895 [15:53:26] (03CR) 10Alex Monk: [C: 04-1] elastalert: new module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [15:54:07] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:55:27] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:55:34] <_joe_> uh, utf-8 [16:00:44] buuuu memcached [16:00:54] 10Operations, 10media-storage, 10observability: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (10Dzahn) p:05Triage→03Normal [16:05:29] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) Yup looks all right so far :-] [16:05:39] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:14] anybody working on cp2009? [16:06:29] (03CR) 10Dzahn: "yea, that's right. there used to be the redirects.conf and redirects.dat and you had to upload both but that changed and is not the case anymore" [puppet] - 10https://gerrit.wikimedia.org/r/506895 (https://phabricator.wikimedia.org/T222033) (owner: 10Revi) [16:06:30] CC: ema --^ [16:07:34] (03CR) 10Hashar: "And that is apparently a noop as far as CI/zuul/jenkins-bot are concerned :]" [puppet] - 10https://gerrit.wikimedia.org/r/504973 (https://phabricator.wikimedia.org/T182756) (owner: 10Hashar) [16:07:51] elukey: i tried to get on mgmt and interesting response: [16:07:51] /usr/bin/clpd: Input/output error [16:08:05] very nice from cp2009 [16:08:15] elukey: I'm not working on it, no. It's part of the ATS test cluster though, so not serving prod traffic [16:08:23] ack then :) [16:08:25] ty [16:08:46] should we make a ticket for the mgmt of it? [16:09:03] might need DRAC reset or something [16:09:15] (03PS3) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [16:09:28] (03CR) 10Hashar: [C: 03+1] puppet_compiler: add cron to delete old output files [puppet] - 10https://gerrit.wikimedia.org/r/507623 (https://phabricator.wikimedia.org/T222072) (owner: 10Dzahn) [16:11:09] when we have alerts for hosts that are part of test clusters it makes me wonder whether we should have puppet code that skips icinga if defined as test cluster [16:11:39] but then.. you might want to know when test hosts have issues too.. for the purpose of testing [16:12:09] mutante: to be fair, that *was* a prod cluster and is now a 'test' cluster simply because no frontends are routed to those hosts [16:12:21] they're gonna become actually test hosts once I find the time to reimage them :) [16:12:53] ACKNOWLEDGEMENT - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn not serving prod traffic [16:12:57] ema: ah!
alright, got it [16:13:48] (03CR) 10Alex Monk: "Needs rebasing, there is now a object_replicator_concurrency parameter to swift::storage which doesn't necessarily fit in well with this?" [puppet] - 10https://gerrit.wikimedia.org/r/344387 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [16:16:50] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) [16:17:10] (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [16:22:49] (03PS1) 10Paladox: Merge tag 'v2.16.8' into HEAD [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507987 [16:24:15] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) [16:24:40] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) [16:26:17] 10Operations, 10ops-codfw: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) And yea.. why did the host die? Is it possible to reboot ? [16:28:11] 10Operations, 10ops-codfw, 10Traffic: cp2009 down and mgmt console not reachable - https://phabricator.wikimedia.org/T222459 (10Dzahn) p:05Triage→03Normal [16:36:09] 10Operations, 10ops-codfw: pull decom hardware and ship to Harry/OIT @ SF office - https://phabricator.wikimedia.org/T222383 (10HMarcus) Hi all, While in the process of swapping out our old equipment with new DC hardware, I noticed there are several spare SSDs with you guys as well. Is there any chance we can... [16:36:54] (03CR) 10Paladox: [V: 03+2 C: 03+2] Merge tag 'v2.16.8' into HEAD [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507987 (owner: 10Paladox) [16:40:30] (03PS1) 10Paladox: Remove plugin quota [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 [16:41:28] (03CR) 10Paladox: [V: 03+2 C: 03+2] Remove plugin quota [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507990 (owner: 10Paladox) [16:53:35] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10ayounsi) [16:53:51] (03PS1) 10Paladox: Update plugins for stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/507991 [16:54:11] 10Operations, 10cloud-services-team: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) a:03aborrero I'm taking a look. 
[16:54:30] 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) [17:07:26] !log T222148 drop udev from openstack-mitaka-jessie/jessie-wikimedia (related to T216497) [17:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:31] T222148: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 [17:07:32] T216497: CloudVPS: workaround archival of jessie-backports repo - https://phabricator.wikimedia.org/T216497 [17:07:39] (03PS2) 10CRusnov: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 [17:08:19] !log T222148 drop libudev1 from openstack-mitaka-jessie/jessie-wikimedia (related to T216497) [17:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:18] !log T222148 aborrero@labtestpuppetmaster2001:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:22] (03CR) 10CRusnov: "Thanks for the review as always :)" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [17:10:57] !log T222148 aborrero@labpuppetmaster1001:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:45] !log T222148 aborrero@labpuppetmaster1002:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:42] (03CR) 10CRusnov: "> Patch Set 1:" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [17:13:39] (03PS4) 10Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [17:15:46] !log T222148 aborrero@labstore1004:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:50] T222148: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 [17:16:43] !log T222148 aborrero@labstore1005:~ $ sudo apt-get install libudev1 udev systemd systemd-sysv libsystemd0 [17:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:29] 10Operations, 10cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (10aborrero) 05Open→03Resolved thanks @MoritzMuehlenhoff for the heads up. All seems fixed now. 
[17:19:52] (03PS13) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [17:23:22] (03PS14) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [17:23:49] (03CR) 10Cwhite: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [17:28:19] 10Operations, 10Cloud-VPS, 10Traffic, 10netops, 10cloud-services-team (Kanban): Evaluate the possibility to add Juniper images to Openstack - https://phabricator.wikimedia.org/T180179 (10aborrero) 05Open→03Stalled I moved this task to the `Graveyard` column in our kanban board because there are no ac... [17:28:22] (03CR) 10CRusnov: "Puppet compiler looks good." [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:29:30] (03CR) 10Andrew Bogott: [C: 03+2] openstack: eqiad1: repool cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507949 (https://phabricator.wikimedia.org/T221047) (owner: 10Arturo Borrero Gonzalez) [17:29:42] (03CR) 10CRusnov: Add a check_netbox_report icinga check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:29:57] (03PS15) 10CRusnov: Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) [17:32:34] (03CR) 10Rush: [C: 03+1] "Let's ask Mortiz to eyeball this if he has a moment but I believe this is OK." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506582 (https://phabricator.wikimedia.org/T221888) (owner: 10Andrew Bogott) [17:33:24] (03PS1) 10Andrew Bogott: Enable alerts for cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507997 [17:33:31] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:34:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Enable alerts for cloudvirt1007 [puppet] - 10https://gerrit.wikimedia.org/r/507997 (owner: 10Andrew Bogott) [17:34:45] (03CR) 10Dzahn: [C: 03+1] "i see nothing wrong with the usage of nrpe::monitor_service and the rest of the puppet/hiera part if it compiles, and it does, then that l" [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:35:34] win 5 [17:35:39] is there a generic 'upgrade away from jessie' task somewhere? [17:38:18] i am aware of one to "track remaining trusty servers" and several individual tasks to upgrade things to stretch.. 
but not of a "tracking" task to link all those [17:38:51] i think after trusty is killed that will trigger a "remaining jessie" one [17:39:33] ok [17:39:39] Krenair: haha, well https://phabricator.wikimedia.org/T168494 [17:39:47] (03CR) 10CRusnov: Add a check_netbox_report icinga check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:39:48] that question seemed familiar [17:40:13] (03CR) 10CRusnov: [C: 03+2] Add a check_netbox_report icinga check [puppet] - 10https://gerrit.wikimedia.org/r/505657 (https://phabricator.wikimedia.org/T215378) (owner: 10CRusnov) [17:40:22] closed as invalid when i suggested it once [17:40:40] well it has been a couple of years since then [17:40:44] I'm looking at my list of jessie stuff [17:40:45] haha "removed a project: Tracking-Neverending." [17:41:03] I think that project got renamed since then FWIW mutante, it might not have said neverending at the time [17:41:14] oh.. yea. likely [17:41:35] well, feel free to reopen or make a new one and we will see [17:42:14] in particular I'm wondering about the irc server [17:42:20] it went precise -> jessie, skipping trusty [17:42:33] so I'm wondering if it'll skip stretch and go straight to buster [17:43:40] (03PS2) 10Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 [17:44:29] i don't know. i dont think that the past skip is an indication for future upgrade [17:44:36] (03CR) 10jerkins-bot: [V: 04-1] [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 (owner: 10Jforrester) [17:46:11] (03PS3) 10Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 [17:47:04] (03CR) 10jerkins-bot: [V: 04-1] [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 (owner: 10Jforrester) [17:47:31] 10Operations: tracking task: jessie -> stretch - https://phabricator.wikimedia.org/T168494 (10Krenair) I'm wondering if now it is time to reopen this as jessie -> stretch/buster and track misc hosts at least... From the looks of things the following misc hosts are jessie: `actinium alcyone alsafi aluminium darms... [17:48:22] (03PS3) 10Dzahn: puppet_compiler: add cron to delete old output files [puppet] - 10https://gerrit.wikimedia.org/r/507623 (https://phabricator.wikimedia.org/T222072) [17:49:19] (03CR) 10Dzahn: [C: 03+2] puppet_compiler: add cron to delete old output files [puppet] - 10https://gerrit.wikimedia.org/r/507623 (https://phabricator.wikimedia.org/T222072) (owner: 10Dzahn) [17:49:30] (03PS4) 10Jforrester: [DNM] CommonSettings: Factor out variant config generation into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507726 [17:50:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [17:50:19] 10Operations, 10netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (10ayounsi) We don't monitor our IX peers much. 
We probably should configure BGP route damping, https://www.juniper.net/documentation/en_US/junos/topics/usage-guidelines/policy-using-routing... [17:53:49] at first glance in that list I'm seeing stuff relating to mailman, otrs, ldap, irc, gerrit [17:54:16] (03PS2) 10Jforrester: CommonSettings: Factor out load of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507727 [17:54:18] (03PS2) 10Jforrester: CommonSettings: Factor out write of variant config into MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507728 [17:54:20] (03PS2) 10Jforrester: [WIP] writeToStaticCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 [17:56:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] writeToStaticCache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (owner: 10Jforrester) [17:57:29] (03PS1) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:00:15] mutante, know anything about schleifenbauer? [18:00:41] Krenair: that's the company making the PSUs (used in esams) [18:00:50] ah [18:00:50] power supplies + stuff [18:01:34] (03PS2) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:02:30] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:04:30] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [18:06:18] (03PS3) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:09:25] 10Operations, 10puppet-compiler, 10Jenkins, 10Patch-For-Review: compiler1002.puppet-diffs.eqiad.wmflabs disk is full - https://phabricator.wikimedia.org/T222072 (10Dzahn) cron deployed on compiler1002 and tested to run manually. works. removed about 2G (from 44G to 42G size of the output dir). is there mo... [18:09:32] (03CR) 10CRusnov: "Compiles to expected output." 
[puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:11:59] (03CR) 10Dzahn: profile::netbox: Deploy config for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:15:45] (03CR) 10Dzahn: [C: 03+1] profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:18:54] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Deploy config for icinga checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508000 (owner: 10CRusnov) [18:19:09] (03PS4) 10CRusnov: profile::netbox: Deploy config for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/508000 [18:20:59] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) a:05Papaul→03RobH @RobH you mentioned that the port was disabled, but it looks like it is not https://librenms.wikimedia.org/alerts/ [18:22:11] (03PS6) 10Herron: WIP puppetmaster-standalone - add dynamic envs that map to gerrit [puppet] - 10https://gerrit.wikimedia.org/r/507846 [18:23:44] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 50.12, 25.14, 18.06 [18:24:38] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10RobH) a:05RobH→03Papaul @papaul: Please trace and disable the port, as it is unclear on the stack which port it was. [18:26:22] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 13.77, 22.15, 18.33 [18:29:00] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:29:15] ^ that's me, working on it [18:33:12] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/16328/" [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:33:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] kafkatee: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:33:59] (03PS2) 10Dzahn: kafkatee: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) [18:40:41] (03PS1) 10CRusnov: profile::netbox: Remove explicit directory creation causing failure [puppet] - 10https://gerrit.wikimedia.org/r/508002 [18:40:50] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:42:36] (03CR) 10Dzahn: [C: 03+1] "yep, per our IRC talk the conflict comes from the uwsgi class creating /etc/${title} and 'netbox' is the title.
Though an existing "!defin" [puppet] - 10https://gerrit.wikimedia.org/r/508002 (owner: 10CRusnov) [18:43:32] (03CR) 10Dzahn: "noop on analytics1030 and mwlog1001 - not even a compiler change on an-coord" [puppet] - 10https://gerrit.wikimedia.org/r/506557 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:44:10] (03CR) 10CRusnov: [C: 03+2] profile::netbox: Remove explicit directory creation causing failure [puppet] - 10https://gerrit.wikimedia.org/r/508002 (owner: 10CRusnov) [18:45:56] (03CR) 10Dzahn: [C: 03+1] LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [18:46:01] (03CR) 10CDanis: [C: 03+1] admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: 10Dzahn) [18:50:12] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:51:26] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:02:53] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10Dzahn) p:05Triage→03Normal [19:08:36] 10Operations, 10ops-codfw, 10decommission, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-5/0/15 descriptions Interface Admin Link Description ge-5/0/15 down down DI... [19:20:05] (03CR) 10Ayounsi: [C: 03+2] LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:20:09] (03PS1) 10Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) [19:20:18] (03PS2) 10Ayounsi: LibreNMS, change $install_dir to group librenms [puppet] - 10https://gerrit.wikimedia.org/r/507909 (https://phabricator.wikimedia.org/T207706) [19:20:49] (03CR) 10jerkins-bot: [V: 04-1] vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:21:09] 10Operations, 10Wikimedia-Mailing-lists: Create a "Wikimedians of Florida" mailing list - https://phabricator.wikimedia.org/T222473 (10Gaurav) [19:23:28] (03PS2) 10Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) [19:26:09] (03PS1) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:26:41] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:27:38] (03PS2) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:28:09] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:29:06] (03CR) 10Eevans: [C: 03+1] "Tested locally; Works for me. Thanks!" 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [19:30:19] (03PS3) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:30:22] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/16330/app-editor-tasks.mobile.eqiad.wmflabs/change.app-editor-tasks.mobile.eqiad.wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:30:56] (03CR) 10Effie Mouzeli: [C: 03+1] admins: remove ability to run commands as user 'apache' [puppet] - 10https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: 10Dzahn) [19:31:14] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:32:22] (03PS4) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:32:55] (03CR) 10jerkins-bot: [V: 04-1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (owner: 10CDanis) [19:33:47] (03PS5) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:37:53] 10Operations, 10ops-codfw: PDUs with Infeed < 0.5Amps - https://phabricator.wikimedia.org/T222464 (10ayounsi) [19:40:16] (03PS1) 10Ayounsi: LibreNMS, follow the symlink to $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) [19:41:11] (03PS2) 10Ayounsi: LibreNMS, follow the symlink to $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) [19:42:11] (03PS6) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 [19:43:24] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/16335/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:44:12] (03CR) 10Ayounsi: [C: 03+2] LibreNMS, follow the symlink to $install_dir [puppet] - 10https://gerrit.wikimedia.org/r/508014 (https://phabricator.wikimedia.org/T207706) (owner: 10Ayounsi) [19:47:57] (03PS3) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [19:48:09] (03PS4) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [19:49:35] (03CR) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 (owner: 10Jforrester) [19:49:39] (03PS5) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 [20:09:11] 10Operations, 10Wikimedia-Mailing-lists: Create a "Wikimedians of Florida" mailing list - https://phabricator.wikimedia.org/T222473 (10Dzahn) a:03Dzahn [20:17:58] 10Operations, 10Wikimedia-Mailing-lists: Create a "Wikimedians of Florida" mailing list - https://phabricator.wikimedia.org/T222473 (10Dzahn) 05Open→03Resolved Hello Gaurav , Done. I see you guys are already running the Wikimedians in Colorado list, so i assume you are familiar and i will keep it short.... 
[20:26:37] Operations, ops-eqiad, Gerrit, serviceops, Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Dzahn) a:Dzahn
[20:39:56] Operations, Discovery-Search: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Dzahn) p:Triage→Normal
[20:40:33] Operations, netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (Dzahn) p:Triage→Normal
[20:40:37] Operations, netops, observability: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (Dzahn)
[22:09:28] !log clear v4 BGP to AS17451 on cr1-eqsin/cr4-ulsfo
[22:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:02] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:16:02] PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:16:40] XioNoX
[22:16:58] thx
[22:17:22] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:17:22] RECOVERY - BFD status on cr2-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:17:30] seems like it was unplanned
[22:17:34] woop
[22:27:20] cr1-eqiad and cr2-eqdfw might alert as well
[22:27:53] Seems like GTT is having issues
[22:49:21] Operations, Puppet, Cloud-Services, Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (Dzahn)
[23:48:43] threads have been piling up all day in gerrit. Seems to be getting worse instead of better. I think I'm going to give gerrit a restart before I walk away for the weekend.
[23:49:37] !log gerrit restart due to threads piling up
[23:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:48] any indication as to the cause?
[23:50:06] not yet
[23:50:51] !log gerrit back
[23:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:02] chaomodus: I've been keeping notes as I go on https://phabricator.wikimedia.org/T221026
[23:51:37] I've been tweaking things here and there, but there has been no silver bullet.
[23:52:11] Mmh
[23:52:23] thread dumps, GC monitoring: everything looks mostly normal afaict (after all the tweaks anyway)
[23:52:37] This is often the java experience (tm); I used to admin Solr and had similar experiences
[23:53:08] though the threads did look a lot better after the java upgrade
[23:53:21] since they go up then come back down now
[23:53:30] not today :(
[23:53:33] instead of just staying stuck (and gradually going up)
[23:53:54] I see g1 has been considered; properly tuned it may have a lot of benefit if full GCs can be avoided
[23:54:07] or has that already been deployed
[23:54:13] chaomodus: we are actually currently using g1 and it has been a big boon
[23:54:18] nice
[23:54:33] (PS1) CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066
[23:54:44] there were many concurrent full GCs happening with the parallel GC. Lots of pause time.
[23:54:52] ^^
[23:55:23] well that's something at least
[23:55:31] it's still getting some sort of thread deadlock though?
[23:56:02] it appears threads are going higher than normal
[23:56:30] e.g. going up to 17.0 today
[23:56:31] seemingly. Although I haven't really caught anything in thread dumps.
[23:56:44] (PS2) CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066
[23:56:50] well. Nothing since the latest rounds of changes anyway.
[23:57:22] http threads were getting stuck behind a sendemail thread, but there was some java bug that may have sorted that one out
[23:57:51] now it seems like: http threads are all runnable, but there are a lot of them stacking up for no reason I can see
[23:58:07] like the connections aren't timing out or something?
[23:58:38] no timeouts, no locked resources between threads, no change in traffic at that time
[23:58:58] I mean it's not cleaning up completed connections maybe
[23:59:07] idk much about what it uses, tomcat?
[23:59:11] jetty
[23:59:14] ah
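An aside on the G1 exchange above: the older parallel collector falls back to long stop-the-world full collections under heap pressure, while G1 reclaims heap regions incrementally and mostly concurrently, so avoiding full GCs is the point of the switch. The snippet below is an illustrative sketch of enabling G1 on a JVM service with GC logging to verify that full GCs have stopped; the actual flags used on the Gerrit hosts are not shown in this log, and the log path is a placeholder.

# Illustrative only: not the production Gerrit flags.
# Enable G1 and request pause times bounded around 200ms.
JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
# Java 8-style GC logging; grep the resulting log for "Full GC" to
# confirm the full collections have gone away.
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/gerrit/gc.log"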
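The kind of diagnostics referenced in the exchange above (thread dumps plus inspection of Gerrit's work queue) can be gathered roughly as sketched below; <gerrit-pid> and the output path are placeholders, and show-queue requires elevated Gerrit permissions.

# Dump all JVM threads, including lock ownership, to look for stuck or
# deadlocked threads (jstack ships with the JDK).
jstack -l <gerrit-pid> > /tmp/gerrit-threads.txt

# Gerrit's SSH interface can list its internal work queue; -w prints
# task names without truncation.
ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w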