[00:01:43] PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 4 days ago: Most recent backup 2019-12-03 23:33:06 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[00:02:05] RECOVERY - Disk space on netflow2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[01:09:46] !log andrew@deploy1001 Started deploy [horizon/deploy@841693b]: (no justification provided)
[01:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:34] !log andrew@deploy1001 Finished deploy [horizon/deploy@841693b]: (no justification provided) (duration: 01m 48s)
[01:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:57] !log andrew@deploy1001 Started deploy [horizon/deploy@accbbd1]: (no justification provided)
[01:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:16:51] !log andrew@deploy1001 Finished deploy [horizon/deploy@accbbd1]: (no justification provided) (duration: 01m 53s)
[01:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:25] (PS1) Andrew Bogott: Horizon: update apache site to use Horizon's new wsgi.py [puppet] - https://gerrit.wikimedia.org/r/555704 (https://phabricator.wikimedia.org/T239974)
[01:36:56] !log andrew@deploy1001 Started deploy [horizon/deploy@accbbd1]: (no justification provided)
[01:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:44] !log andrew@deploy1001 Finished deploy [horizon/deploy@accbbd1]: (no justification provided) (duration: 00m 07s)
[01:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:00] !log andrew@deploy1001 Started deploy [horizon/deploy@accbbd1]: (no justification provided)
[01:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:40:36] !log andrew@deploy1001 Finished deploy [horizon/deploy@accbbd1]: (no justification provided) (duration: 01m 49s)
[01:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:03] !log andrew@deploy1001 Started deploy [horizon/deploy@accbbd1]: (no justification provided)
[01:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:50] !log andrew@deploy1001 Finished deploy [horizon/deploy@accbbd1]: (no justification provided) (duration: 01m 47s)
[01:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:47:34] !log andrew@deploy1001 Started deploy [horizon/deploy@accbbd1]: (no justification provided)
[01:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:28] !log andrew@deploy1001 Finished deploy [horizon/deploy@accbbd1]: (no justification provided) (duration: 01m 55s)
[01:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:51:43] (CR) Andrew Bogott: [C: +2] Horizon: update apache site to use Horizon's new wsgi.py [puppet] - https://gerrit.wikimedia.org/r/555704 (https://phabricator.wikimedia.org/T239974) (owner: Andrew Bogott)
[01:54:40] (PS1) Andrew Bogott: horizon: replace apache site venv dir with an .erb lookup [puppet] - https://gerrit.wikimedia.org/r/555705 (https://phabricator.wikimedia.org/T239974)
[01:58:19] (CR) Andrew Bogott: [C: +2] horizon: replace apache site venv dir with an .erb lookup [puppet] - https://gerrit.wikimedia.org/r/555705 (https://phabricator.wikimedia.org/T239974) (owner: Andrew Bogott)
[02:17:55] !log andrew@deploy1001 Started deploy [horizon/deploy@ed2243c]: (no justification provided)
[02:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:19:45] !log andrew@deploy1001 Finished deploy [horizon/deploy@ed2243c]: (no justification provided) (duration: 01m 50s)
[02:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:40] !log andrew@deploy1001 Started deploy [horizon/deploy@ff0a0e7]: (no justification provided)
[02:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:58:33] !log andrew@deploy1001 Finished deploy [horizon/deploy@ff0a0e7]: (no justification provided) (duration: 01m 53s)
[02:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:01:54] (PS1) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183)
[04:02:32] (CR) jerkins-bot: [V: -1] netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[04:48:00] (PS2) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183)
[04:48:36] (CR) jerkins-bot: [V: -1] netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[04:54:15] (PS3) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183)
[04:54:42] (CR) jerkins-bot: [V: -1] netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov)
[08:40:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:42:37] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:51:21] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad
[09:52:13] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw
[10:41:31] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=eqiad
[10:42:23] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/dashboard/file/swift?panelId=9&fullscreen&orgId=1&var-DC=codfw
[13:11:53] (PS1) TechneSiyam: Modified InitialiseSettings with 1.5x and 2x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/555723
[13:11:55] (PS1) TechneSiyam: Modified InitialiseSettings with 1.5x and 2x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/555724
[13:41:45] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100%
[14:14:23] (CR) Masumrezarock100: [C: +1] "LGTM. Perhaps @Urbanecm can run a script to cleanup existing redirects like he did for Commons?" [mediawiki-config] - https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050) (owner: IAmNetx)
[14:15:10] (CR) Urbanecm: [C: +1] "> Patch Set 1: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/555692 (https://phabricator.wikimedia.org/T240050) (owner: IAmNetx)
[14:50:19] (CR) Urbanecm: [C: -1] Modified InitialiseSettings with 1.5x and 2x logos (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/555723 (owner: TechneSiyam)
[14:50:27] (Abandoned) Urbanecm: Modified InitialiseSettings with 1.5x and 2x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/555724 (owner: TechneSiyam)
[15:54:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:56:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:41:34] (PS1) TechneSiyam: Modified InitialiseSettings with 1.5x and 2x logos [mediawiki-config] - https://gerrit.wikimedia.org/r/555730
[17:15:44] (PS1) ArielGlenn: Revert "configure adds-changes dumps to skip locking for now" [puppet] - https://gerrit.wikimedia.org/r/555732
[17:16:01] (PS2) ArielGlenn: Revert "configure adds-changes dumps to skip locking for now" [puppet] - https://gerrit.wikimedia.org/r/555732
[17:17:07] (CR) ArielGlenn: [C: +2] Revert "configure adds-changes dumps to skip locking for now" [puppet] - https://gerrit.wikimedia.org/r/555732 (owner: ArielGlenn)
[18:08:11] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[18:08:15] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[18:08:23] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp1088.eqiad.wmnet, cp1084.eqiad.wmnet, cp1080.eqiad.wmnet, cp1076.eqiad.wmnet are marked down but pooled: uploadlb_443: Servers cp1076.eqiad.wmnet, cp1084.eqiad.wmnet, cp1088.eqiad.wmnet, cp1078.eqiad.wmnet, cp1080.eqiad.wmnet, cp1082.eqiad.wmnet, cp1090.eqiad.wmnet are marked down but pooled https://wikitech.wikimedi
[18:08:23] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp1076.eqiad.wmnet, cp1078.eqiad.wmnet, cp1080.eqiad.wmnet, cp1082.eqiad.wmnet, cp1090.eqiad.wmnet are marked down but pooled: uploadlb_443: Servers cp1088.eqiad.wmnet, cp1082.eqiad.wmnet, cp1084.eqiad.wmnet, cp1076.eqiad.wmnet, cp1080.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:08:27] PROBLEM - LVS HTTPS IPv4 #page on upload-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:09:18] checking
[18:10:23] following along :)
[18:10:56] * shdubsh here
[18:11:05] what's up
[18:11:21] hello
[18:11:27] following up in _security
[18:11:49] * volans|off here
[18:13:37] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[18:13:45] RECOVERY - LVS HTTPS IPv4 #page on upload-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 859 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:15:25] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[18:15:41] RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:15:41] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:55:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:57:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:15:21] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.55 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:17:09] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:25:48] (PS1) BryanDavis: cloud: update maintain-views to handle dblists with comments [puppet] - https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415)
[19:27:34] (CR) jerkins-bot: [V: -1] cloud: update maintain-views to handle dblists with comments [puppet] - https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415) (owner: BryanDavis)
[19:30:27] (PS2) BryanDavis: cloud: update maintain-views to handle dblists with comments [puppet] - https://gerrit.wikimedia.org/r/555740 (https://phabricator.wikimedia.org/T239415)
[19:44:19] heh bd808 been there and did that same update on my stuff...
[19:55:12] apergos: shhhhh... we aren't here. ;)
[19:56:07] neither am I! see ya ;-)
[21:08:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:04:58] (PS1) Kosta Harlan: Document workaround for certificate issue on macOS [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/555751
[22:16:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:58:57] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100%