[00:00:04] Deploy window No deploys! (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191129T0000)
[00:22:07] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:24:45] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:25:25] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:28:55] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28559 bytes in 8.337 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:28:59] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28427 bytes in 1.467 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:29:53] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28536 bytes in 1.245 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:38:45] (CR) TerraCodes: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/551778 (https://phabricator.wikimedia.org/T45956) (owner: TerraCodes)
[00:47:18] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1329.eqiad.wmnet'] ` and were **ALL** successful.
[01:06:00] PROBLEM - mediawiki-installation DSH group on mw1309 is CRITICAL: Host mw1309 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:10:42] PROBLEM - mediawiki-installation DSH group on mw1315 is CRITICAL: Host mw1315 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:54:01] (CR) Vgutierrez: [C: +1] daemon: fix memory leak in Python 3.7+ [software/keyholder] - https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[02:17:13] Operations, Traffic: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (Vgutierrez) Open→Resolved p:Triage→Normal a:Vgutierrez Yep, this is most likely another occurrence of T238305
[02:17:29] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[02:17:51] Operations, Traffic: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (Vgutierrez)
[02:17:54] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[03:44:24] PROBLEM - mediawiki-installation DSH group on mw1329 is CRITICAL: Host mw1329 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[04:09:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:30:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:06] (PS1) Vgutierrez: ATS: Trigger update-ocsp-all iff non acme-chief certs are deployed [puppet] - https://gerrit.wikimedia.org/r/553526
[04:52:35] (PS2) Vgutierrez: ATS: Trigger update-ocsp-all iff non acme-chief certs are deployed [puppet] - https://gerrit.wikimedia.org/r/553526
[04:55:36] * Krinkle staging on mwdebug1001 - https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/553372/
[04:58:56] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5963 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:00:38] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:00:40] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/19679/" [puppet] - https://gerrit.wikimedia.org/r/553526 (owner: Vgutierrez)
[05:08:33] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.5/includes/exception/MWExceptionHandler.php: 532f4aba96d85 (duration: 01m 03s)
[05:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1134 after schema change', diff saved to https://phabricator.wikimedia.org/P9781 and previous config saved to /var/cache/conftool/dbconfig/20191129-055845-marostegui.json
[05:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:24] PROBLEM - Wikitech and wt-static content in sync on cloudweb2001-dev is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200946s 200000s)
https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:14:57] (PS1) Marostegui: mariadb: Remove db2062 from puppet [puppet] - https://gerrit.wikimedia.org/r/553530 (https://phabricator.wikimedia.org/T238726)
[06:15:19] (PS1) Marostegui: wmnet: Remove production dns for db2062 [dns] - https://gerrit.wikimedia.org/r/553531 (https://phabricator.wikimedia.org/T238726)
[06:15:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:15] (CR) Marostegui: [C: +2] mariadb: Remove db2062 from puppet [puppet] - https://gerrit.wikimedia.org/r/553530 (https://phabricator.wikimedia.org/T238726) (owner: Marostegui)
[06:16:58] (CR) Marostegui: [C: +2] wmnet: Remove production dns for db2062 [dns] - https://gerrit.wikimedia.org/r/553531 (https://phabricator.wikimedia.org/T238726) (owner: Marostegui)
[06:17:39] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui) a:Marostegui→Papaul
[06:17:43] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui)
[06:18:06] Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui) Host ready for @Papaul to take over
[06:19:10] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:10:15] RECOVERY - mediawiki-installation DSH group on mw1309 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:15:03] RECOVERY - mediawiki-installation DSH group on mw1315 is OK: OK
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:22:53] (CR) Gergő Tisza: [C: +1] GrowthExperiments: Remove unused config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[07:25:07] !log reimage mw1312.eqiad.wmnet mw1308.eqiad.wmnet mw1261.eqiad.wmnet
[07:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:10] !reimage mw1262.eqiad.wmnet mw1278.eqiad.wmnet
[07:26:30] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1312.eqiad.wmnet', 'mw1308.eqiad.wmnet', 'mw1261.eqiad.wmnet'] ` The log can be found in `/var/log/...
[07:26:33] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1262.eqiad.wmnet', 'mw1278.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191...
[07:46:16] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki)
[07:46:43] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki)
[07:48:37] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[07:48:39] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[07:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:42] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[07:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:49] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:12] (CR) Muehlenhoff: [C: +2] Remove obsolete partman recipe [puppet] - https://gerrit.wikimedia.org/r/553463 (owner: Muehlenhoff)
[07:55:05] PROBLEM - dhclient process on mw1262 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:55:43] ^ host is being reimaged
[07:58:17] RECOVERY - mediawiki-installation DSH group on mw1329 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:00:45] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (207674s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:00:45] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (207674s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:18:56] !log reimage mw2223.codfw.wmnet mw2222.codfw.wmnet mw2221.codfw.wmnet mw2220.codfw.wmnet
[08:19:00] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:18] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2223.codfw.wmnet', 'mw2222.codfw.wmnet', 'mw2221.codfw.wmnet', 'mw2220.codfw.wmnet'] ` The log can...
[08:19:37] (CR) Muehlenhoff: [C: +1] "Nice detective work, looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[08:21:40] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:32:09] RECOVERY - dhclient process on mw1262 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:34:56] (CR) Muehlenhoff: "A straight replace won't work here, e.g. eventlog1002 uses half a TB of data in /srv, while the recipe now applied to it only provides 80G" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:38:55] (PS1) Muehlenhoff: Remove obsolete entries from netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/553694
[08:43:37] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:59] (CR) Vgutierrez: [C: +1] keyholder: fix memory leak in Python 3.7+ [puppet] - https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[08:45:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:56] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.95 ge 0.5
https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:58:10] hah that's a new one, filing a task :|
[08:59:09] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:42] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.025 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[09:00:32] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:13] !log temporary disabling puppet on 'R:keyholder::agent' to merge gerrit:operations/puppet/+/553460 - T239386
[09:01:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:24] T239386: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386
[09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:17] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:02:29] (CR) Volans: [C: +2] keyholder: fix memory leak in Python 3.7+ [puppet] - https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[09:03:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:54] (PS1) Filippo Giunchedi: prometheus: fix link to logstash indexing errors [puppet] - https://gerrit.wikimedia.org/r/553695
[09:15:50] (CR) Filippo Giunchedi: [C: +2] prometheus: fix link to logstash indexing errors [puppet] - https://gerrit.wikimedia.org/r/553695 (owner: Filippo Giunchedi)
[09:18:36] Operations, serviceops: Kubernetes emits log level 50 - https://phabricator.wikimedia.org/T239459 (jijiki)
[09:19:25] PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[09:20:20] unfortunately expected, will be fixed as soon as I get a @netops online to arm homer's key
[09:22:22] (PS2) Filippo Giunchedi: install_server: move custom partman recipes to partman/custom [puppet] - https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955)
[09:23:44] (CR) Filippo Giunchedi: [C: +2] install_server: move custom partman recipes to partman/custom [puppet] - https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:23:58] !log reimage mw1263.eqiad.wmnet mw1307.eqiad.wmnet
[09:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:20] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1263.eqiad.wmnet', 'mw1307.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191...
[09:26:47] PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[09:27:33] damn, too slow, this will recover now
[09:28:25] RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys.
https://wikitech.wikimedia.org/wiki/Keyholder
[09:33:34] !log Stop replication on db2105 (s3 codfw) for schema change
[09:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:53] !log Remove triggers from db2094:3313 - T234704
[09:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:58] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[09:34:51] (Abandoned) Jbond: debdeploy: ignore emacs for service restarts [puppet] - https://gerrit.wikimedia.org/r/507056 (owner: Jbond)
[09:35:32] (Abandoned) Jbond: decom cp3011-22: Removing old DNS entries [dns] - https://gerrit.wikimedia.org/r/507010 (https://phabricator.wikimedia.org/T130883) (owner: Jbond)
[09:36:20] (Abandoned) Jbond: varnish: ratelimit thumbor - cache_upload backend [puppet] - https://gerrit.wikimedia.org/r/514019 (https://phabricator.wikimedia.org/T224434) (owner: Jbond)
[09:40:15] (CR) Jbond: "bump, any chance of a review (its simple one line patch)" [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[09:41:23] (Abandoned) Jbond: backup::host: remove day and jobsdefault [puppet] - https://gerrit.wikimedia.org/r/547559 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[09:41:54] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1278.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1278.eqiad.wmnet'] `
[09:42:39] (PS1) Ema: ATS: allow connection reuse for requests with Authorization [puppet] - https://gerrit.wikimedia.org/r/553696 (https://phabricator.wikimedia.org/T238494)
[09:43:39] (CR) Jbond: "Anything still outstanding on this?"
[puppet] - https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[09:45:29] (CR) Vgutierrez: [C: +1] ATS: allow connection reuse for requests with Authorization [puppet] - https://gerrit.wikimedia.org/r/553696 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:46:24] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:36] (PS13) Jbond: puppet-merge: refactor [puppet] - https://gerrit.wikimedia.org/r/544214
[09:49:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:31] ACKNOWLEDGEMENT - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. Volans Waiting for @netops to arm the key as I dont have access https://wikitech.wikimedia.org/wiki/Keyholder
[09:51:48] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] (PS2) Filippo Giunchedi: Remove obsolete entries from netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/553694 (owner: Muehlenhoff)
[09:53:50] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:00] (CR) Filippo Giunchedi: [C: +1] "LGTM, PS2 rebased" [puppet] - https://gerrit.wikimedia.org/r/553694 (owner: Muehlenhoff)
[09:56:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:25] (CR) Muehlenhoff: [C: +2] Remove
obsolete entries from netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/553694 (owner: Muehlenhoff)
[10:02:23] RECOVERY - mediawiki-installation DSH group on mw1262 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[10:03:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:12:15] (PS1) Volans: keyholder proxy: do not notify the agent [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386)
[10:14:58] (PS14) Jbond: puppet-merge: refactor [puppet] - https://gerrit.wikimedia.org/r/544214
[10:15:00] (PS2) Volans: keyholder: notify only the proxy on proxy changes [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386)
[10:15:02] (PS4) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943
[10:16:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:16:42] (CR) Jbond: "Hi All Any chance on another pass on this and https://gerrit.wikimedia.org/r/c/operations/puppet/+/544943" [puppet] - https://gerrit.wikimedia.org/r/544214 (owner: Jbond)
[10:18:18] (CR) Jbond: "@alex, wonder if you have an answer to moritz's questions?"
[puppet] - https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: Jbond)
[10:19:25] (CR) Jbond: [V: +2 C: +2] promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[10:21:03] (CR) Volans: "Compiler results here:" [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[10:22:33] !log reimage mw1306.eqiad.wmnet mw1264.eqiad.wmnet mw1279.eqiad.wmnet
[10:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:53] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1306.eqiad.wmnet', 'mw1264.eqiad.wmnet', 'mw1279.eqiad.wmnet'] ` The log can be found in `/var/log/...
[10:22:56] (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[10:25:51] (CR) Volans: [C: +2] keyholder: notify only the proxy on proxy changes [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[10:25:54] (PS2) Jbond: promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934)
[10:26:28] (CR) Jbond: [V: +2 C: +2] promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[10:27:00] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki) >>!
In T239054#5701191, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['mw1278.eqiad.wmnet'] > ` > > Of which those **FAILED**: > ` > ['mw1278.eqiad.wmnet'] > ` I checked the...
[10:28:31] (Abandoned) Jbond: CI - pytohn3: make python3 the default for tests [puppet] - https://gerrit.wikimedia.org/r/550437 (owner: Jbond)
[10:44:26] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1308.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1308.eqiad.wmnet'] `
[10:46:03] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[10:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:45] !log elukey@deploy1001 Started deploy [analytics/refinery@97015e4] (thin): Deploy thin Analytics Refinery (no jars/git-fat-obj) to notebook and labstore hosts
[10:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:53] !log elukey@deploy1001 Finished deploy [analytics/refinery@97015e4] (thin): Deploy thin Analytics Refinery (no jars/git-fat-obj) to notebook and labstore hosts (duration: 00m 08s)
[10:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:30] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:06] (CR) Alexandros Kosiaris: "Sigh. I had a response since Nov 8 and never pressed "Reply". Sorry about that."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: Jbond)
[10:57:31] (PS2) Jbond: apereo_cas: add prometheus actuator [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934)
[10:58:04] (CR) Urbanecm: [C: +1] GrowthExperiments: Remove unused config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[10:58:17] log reimage mw1268.eqiad.wmnet mw1280.eqiad.wmnet mw1281.eqiad.wmne
[10:58:36] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1268.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1281.eqiad.wmnet'] ` The log can be found in `/var/log/...
[10:58:40] (CR) Volans: "LGTM, two nits inline" (2 comments) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[10:58:47] effie: you miss !, fyi
[10:58:54] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[10:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:38] Operations, Core Platform Team, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, serviceops-radar: wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (akosiaris) p:Triage→Low
[11:03:37] !log 10:58:17 log reimage mw1268.eqiad.wmnet mw1280.eqiad.wmnet mw1281.eqiad.wmne
[11:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:44] ^ better than nothing Urbanecm
[11:04:06] thx Lucas_WMDE
[11:04:14] (eh, I should’ve removed the “log” word ^^)
[11:08:51] Operations,
Dumps-Generation, hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (ArielGlenn) Open→Resolved a:ArielGlenn This is so done. So very very done. Thanks everyone!
[11:08:53] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn)
[11:14:12] (CR) Muehlenhoff: "One comment inline, but feel free to ignore, this would also work." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:14:31] (PS3) Jbond: apereo_cas: add prometheus actuator [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934)
[11:20:18] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:44] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:57] Operations, Core Platform Team, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, serviceops-radar: wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (jijiki) I am afraid this is not limited to wikifeeds on...
[11:30:37] (CR) Jbond: "thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:33:08] PROBLEM - Host mw1308 is DOWN: PING CRITICAL - Packet loss = 100%
[11:33:16] ^ me
[11:34:36] RECOVERY - Host mw1308 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[11:34:59] !log reimage mw1268.eqiad.wmnet mw1280.eqiad.wmnet mw1281.eqiad.wmnet
[11:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:05] thanx Urbanecm
[11:35:23] and Lucas_WMDE
[11:35:47] sorry, there has been a lot of notifications for this channel with my nicknames :p
[11:36:37] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:37:04] no problem ^^
[11:39:16] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1263.eqiad.wmnet', 'mw1307.eqiad.wmnet'] ` and were **ALL** successful.
[11:39:58] !log disabling puppet on dbprov2001 to test recoveries
[11:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:59] Operations, CX-cxserver, Core Platform Team, Mobile-Content-Service, and 2 others: wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (akosiaris) So it does look like it's service-runner specific then.
[11:46:04] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, minus the two notes by Volans."
[puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[11:46:24] Operations, CX-cxserver, Core Platform Team, Mobile-Content-Service, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (akosiaris)
[11:47:05] (PS2) Jbond: wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893
[11:49:01] (PS3) Jbond: wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893
[11:49:08] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:46] (CR) Jbond: "updated thanks" (2 comments) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[11:51:07] (CR) Jbond: [C: +2] apereo_cas: add prometheus actuator [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:51:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:15] (CR) Alexandros Kosiaris: [C: +1] "None on my side.
+1ed" [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [11:58:59] !log reimage mw1305.eqiad.wmnet mw1265.eqiad.wmnet mw1270.eqiad.wmnet [11:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:32] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1305.eqiad.wmnet', 'mw1265.eqiad.wmnet', 'mw1270.eqiad.wmnet'] ` The log can be found in `/var/log/... [12:00:48] PROBLEM - mediawiki-installation DSH group on mw2222 is CRITICAL: Host mw2222 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [12:19:18] RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [12:22:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:09] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:23] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:29] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:34] !log disabling puppet also on on backup1001 to 
test recoveries [12:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:11] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:53] (03CR) 10Ema: [C: 03+2] ATS: allow connection reuse for requests with Authorization [puppet] - 10https://gerrit.wikimedia.org/r/553696 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [13:04:43] Is there some maintenance script that can be run to manually update Special:Statistics? When new wikis are imported, Special:Statistics says "0" in all fields for several days after the import is finished. Case in point: https://szy.wikipedia.org/wiki/sazumaay:Statistics [13:08:11] there is a maintenance script that could be run: updateSpecialPages.php [13:08:52] a cron job already does this, running once every 3 days [13:11:26] apergos, then the question becomes, why isn't the stats on szywiki updated yet? The import was done 8 days ago [13:12:26] I don't know. that's worth investigating [13:13:01] let me see if that cronjob actually still runs or if the manifest isn't active [13:13:56] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Error while trying to restore sodium contents: `lines=10 29-Nov 13:04 backup1001.eqiad.wmnet JobId 162656: Start Restore Job RestoreFiles.2019-11-29... 
[13:14:05] it's active [13:14:23] let me make sure I am reading the crontab correctly and that it's every 3 days and not every 10 days or something [13:14:52] nope, it's every 3 days all right [13:16:01] Nov 28 05:00:01 mwmaint1002 CRON[128049]: (www-data) CMD (flock -n /var/lock/update-special-pages /usr/local/bin/foreachwiki updateSpecialPages.php > /var/log/mediawiki/updateSpecialPages.log 2>&1) [13:16:12] this ran yesterday [13:16:25] let me see what the log shows [13:18:26] it ran and there are messages indicating updates. they do not include the article count etc however. [13:20:11] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2223.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2223.codfw.wmnet'] ` [13:20:36] https://szy.wikipedia.org/wiki/sazumaay:SpecialPages links off of here are updated, the ones in the maintenance reports [13:21:57] was the import done as part of wiki creation? 
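For reference, the cron entry quoted above wraps the job in `flock -n`, so a run that starts while the previous one still holds `/var/lock/update-special-pages` is skipped rather than queued. Below is a minimal Python sketch of that non-blocking lock pattern. The helper name `run_exclusively` and the idea of passing the job as a callable are illustrative assumptions, not anything taken from the log or from the actual cron setup.

```python
import fcntl
import os


def run_exclusively(lock_path, job):
    """Run job() only if nobody else holds the lock on lock_path.

    Hypothetical helper that imitates `flock -n LOCKFILE CMD`:
    when the lock is already taken, give up at once instead of waiting.
    Returns True if the job ran, False if it was skipped.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately when the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        job()
        return True
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

A caller would pass the lock file path and a function that does the work. A `False` return value means another run was still in progress, which is one way a scheduled run can be skipped without leaving an error in the job's own log.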
[13:22:48] (03PS3) 10Jbond: package_builder: clean up build and results directory [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) [13:22:56] if so, I would suggest you open a task in phabricator to update the specific statistic (requires running that script with a specific argument, I think) as part of wiki creation [13:23:30] in general those particular stats (articles, pages) are not updated by script but as new articles/pages are created the numbers are incremented [13:24:29] usually it will take a few days from when I import a new wiki until the stats are updated, but it has taken unusually long in this case [13:26:11] 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Same for bast1001: `lines=10 29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Job RestoreFiles.2019-11-29_13.18.59_50 29-Nov 13:19 b... [13:26:33] (03CR) 10Jbond: "thanks, updated with timings suggested by alex, although I went for a more conservative 2 weeks on the build dir.
happy to lower if though" [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: 10Jbond) [13:26:35] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/elastalert] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/553734 [13:27:32] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:16] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735 [13:29:11] I see a few new pages in recent changes and they have not registered in special:statistics, that seems odd to me [13:29:23] I wonder if something has changed in the way that's updated since I looked at it last [13:29:41] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:29:43] I do suggest a task in phabricator to get people who pay attention to that code to look at it [13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:46] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 [13:29:53] Jhs: Have you tried updateArticleCount (https://www.mediawiki.org/wiki/Manual:UpdateArticleCount.php) [13:32:49] https://www.mediawiki.org/wiki/Manual:InitSiteStats.php is probably what was needed, but I would expect that to have been run after the import [13:33:13] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/elastalert] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/553734 (owner: 10Hashar) [13:33:25] !log reenable puppet on dbprov2001, backup1001 [13:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:15] https://phabricator.wikimedia.org/T237369 maybe you could ask about it here and re-open the task [13:43:50] PROBLEM - mediawiki-installation DSH group on mw2221 is 
CRITICAL: Host mw2221 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:44:16] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) >>! In T239054#5701280, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['mw1308.eqiad.wmnet'] > ` > > Of which those **FAILED**: > ` > ['mw1308.eqiad.wmnet'] > ` I run and che... [13:44:32] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1306.eqiad.wmnet', 'mw1264.eqiad.wmnet', 'mw1279.eqiad.wmnet'] ` and were **ALL** successful. [13:46:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:31] PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:56:12] (03PS2) 10Ema: ATS: re-use origin server connections for matching IPs [puppet] - 10https://gerrit.wikimedia.org/r/553490 (https://phabricator.wikimedia.org/T238494) [14:00:31] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [14:02:01] RECOVERY - mediawiki-installation DSH group on mw1268 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:02:01] RECOVERY - mediawiki-installation DSH group on mw2221 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:02:01] RECOVERY - mediawiki-installation DSH group on mw2222 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [14:02:06] !log reimage mw1271.eqiad.wmnet mw1272.eqiad.wmnet mw1304.eqiad.wmnet [14:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:24] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1271.eqiad.wmnet', 'mw1272.eqiad.wmnet', 'mw1304.eqiad.wmnet'] ` The log can be found in `/var/log/... [14:09:36] (03PS2) 10Hashar: Configuration for gbp buildpackage and fix patches [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735 [14:09:38] (03PS1) 10Hashar: Revert local hack to sources [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553741 [14:10:36] (03PS3) 10Hashar: Configuration for gbp buildpackage [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735 [14:13:33] !log reimage mw2228 for partman tests [14:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:28] (03CR) 10Hashar: "I have reverted the 3 commits that made change to the upstream source and then manually updated debian/patches/reqfix . 
It is simple enou" [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553741 (owner: 10Hashar) [14:14:45] !log filippo@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2228.codfw.wmnet [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:56] (03CR) 10Hashar: "The build pass but lintian reports failures:" (032 comments) [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735 (owner: 10Hashar) [14:17:32] (03PS1) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826) [14:23:51] 10Operations, 10Traffic, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) 05Open→03Resolved >>! In T237687#5679746, @Krinkle wrote: >> ` if ts.client_request.get_url_host() == 'appservers-rw.svc.wmnet' then` > > Looks like condition... [14:23:53] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema) [14:25:36] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:46] (03PS1) 10Jbond: profile::prometheus::ops: add scraper for apero_cas idp service [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) [14:27:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:03] (03PS1) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 [14:31:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] package_builder: clean up build and results directory [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: 10Jbond) [14:32:00] 
(03CR) 10jerkins-bot: [V: 04-1] prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris) [14:32:01] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1268.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1281.eqiad.wmnet'] ` and were **ALL** successful. [14:35:08] (03PS2) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 [14:35:15] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:36] !log reimage mw1323.eqiad.wmnet mw1297.eqiad.wmnet mw1273.eqiad.wmnet [14:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:27] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1323.eqiad.wmnet', 'mw1297.eqiad.wmnet', 'mw1273.eqiad.wmnet'] ` The log can be found in `/var/log/... 
[14:38:08] (03CR) 10Jbond: [C: 03+2] package_builder: clean up build and results directory [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: 10Jbond) [14:39:01] (03PS3) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 [14:40:02] (03PS1) 10Jbond: Revert "package_builder: clean up build and results directory" [puppet] - 10https://gerrit.wikimedia.org/r/553745 [14:41:11] (03CR) 10Jbond: [C: 03+2] Revert "package_builder: clean up build and results directory" [puppet] - 10https://gerrit.wikimedia.org/r/553745 (owner: 10Jbond) [14:43:23] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:44:37] 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10akosiaris) `need to be able to understand the pooled status` I have to question this. Why... [14:45:28] !log reimage mw1282.eqiad.wmne [14:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:46] !log reimage mw1282.eqiad.wmnet [14:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:03] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1282.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911291445_jiji_220572.log`. 
[14:48:02] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 556 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:49:33] (03CR) 10Muehlenhoff: profile::prometheus::ops: add scraper for apero_cas idp service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [14:51:37] (03PS2) 10Jbond: profile::prometheus::ops: add scraper for apero_cas idp service [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) [14:52:03] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "overall LGTM - see two nitpicks and what I think is just a simple error." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [14:52:44] (03PS4) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 [14:56:51] (03PS5) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 [14:59:04] (03PS2) 10Hashar: Update debian/changelog to point to unstable [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 [14:59:30] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:44] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5698785, @Krinkle wrote: > Latency remains elevated. Do we have a status update or better idea about the root... 
[15:02:09] (03PS1) 10ArielGlenn: make cirrussearch dumps write output to a temp file, then move into place [puppet] - 10https://gerrit.wikimedia.org/r/553746 (https://phabricator.wikimedia.org/T238646) [15:02:20] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2228.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911291502_filippo_222670_... [15:02:22] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2228.codfw.wmnet'] ` Of which those **FAILED**: ` ['mw2228.codfw.wmnet'] ` [15:02:29] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2228.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911291502_filippo_222681_... [15:02:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [15:05:35] (03PS6) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 [15:08:02] (03PS15) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [15:08:06] (03CR) 10Jbond: "thanks, updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [15:08:40] (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1003/19688/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris) [15:09:14] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:09:16] (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [15:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:09] (03CR) 10Hashar: "The CI build here targets unstable and fails due to a lintian error:" [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar) [15:11:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:19] (03PS1) 10Jbond: apereo_cas: add localhost to list of allowed prometheus scrappers [puppet] - 10https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934) [15:14:47] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add localhost to list of allowed prometheus scrappers [puppet] - 10https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [15:16:46] (03PS16) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 [15:18:12] 10Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (10Aklapper) 
p:05High→03Triage a:05Timothycrice→03None @Timothy.davis18: Hi, is this still a problem now, two and a half years later? Or has this problem solved itself? Thanks! [15:19:56] (03PS2) 10Jbond: apereo_cas: add localhost to list of allowed prometheus scrappers [puppet] - 10https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934) [15:20:16] (03PS3) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [15:20:18] (03PS2) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [15:22:43] (03CR) 10Filippo Giunchedi: "After more tests on Buster and Stretch I've updated the comments on standard.cfg." [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [15:23:46] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [15:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:24] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1305.eqiad.wmnet', 'mw1265.eqiad.wmnet', 'mw1270.eqiad.wmnet'] ` and were **ALL** successful. 
[15:25:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond) [15:25:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris) [15:27:51] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on mw2223 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {ssbd, md_clear, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode [15:34:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris) [15:38:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpbb: Install python3-requests-toolbelt. [puppet] - 10https://gerrit.wikimedia.org/r/551249 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus) [15:38:16] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [15:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:34] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) [15:40:24] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:31] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10ema) p:05Triage→03Normal [15:46:02] 10Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (10ema) p:05Triage→03Normal [15:51:45] 
(03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553473 (owner: 10Giuseppe Lavagetto) [15:52:44] (03CR) 10Giuseppe Lavagetto: [C: 03+1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond) [15:58:43] PROBLEM - mediawiki-installation DSH group on mw1281 is CRITICAL: Host mw1281 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:58:50] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1282.eqiad.wmnet'] ` and were **ALL** successful. [16:12:46] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:06] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) [16:14:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2228.codfw.wmnet'] ` and were **ALL** successful. [16:17:12] !log reimage mw1274.eqiad.wmnet [16:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:39] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1274.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911291617_jiji_242202.log`. 
[16:22:03] !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw2228.codfw.wmnet [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:31] 10Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (10Aklapper) @ema: I don't understand how a task about an issue which happened 30 months ago and we're unsure if there is still a problem can have a "Medium" priority... [16:40:43] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:26] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [16:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:23] PROBLEM - mediawiki-installation DSH group on mw1323 is CRITICAL: Host mw1323 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:22:17] (03CR) 10MarcoAurelio: [C: 03+1] "Code LGTM." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: 10DannyS712) [17:24:24] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:01] (03PS4) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [17:25:03] (03PS3) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [17:26:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:59] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1271.eqiad.wmnet', 'mw1272.eqiad.wmnet', 'mw1304.eqiad.wmnet'] ` and were **ALL** successful. [17:31:06] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1274.eqiad.wmnet'] ` and were **ALL** successful. [18:09:05] PROBLEM - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [18:09:42] effie: FYI for the DSH alerts here and earlier ^^^ [18:15:02] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1323.eqiad.wmnet', 'mw1297.eqiad.wmnet', 'mw1273.eqiad.wmnet'] ` and were **ALL** successful. 
[19:02:52] volans: yeah they are hosts I haven't pooled back [19:03:05] will fix [19:15:43] RECOVERY - mediawiki-installation DSH group on mw1323 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:19:15] !log reimage mw1303.eqiad.wmnet mw1283.eqiad.wmnet [19:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:59] !log reimage mw1284.eqiad.wmnet [19:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:38] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1303.eqiad.wmnet', 'mw1283.eqiad.wmnet', 'mw1284.eqiad.wmnet'] ` The log can be found in `/var/log/... [19:24:31] RECOVERY - mediawiki-installation DSH group on mw1272 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:24:31] RECOVERY - mediawiki-installation DSH group on mw1281 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:42:47] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [19:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:54] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:17] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1285.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911291950_jiji_24084.log`.
[20:13:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [20:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:43] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:15:31] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:25] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [20:44:29] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [20:44:57] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [20:47:10] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28539 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:47:16] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28535 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:47:50] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28534 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:47:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [20:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:54] 10Operations, 10serviceops: Reimage 
all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1285.eqiad.wmnet'] ` and were **ALL** successful.
[21:12:25] !log reimage mw1302.eqiad.wmnet
[21:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:50] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1302.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911292112_jiji_41316.log`.
[21:31:28] Operations, Performance-Team, Traffic, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Krinkle) If we expect to fix it reasonably soon I suppose it's not worth reverting over indeed. I do have a gut-feeling though tha...
[21:36:05] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[21:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:09] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:18] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[22:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:47] (PS1) Hashar: Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/553780 (https://phabricator.wikimedia.org/T239482)
[22:12:15] Operations, Traffic, Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (hashar) >>!
In T200178#4444797, @ema wrote: > CI tests [[https://integration.wikimedia.org/ci/job/debian-glue/1232/console | were failing ]] due to CI slaves being j...
[22:19:27] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1302.eqiad.wmnet'] ` and were **ALL** successful.
[22:40:08] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 51.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:41:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 88.87 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:45:17] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1303.eqiad.wmnet', 'mw1283.eqiad.wmnet', 'mw1284.eqiad.wmnet'] ` and were **ALL** successful.
[22:54:45] PROBLEM - mediawiki-installation DSH group on mw1283 is CRITICAL: Host mw1283 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[23:55:29] RECOVERY - mediawiki-installation DSH group on mw1283 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups