[00:00:04] <jouncebot>	 Deploy window No deploys! (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191129T0000)
[00:22:07] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:24:45] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:25:25] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:28:55] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28559 bytes in 8.337 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:28:59] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28427 bytes in 1.467 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:29:53] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28536 bytes in 1.245 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:38:45] <wikibugs>	 (CR) TerraCodes: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/551778 (https://phabricator.wikimedia.org/T45956) (owner: TerraCodes)
[00:47:18] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1329.eqiad.wmnet'] `  and were **ALL** successful.
[01:06:00] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1309 is CRITICAL: Host mw1309 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:10:42] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1315 is CRITICAL: Host mw1315 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[01:54:01] <wikibugs>	 (CR) Vgutierrez: [C: +1] daemon: fix memory leak in Python 3.7+ [software/keyholder] - https://gerrit.wikimedia.org/r/553481 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[02:17:13] <wikibugs>	 Operations, Traffic: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (Vgutierrez) Open→Resolved p:Triage→Normal a:Vgutierrez Yep, this is most likely another occurrence of T238305
[02:17:29] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[02:17:51] <wikibugs>	 Operations, Traffic: cp1087 reboot - https://phabricator.wikimedia.org/T239449 (Vgutierrez)
[02:17:54] <wikibugs>	 Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[03:44:24] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1329 is CRITICAL: Host mw1329 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[04:09:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:30:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:06] <wikibugs>	 (PS1) Vgutierrez: ATS: Trigger update-ocsp-all iff non acme-chief certs are deployed [puppet] - https://gerrit.wikimedia.org/r/553526
[04:52:35] <wikibugs>	 (PS2) Vgutierrez: ATS: Trigger update-ocsp-all iff non acme-chief certs are deployed [puppet] - https://gerrit.wikimedia.org/r/553526
[04:55:36] * Krinkle staging on mwdebug1001 - https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/553372/
[04:58:56] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5963 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:00:38] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 2 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:00:40] <wikibugs>	 (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/19679/" [puppet] - https://gerrit.wikimedia.org/r/553526 (owner: Vgutierrez)
[05:08:33] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.5/includes/exception/MWExceptionHandler.php: 532f4aba96d85 (duration: 01m 03s)
[05:08:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1134 after schema change', diff saved to https://phabricator.wikimedia.org/P9781 and previous config saved to /var/cache/conftool/dbconfig/20191129-055845-marostegui.json
[05:58:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:24] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on cloudweb2001-dev is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (200946s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[06:14:57] <wikibugs>	 (PS1) Marostegui: mariadb: Remove db2062 from puppet [puppet] - https://gerrit.wikimedia.org/r/553530 (https://phabricator.wikimedia.org/T238726)
[06:15:19] <wikibugs>	 (PS1) Marostegui: wmnet: Remove production dns for db2062 [dns] - https://gerrit.wikimedia.org/r/553531 (https://phabricator.wikimedia.org/T238726)
[06:15:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:15] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove db2062 from puppet [puppet] - https://gerrit.wikimedia.org/r/553530 (https://phabricator.wikimedia.org/T238726) (owner: Marostegui)
[06:16:58] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Remove production dns for db2062 [dns] - https://gerrit.wikimedia.org/r/553531 (https://phabricator.wikimedia.org/T238726) (owner: Marostegui)
[06:17:39] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui) a:Marostegui→Papaul
[06:17:43] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui)
[06:18:06] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2062.codfw.wmnet - https://phabricator.wikimedia.org/T238726 (Marostegui) Host ready for @Papaul to take over
[06:19:10] <wikibugs>	 Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:10:15] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1309 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:15:03] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1315 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:22:53] <wikibugs>	 (CR) Gergő Tisza: [C: +1] GrowthExperiments: Remove unused config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[07:25:07] <effie>	 !log reimage mw1312.eqiad.wmnet mw1308.eqiad.wmnet  mw1261.eqiad.wmnet
[07:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:10] <effie>	 !reimage mw1262.eqiad.wmnet mw1278.eqiad.wmnet
[07:26:30] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1312.eqiad.wmnet', 'mw1308.eqiad.wmnet', 'mw1261.eqiad.wmnet'] ` The log can be found in `/var/log/...
[07:26:33] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1262.eqiad.wmnet', 'mw1278.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191...
[07:46:16] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki)
[07:46:43] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki)
[07:48:37] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[07:48:39] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[07:48:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:42] <logmsgbot>	 !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[07:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:49] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:50:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete partman recipe [puppet] - https://gerrit.wikimedia.org/r/553463 (owner: Muehlenhoff)
[07:55:05] <icinga-wm>	 PROBLEM - dhclient process on mw1262 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.57: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:55:43] <effie>	 ^ host is being reimaged 
[07:58:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1329 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:00:45] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (207674s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:00:45] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (207674s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:18:56] <effie>	 !log reimage mw2223.codfw.wmnet  mw2222.codfw.wmnet mw2221.codfw.wmnet  mw2220.codfw.wmnet
[08:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:18] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2223.codfw.wmnet', 'mw2222.codfw.wmnet', 'mw2221.codfw.wmnet', 'mw2220.codfw.wmnet'] ` The log can...
[08:19:37] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Nice detective work, looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[08:21:40] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:32:09] <icinga-wm>	 RECOVERY - dhclient process on mw1262 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[08:34:56] <wikibugs>	 (CR) Muehlenhoff: "A straight replace won't work here, e.g. eventlog1002 uses half a TB of data in /srv, while the recipe now applied to it only provides 80G" [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[08:38:55] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete entries from netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/553694
[08:43:37] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:43:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:59] <wikibugs>	 (CR) Vgutierrez: [C: +1] keyholder: fix memory leak in Python 3.7+ [puppet] - https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[08:45:44] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:45:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:56] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.95 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:58:10] <godog>	 hah that's a new one, filing a task :|
[08:59:09] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:59:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:42] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.025 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[09:00:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:13] <volans>	 !log temporarily disabling puppet on 'R:keyholder::agent' to merge gerrit:operations/puppet/+/553460 - T239386
[09:01:18] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:01:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:24] <stashbot>	 T239386: memory leak on keyholder-proxy on buster/python 3.7 - https://phabricator.wikimedia.org/T239386
[09:01:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:17] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[09:02:29] <wikibugs>	 (CR) Volans: [C: +2] keyholder: fix memory leak in Python 3.7+ [puppet] - https://gerrit.wikimedia.org/r/553460 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[09:03:30] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:54] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: fix link to logstash indexing errors [puppet] - https://gerrit.wikimedia.org/r/553695
[09:15:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: fix link to logstash indexing errors [puppet] - https://gerrit.wikimedia.org/r/553695 (owner: Filippo Giunchedi)
[09:18:36] <wikibugs>	 Operations, serviceops: Kubernetes emits  log level 50 - https://phabricator.wikimedia.org/T239459 (jijiki)
[09:19:25] <icinga-wm>	 PROBLEM - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[09:20:20] <volans>	 unfortunately expected, will be fixed as soon as I get a @netops online to arm homer's key
[09:22:22] <wikibugs>	 (PS2) Filippo Giunchedi: install_server: move custom partman recipes to partman/custom [puppet] - https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955)
[09:23:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] install_server: move custom partman recipes to partman/custom [puppet] - https://gerrit.wikimedia.org/r/553482 (https://phabricator.wikimedia.org/T156955) (owner: Filippo Giunchedi)
[09:23:58] <effie>	 !log reimage mw1263.eqiad.wmnet mw1307.eqiad.wmnet
[09:24:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:20] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1263.eqiad.wmnet', 'mw1307.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191...
[09:26:47] <icinga-wm>	 PROBLEM - Keyholder SSH agent on netmon1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder
[09:27:33] <volans>	 damn, too slow, this will recover now
[09:28:25] <icinga-wm>	 RECOVERY - Keyholder SSH agent on netmon1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[09:33:34] <marostegui>	 !log Stop replication on db2105 (s3 codfw) for schema change
[09:33:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:53] <marostegui>	 !log Remove triggers from db2094:3313 - T234704
[09:33:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:58] <stashbot>	 T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[09:34:51] <wikibugs>	 (Abandoned) Jbond: debdeploy: ignore emacs for service restarts [puppet] - https://gerrit.wikimedia.org/r/507056 (owner: Jbond)
[09:35:32] <wikibugs>	 (Abandoned) Jbond: decom cp3011-22: Removing old DNS entries [dns] - https://gerrit.wikimedia.org/r/507010 (https://phabricator.wikimedia.org/T130883) (owner: Jbond)
[09:36:20] <wikibugs>	 (Abandoned) Jbond: varnish: ratelimit thumbor - cache_upload backend [puppet] - https://gerrit.wikimedia.org/r/514019 (https://phabricator.wikimedia.org/T224434) (owner: Jbond)
[09:40:15] <wikibugs>	 (CR) Jbond: "bump, any chance of a review (it's a simple one-line patch)" [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[09:41:23] <wikibugs>	 (Abandoned) Jbond: backup::host: remove day and jobsdefault [puppet] - https://gerrit.wikimedia.org/r/547559 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[09:41:54] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1278.eqiad.wmnet'] `  Of which those **FAILED**: ` ['mw1278.eqiad.wmnet'] `
[09:42:39] <wikibugs>	 (PS1) Ema: ATS: allow connection reuse for requests with Authorization [puppet] - https://gerrit.wikimedia.org/r/553696 (https://phabricator.wikimedia.org/T238494)
[09:43:39] <wikibugs>	 (CR) Jbond: "Anything still outstanding on this?" [puppet] - https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: Jbond)
[09:45:29] <wikibugs>	 (CR) Vgutierrez: [C: +1] ATS: allow connection reuse for requests with Authorization [puppet] - https://gerrit.wikimedia.org/r/553696 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:46:24] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:46:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:36] <wikibugs>	 (PS13) Jbond: puppet-merge: refactor [puppet] - https://gerrit.wikimedia.org/r/544214
[09:49:23] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:31] <icinga-wm>	 ACKNOWLEDGEMENT - Keyholder SSH agent on cumin2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. Volans Waiting for @netops to arm the key as I don't have access https://wikitech.wikimedia.org/wiki/Keyholder
[09:51:48] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:51:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] <wikibugs>	 (PS2) Filippo Giunchedi: Remove obsolete entries from netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/553694 (owner: Muehlenhoff)
[09:53:50] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[09:53:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:56] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:53:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, PS2 rebased" [puppet] - https://gerrit.wikimedia.org/r/553694 (owner: Muehlenhoff)
[09:56:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete entries from netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/553694 (owner: Muehlenhoff)
[10:02:23] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1262 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[10:03:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:12:15] <wikibugs>	 (PS1) Volans: keyholder proxy: do not notify the agent [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386)
[10:14:58] <wikibugs>	 (PS14) Jbond: puppet-merge: refactor [puppet] - https://gerrit.wikimedia.org/r/544214
[10:15:00] <wikibugs>	 (PS2) Volans: keyholder: notify only the proxy on proxy changes [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386)
[10:15:02] <wikibugs>	 (PS4) Jbond: puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943
[10:16:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:16:42] <wikibugs>	 (CR) Jbond: "Hi All Any chance on another pass on this and https://gerrit.wikimedia.org/r/c/operations/puppet/+/544943" [puppet] - https://gerrit.wikimedia.org/r/544214 (owner: Jbond)
[10:18:18] <wikibugs>	 (CR) Jbond: "@alex, wonder if you have an answer to moritz's questions?" [puppet] - https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: Jbond)
[10:19:25] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[10:21:03] <wikibugs>	 (CR) Volans: "Compiler results here:" [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[10:22:33] <effie>	 !log reimage mw1306.eqiad.wmnet mw1264.eqiad.wmnet mw1279.eqiad.wmnet
[10:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:53] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1306.eqiad.wmnet', 'mw1264.eqiad.wmnet', 'mw1279.eqiad.wmnet'] ` The log can be found in `/var/log/...
[10:22:56] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[10:25:51] <wikibugs>	 (CR) Volans: [C: +2] keyholder: notify only the proxy on proxy changes [puppet] - https://gerrit.wikimedia.org/r/553700 (https://phabricator.wikimedia.org/T239386) (owner: Volans)
[10:25:54] <wikibugs>	 (PS2) Jbond: promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934)
[10:26:28] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] promethus: add the metrics overlay to provide prometheus support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/551209 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[10:27:00] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki) >>! In T239054#5701191, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['mw1278.eqiad.wmnet'] > ` >  > Of which those **FAILED**: > ` > ['mw1278.eqiad.wmnet'] > `  I checked the...
[10:28:31] <wikibugs>	 (Abandoned) Jbond: CI - pytohn3: make python3 the default for tests [puppet] - https://gerrit.wikimedia.org/r/550437 (owner: Jbond)
[10:44:26] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1308.eqiad.wmnet'] `  Of which those **FAILED**: ` ['mw1308.eqiad.wmnet'] `
[10:46:03] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[10:46:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:45] <logmsgbot>	 !log elukey@deploy1001 Started deploy [analytics/refinery@97015e4] (thin): Deploy thin Analytics Refinery (no jars/git-fat-obj) to notebook and labstore hosts
[10:47:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:53] <logmsgbot>	 !log elukey@deploy1001 Finished deploy [analytics/refinery@97015e4] (thin): Deploy thin Analytics Refinery (no jars/git-fat-obj) to notebook and labstore hosts (duration: 00m 08s)
[10:47:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:30] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:06] <wikibugs>	 (CR) Alexandros Kosiaris: "Sigh. I had a response since Nov 8 and never pressed "Reply". Sorry about that." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: Jbond)
[10:57:31] <wikibugs>	 (PS2) Jbond: apereo_cas: add prometheus actuator [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934)
[10:58:04] <wikibugs>	 (CR) Urbanecm: [C: +1] GrowthExperiments: Remove unused config var [mediawiki-config] - https://gerrit.wikimedia.org/r/552502 (owner: Kosta Harlan)
[10:58:17] <effie>	 log reimage mw1268.eqiad.wmnet mw1280.eqiad.wmnet  mw1281.eqiad.wmne
[10:58:36] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1268.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1281.eqiad.wmnet'] ` The log can be found in `/var/log/...
[10:58:40] <wikibugs>	 (CR) Volans: "LGTM, two nits inline" (2 comments) [puppet-lint/wmf_styleguide-check] - https://gerrit.wikimedia.org/r/531893 (owner: Jbond)
[10:58:47] <Urbanecm>	 effie: you missed the !, fyi
[10:58:54] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[10:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:03] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:01:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:38] <wikibugs>	 Operations, Core Platform Team, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, serviceops-radar: wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50  - https://phabricator.wikimedia.org/T239459 (akosiaris) p:Triage→Low
[11:03:37] <Lucas_WMDE>	 !log <effie> 10:58:17 log reimage mw1268.eqiad.wmnet mw1280.eqiad.wmnet mw1281.eqiad.wmne
[11:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:44] <Lucas_WMDE>	 ^ better than nothing Urbanecm
[11:04:06] <Urbanecm>	 thx Lucas_WMDE
[11:04:14] <Lucas_WMDE>	 (eh, I should’ve removed the “log” word ^^)
[11:08:51] <wikibugs>	 Operations, Dumps-Generation, hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (ArielGlenn) Open→Resolved a:ArielGlenn This is so done. So very very done. Thanks everyone!
[11:08:53] <wikibugs>	 Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn)
[11:14:12] <wikibugs>	 (CR) Muehlenhoff: "One comment inline, but feel free to ignore, this would also work." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:14:31] <wikibugs>	 (PS3) Jbond: apereo_cas: add prometheus actuator [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934)
[11:20:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:20:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:44] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:27] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:22:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:37] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:57] <wikibugs>	 Operations, Core Platform Team, Mobile-Content-Service, Product-Infrastructure-Team-Backlog, serviceops-radar: wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (jijiki) I am afraid this  is not limited to wikifeeds on...
[11:30:37] <wikibugs>	 (CR) Jbond: "thanks" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: Jbond)
[11:33:08] <icinga-wm>	 PROBLEM - Host mw1308 is DOWN: PING CRITICAL - Packet loss = 100%
[11:33:16] <effie>	 ^ me 
[11:34:36] <icinga-wm>	 RECOVERY - Host mw1308 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[11:34:59] <effie>	 !log reimage mw1268.eqiad.wmnet mw1280.eqiad.wmnet  mw1281.eqiad.wmnet
[11:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:05] <effie>	 thanx Urbanecm 
[11:35:23] <effie>	 and Lucas_WMDE 
[11:35:47] <effie>	 sorry, there have been a lot of notifications for this channel with my nicknames :p
[11:36:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond)
[11:37:04] <Lucas_WMDE>	 no problem ^^
[11:39:16] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1263.eqiad.wmnet', 'mw1307.eqiad.wmnet'] `  and were **ALL** successful.
[11:39:58] <jynus>	 !log disabling puppet on dbprov2001 to test recoveries
[11:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:59] <wikibugs>	 10Operations, 10CX-cxserver, 10Core Platform Team, 10Mobile-Content-Service, and 2 others: wikifeeds (Service-runner app) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) So it does look like it's service-runner specific then.
[11:46:04] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, minus the two notes by Volans." [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/531893 (owner: 10Jbond)
[11:46:24] <wikibugs>	 10Operations, 10CX-cxserver, 10Core Platform Team, 10Mobile-Content-Service, and 2 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50  - https://phabricator.wikimedia.org/T239459 (10akosiaris)
[11:47:05] <wikibugs>	 (03PS2) 10Jbond: wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/531893
[11:49:01] <wikibugs>	 (03PS3) 10Jbond: wmf_style: add contain to this list of include like types [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/531893
[11:49:08] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[11:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:46] <wikibugs>	 (03CR) 10Jbond: "updated thanks" (032 comments) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/531893 (owner: 10Jbond)
[11:51:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] apereo_cas: add prometheus actuator [puppet] - 10https://gerrit.wikimedia.org/r/551212 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond)
[11:51:22] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "None on my side. +1ed" [puppet] - 10https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond)
[11:58:59] <effie>	 !log reimage mw1305.eqiad.wmnet mw1265.eqiad.wmnet mw1270.eqiad.wmnet
[11:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:32] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1305.eqiad.wmnet', 'mw1265.eqiad.wmnet', 'mw1270.eqiad.wmnet'] ` The log can be found in `/var/log/...
[12:00:48] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2222 is CRITICAL: Host mw2222 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[12:19:18] <icinga-wm>	 RECOVERY - Keyholder SSH agent on cumin2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[12:22:40] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[12:22:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:46] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:09] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[12:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:23] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[12:34:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:18] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:29] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:37:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:34] <jynus>	 !log disabling puppet also on backup1001 to test recoveries
[12:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:11] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[12:57:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:20] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:53] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: allow connection reuse for requests with Authorization [puppet] - 10https://gerrit.wikimedia.org/r/553696 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema)
[13:04:43] <Jhs>	 Is there some maintenance script that can be run to manually update Special:Statistics? When new wikis are imported, Special:Statistics says "0" in all fields for several days after the import is finished. Case in point: https://szy.wikipedia.org/wiki/sazumaay:Statistics
[13:08:11] <apergos>	 there is a maintenance script that could be run: updateSpecialPages.php
[13:08:52] <apergos>	 a cron job already does this, running once every 3 days
[13:11:26] <Jhs>	 apergos, then the question becomes, why aren't the stats on szywiki updated yet? The import was done 8 days ago
[13:12:26] <apergos>	 I don't know. that's worth investigating
[13:13:01] <apergos>	 let me see if that cronjob actually still runs or if the manifest isn't active
[13:13:56] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Error while trying to restore sodium contents:  `lines=10 29-Nov 13:04 backup1001.eqiad.wmnet JobId 162656: Start Restore Job RestoreFiles.2019-11-29...
[13:14:05] <apergos>	 it's active
[13:14:23] <apergos>	 let me make sure I am reading the crontab correctly and that it's every 3 days and not every 10 days or something
[13:14:52] <apergos>	 nope, it's every 3 days all right 
[13:16:01] <apergos>	 Nov 28 05:00:01 mwmaint1002 CRON[128049]: (www-data) CMD (flock -n /var/lock/update-special-pages /usr/local/bin/foreachwiki updateSpecialPages.php > /var/log/mediawiki/updateSpecialPages.log 2>&1)
[13:16:12] <apergos>	 this ran yesterday
[13:16:25] <apergos>	 let me see what the log shows
[13:18:26] <apergos>	 it ran and there are messages indicating updates. they do not include the article count etc however.
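(Editor's note) The cron entry quoted above wraps the run in `flock -n`, so a run that starts while the previous one still holds the lock exits immediately instead of queueing. Below is a minimal Python sketch of that non-blocking lock behaviour, assuming a Unix host; `try_lock` and the temporary lock path are hypothetical names for illustration, not part of the Wikimedia tooling.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is already locked."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail at once instead of waiting, like `flock -n`.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock file in a scratch directory (the real job uses /var/lock/update-special-pages).
lock_path = os.path.join(tempfile.mkdtemp(), "update-special-pages.lock")

first = try_lock(lock_path)   # first run acquires the lock
second = try_lock(lock_path)  # an overlapping run is refused
print("first run got the lock:", first is not None)
print("overlapping run skipped:", second is None)
first.close()                 # closing the file releases the lock
```

Since the entry redirects output with `>`, each run that starts overwrites the previous log, so the log apergos reads reflects only the latest invocation.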
[13:20:11] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2223.codfw.wmnet'] `  Of which those **FAILED**: ` ['mw2223.codfw.wmnet'] `
[13:20:36] <apergos>	 the pages linked from https://szy.wikipedia.org/wiki/sazumaay:SpecialPages are updated, the ones in the maintenance reports
[13:21:57] <apergos>	 was the import done as part of wiki creation?
[13:22:48] <wikibugs>	 (03PS3) 10Jbond: package_builder: clean up build and results directory [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713)
[13:22:56] <apergos>	 if so, I would suggest you open a task in phabricator to update the specific statistic (requires running that script with a specific argument, I think) as part of wiki creation
[13:23:30] <apergos>	 in general those particular stats (articles, pages) are not updated by script but as new articles/pages are created the numbers are incremented
[13:24:29] <Jhs>	 usually it will take a few days from when i import a new wiki until the stats are updated, but it has taken unusually long in this case
[13:26:11] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Followup to backup1001 bacula switchover (misc pending tasks) - https://phabricator.wikimedia.org/T238048 (10jcrespo) Same for bast1001:   `lines=10 29-Nov 13:19 backup1001.eqiad.wmnet JobId 162657: Start Restore Job RestoreFiles.2019-11-29_13.18.59_50 29-Nov 13:19 b...
[13:26:33] <wikibugs>	 (03CR) 10Jbond: "thanks, updated with timings suggested by alex, although i went for a more conservative 2 weeks on the build dir.  happy to lower it though" [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: 10Jbond)
[13:26:35] <wikibugs>	 (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/elastalert] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/553734
[13:27:32] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[13:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:16] <wikibugs>	 (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735
[13:29:11] <apergos>	 I see a few new pages in recent changes and they have not registered in special:statistics, that seems odd to me
[13:29:23] <apergos>	 I wonder if something has changed in the way that's updated since I looked at it last
[13:29:41] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:29:43] <apergos>	 I do suggest a task in phabricator to get people who pay attention to that code to look at it
[13:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:46] <wikibugs>	 (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736
[13:29:53] <Christian75_>	 Jhs: Have you tried updateArticleCount (https://www.mediawiki.org/wiki/Manual:UpdateArticleCount.php)
[13:32:49] <apergos>	 https://www.mediawiki.org/wiki/Manual:InitSiteStats.php  is probably what was needed, but I would expect that to have been run after the import
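(Editor's note) A toy illustration of the distinction apergos draws: counters that are incremented as pages are created will miss rows loaded in bulk until something recounts from the stored data. This is not MediaWiki code; `ToyWiki` and its methods are hypothetical, and the assumption that an import bypasses the incremental counters is only one possible explanation for the zeros Jhs reports.

```python
class ToyWiki:
    """Hypothetical model: a page store plus a cached page counter."""

    def __init__(self):
        self.pages = []
        self.stat_pages = 0  # cached counter, as shown on a statistics page

    def create_page(self, title):
        # Normal edit path: store the page and bump the cached counter.
        self.pages.append(title)
        self.stat_pages += 1

    def bulk_import(self, titles):
        # Assumed import path: rows are written directly, counter untouched.
        self.pages.extend(titles)

    def recount(self):
        # Full recount from stored data, the role a script like initSiteStats.php would play.
        self.stat_pages = len(self.pages)


wiki = ToyWiki()
wiki.bulk_import(["A", "B", "C"])
print("after import:", wiki.stat_pages)
wiki.recount()
print("after recount:", wiki.stat_pages)
```

In this model the counter stays stale after the import and only a recount brings it in line with the stored pages, which is why a one-off recount after import is the suggested fix rather than the periodic special-pages job.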
[13:33:13] <wikibugs>	 (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/elastalert] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/553734 (owner: 10Hashar)
[13:33:25] <jynus>	 !log reenable puppet on dbprov2001, backup1001
[13:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:15] <apergos>	 https://phabricator.wikimedia.org/T237369  maybe you could ask about it here and re-open the task
[13:43:50] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2221 is CRITICAL: Host mw2221 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[13:44:16] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki) >>! In T239054#5701280, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['mw1308.eqiad.wmnet'] > ` >  > Of which those **FAILED**: > ` > ['mw1308.eqiad.wmnet'] > `  I run and che...
[13:44:32] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1306.eqiad.wmnet', 'mw1264.eqiad.wmnet', 'mw1279.eqiad.wmnet'] `  and were **ALL** successful.
[13:46:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[13:46:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:44] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:31] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1268 is CRITICAL: Host mw1268 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[13:56:12] <wikibugs>	 (03PS2) 10Ema: ATS: re-use origin server connections for matching IPs [puppet] - 10https://gerrit.wikimedia.org/r/553490 (https://phabricator.wikimedia.org/T238494)
[14:00:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi)
[14:02:01] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1268 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:02:01] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2221 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:02:01] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2222 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[14:02:06] <effie>	 !log reimage mw1271.eqiad.wmnet mw1272.eqiad.wmnet mw1304.eqiad.wmnet
[14:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:24] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1271.eqiad.wmnet', 'mw1272.eqiad.wmnet', 'mw1304.eqiad.wmnet'] ` The log can be found in `/var/log/...
[14:09:36] <wikibugs>	 (03PS2) 10Hashar: Configuration for gbp buildpackage and fix patches [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735
[14:09:38] <wikibugs>	 (03PS1) 10Hashar: Revert local hack to sources [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553741
[14:10:36] <wikibugs>	 (03PS3) 10Hashar: Configuration for gbp buildpackage [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735
[14:13:33] <godog>	 !log reimage mw2228 for partman tests
[14:13:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:28] <wikibugs>	 (03CR) 10Hashar: "I have reverted the 3 commits that made change to the upstream source and then manually updated debian/patches/reqfix .  It is simple enou" [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553741 (owner: 10Hashar)
[14:14:45] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2228.codfw.wmnet
[14:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:56] <wikibugs>	 (03CR) 10Hashar: "The build pass but lintian reports failures:" (032 comments) [debs/pynetbox] (debian) - 10https://gerrit.wikimedia.org/r/553735 (owner: 10Hashar)
[14:17:32] <wikibugs>	 (03PS1) 10Elukey: WIP - Introduce profile::mariadb::misc::analytics [puppet] - 10https://gerrit.wikimedia.org/r/553742 (https://phabricator.wikimedia.org/T234826)
[14:23:51] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) 05Open→03Resolved >>! In T237687#5679746, @Krinkle wrote: >> `    if ts.client_request.get_url_host() == 'appservers-rw.svc.wmnet' then` >  > Looks like condition...
[14:23:53] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (10ema)
[14:25:36] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[14:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:46] <wikibugs>	 (03PS1) 10Jbond: profile::prometheus::ops: add scraper for apero_cas idp service [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934)
[14:27:44] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:03] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744
[14:31:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] package_builder: clean up build and results directory [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: 10Jbond)
[14:32:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris)
[14:32:01] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1268.eqiad.wmnet', 'mw1280.eqiad.wmnet', 'mw1281.eqiad.wmnet'] `  and were **ALL** successful.
[14:35:08] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744
[14:35:15] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[14:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:36] <effie>	 !log reimage mw1323.eqiad.wmnet mw1297.eqiad.wmnet mw1273.eqiad.wmnet
[14:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:23] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:27] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1323.eqiad.wmnet', 'mw1297.eqiad.wmnet', 'mw1273.eqiad.wmnet'] ` The log can be found in `/var/log/...
[14:38:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] package_builder: clean up build and results directory [puppet] - 10https://gerrit.wikimedia.org/r/549815 (https://phabricator.wikimedia.org/T237713) (owner: 10Jbond)
[14:39:01] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744
[14:40:02] <wikibugs>	 (03PS1) 10Jbond: Revert "package_builder: clean up build and results directory" [puppet] - 10https://gerrit.wikimedia.org/r/553745
[14:41:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "package_builder: clean up build and results directory" [puppet] - 10https://gerrit.wikimedia.org/r/553745 (owner: 10Jbond)
[14:43:23] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:44:37] <wikibugs>	 10Operations, 10Pybal, 10SRE-tools, 10Traffic, 10serviceops: Applications and scripts need to be able to understand the pooled status of servers in our load balancers. - https://phabricator.wikimedia.org/T239392 (10akosiaris) `need to be able to understand the pooled status`  I have to question this. Why...
[14:45:28] <effie>	 !log reimage mw1282.eqiad.wmne
[14:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:46] <effie>	 !log reimage mw1282.eqiad.wmnet
[14:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:03] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1282.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911291445_jiji_220572.log`.
[14:48:02] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 556 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:49:33] <wikibugs>	 (03CR) 10Muehlenhoff: profile::prometheus::ops: add scraper for apero_cas idp service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond)
[14:51:37] <wikibugs>	 (03PS2) 10Jbond: profile::prometheus::ops: add scraper for apero_cas idp service [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934)
[14:52:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "overall LGTM - see two nitpicks and what I think is just a simple error." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond)
[14:52:44] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744
[14:56:51] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744
[14:59:04] <wikibugs>	 (03PS2) 10Hashar: Update debian/changelog to point to unstable [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736
[14:59:30] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[14:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:38] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:01:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:44] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (10ema) >>! In T238494#5698785, @Krinkle wrote: > Latency remains elevated. Do we have a status update or better idea about the root...
[15:02:09] <wikibugs>	 (03PS1) 10ArielGlenn: make cirrussearch dumps write output to a temp file, then move into place [puppet] - 10https://gerrit.wikimedia.org/r/553746 (https://phabricator.wikimedia.org/T238646)
[15:02:20] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2228.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911291502_filippo_222670_...
[15:02:22] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2228.codfw.wmnet'] `  Of which those **FAILED**: ` ['mw2228.codfw.wmnet'] `
[15:02:29] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by filippo on cumin1001.eqiad.wmnet for hosts: ` mw2228.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201911291502_filippo_222681_...
[15:02:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond)
[15:05:35] <wikibugs>	 (03PS6) 10Alexandros Kosiaris: prometheus: scrape calico felix agent [puppet] - 10https://gerrit.wikimedia.org/r/553744
[15:08:02] <wikibugs>	 (03PS15) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214
[15:08:06] <wikibugs>	 (03CR) 10Jbond: "thanks, updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond)
[15:08:40] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "PCC at https://puppet-compiler.wmflabs.org/compiler1003/19688/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris)
[15:09:14] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[15:09:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond)
[15:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:09] <wikibugs>	 (03CR) 10Hashar: "The CI build here targets unstable and fails due to a lintian error:" [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/553736 (owner: 10Hashar)
[15:11:25] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:19] <wikibugs>	 (03PS1) 10Jbond: apereo_cas: add localhost to list of allowed prometheus scrappers [puppet] - 10https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934)
[15:14:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: add localhost to list of allowed prometheus scrappers [puppet] - 10https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond)
[15:16:46] <wikibugs>	 (03PS16) 10Jbond: puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214
[15:18:12] <wikibugs>	 10Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (10Aklapper) p:05High→03Triage a:05Timothycrice→03None @Timothy.davis18: Hi, is this still a problem now, two and a half years later? Or has this problem solved itself? Thanks!
[15:19:56] <wikibugs>	 (03PS2) 10Jbond: apereo_cas: add localhost to list of allowed prometheus scrappers [puppet] - 10https://gerrit.wikimedia.org/r/553750 (https://phabricator.wikimedia.org/T233934)
[15:20:16] <wikibugs>	 (03PS3) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955)
[15:20:18] <wikibugs>	 (03PS2) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955)
[15:22:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: "After more tests on Buster and Stretch I've updated the comments on standard.cfg." [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi)
[15:23:46] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime
[15:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:24] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1305.eqiad.wmnet', 'mw1265.eqiad.wmnet', 'mw1270.eqiad.wmnet'] `  and were **ALL** successful.
[15:25:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553743 (https://phabricator.wikimedia.org/T233934) (owner: 10Jbond)
[15:25:52] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris)
[15:27:51] <icinga-wm>	 PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on mw2223 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {ssbd, md_clear, flush_l1d} https://wikitech.wikimedia.org/wiki/Microcode
[15:34:41] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/553744 (owner: 10Alexandros Kosiaris)
[15:38:07] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] httpbb: Install python3-requests-toolbelt. [puppet] - 10https://gerrit.wikimedia.org/r/551249 (https://phabricator.wikimedia.org/T236699) (owner: 10RLazarus)
[15:38:16] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[15:38:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:34] <wikibugs>	 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata)
[15:40:24] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[15:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:31] <wikibugs>	 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10ema) p:05Triage→03Normal
[15:46:02] <wikibugs>	 10Operations, 10Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (10ema) p:05Triage→03Normal
[15:51:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/553473 (owner: 10Giuseppe Lavagetto)
[15:52:44] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] puppet-merge: refactor [puppet] - 10https://gerrit.wikimedia.org/r/544214 (owner: 10Jbond)
[15:58:43] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1281 is CRITICAL: Host mw1281 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:58:50] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1282.eqiad.wmnet'] `  and were **ALL** successful.
[16:12:46] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[16:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:06] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10jijiki)
[16:14:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:30] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2228.codfw.wmnet'] `  and were **ALL** successful.
[16:17:12] <effie>	 !log reimage mw1274.eqiad.wmnet
[16:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:39] <wikibugs>	 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1274.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911291617_jiji_242202.log`.
[16:22:03] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=yes; selector: name=mw2228.codfw.wmnet
[16:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:31] <wikibugs>	 Operations, Traffic: 404 loading images from Virgin Media - https://phabricator.wikimedia.org/T161360 (Aklapper) @ema: I don't understand how a task about an issue which happened 30 months ago and we're unsure if there is still a problem can have a "Medium" priority...
[16:40:43] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[16:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:55] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:42:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:26] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[16:49:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:39] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:23] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1323 is CRITICAL: Host mw1323 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[17:22:17] <wikibugs>	 (CR) MarcoAurelio: [C: +1] "Code LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370) (owner: DannyS712)
[17:24:24] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[17:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:01] <wikibugs>	 (PS4) Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955)
[17:25:03] <wikibugs>	 (PS3) Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955)
[17:26:32] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:59] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1271.eqiad.wmnet', 'mw1272.eqiad.wmnet', 'mw1304.eqiad.wmnet'] `  and were **ALL** successful.
[17:31:06] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1274.eqiad.wmnet'] `  and were **ALL** successful.
[18:09:05] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1272 is CRITICAL: Host mw1272 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[18:09:42] <volans>	 effie: FYI for the DSH alerts here and earlier ^^^
[18:15:02] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1323.eqiad.wmnet', 'mw1297.eqiad.wmnet', 'mw1273.eqiad.wmnet'] `  and were **ALL** successful.
[19:02:52] <effie>	 volans: yeah they are hosts I haven't pooled back
[19:03:05] <effie>	 will fix
[19:15:43] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1323 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:19:15] <effie>	 !log reimage mw1303.eqiad.wmnet mw1283.eqiad.wmnet
[19:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:59] <effie>	 !log reimage mw1284.eqiad.wmnet
[19:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:38] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1303.eqiad.wmnet', 'mw1283.eqiad.wmnet', 'mw1284.eqiad.wmnet'] ` The log can be found in `/var/log/...
[19:24:31] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1272 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:24:31] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1281 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:42:47] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[19:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:54] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[19:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:17] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1285.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911291950_jiji_24084.log`.
[20:13:25] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[20:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:43] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:15:31] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:15:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:25] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:44:29] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:44:57] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:47:10] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28539 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:47:16] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28535 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:47:50] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28534 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:47:58] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[20:48:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:08] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:54] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1285.eqiad.wmnet'] `  and were **ALL** successful.
[21:12:25] <effie>	 !log reimage mw1302.eqiad.wmnet
[21:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:50] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1302.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911292112_jiji_41316.log`.
[21:31:28] <wikibugs>	 Operations, Performance-Team, Traffic, Patch-For-Review: 200ms / 50% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (Krinkle) If we expect to fix it reasonably soon I suppose it's not worth reverting over indeed. I do have a gut-feeling though tha...
[21:36:05] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[21:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:09] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:38:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:18] <logmsgbot>	 !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[22:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:25] <logmsgbot>	 !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:47] <wikibugs>	 (PS1) Hashar: Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/553780 (https://phabricator.wikimedia.org/T239482)
[22:12:15] <wikibugs>	 Operations, Traffic, Patch-For-Review: Traffic Server packaging and initial puppetization - https://phabricator.wikimedia.org/T200178 (hashar) >>! In T200178#4444797, @ema wrote: > CI tests [[https://integration.wikimedia.org/ci/job/debian-glue/1232/console  | were failing ]] due to CI slaves being j...
[22:19:27] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1302.eqiad.wmnet'] `  and were **ALL** successful.
[22:40:08] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 51.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:41:52] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 88.87 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:45:17] <wikibugs>	 Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1303.eqiad.wmnet', 'mw1283.eqiad.wmnet', 'mw1284.eqiad.wmnet'] `  and were **ALL** successful.
[22:54:45] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1283 is CRITICAL: Host mw1283 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[23:55:29] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1283 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups