[00:01:52] (CR) Alex Monk: "need to figure out whether this is a problem due to T216497 or not. last comment implies we can still pin, but I'm not sure this makes sen" [puppet] - https://gerrit.wikimedia.org/r/497009 (owner: Alex Monk)
[00:02:49] (PS2) CRusnov: Specify the actual correct parameters to the uwsgi::app [puppet] - https://gerrit.wikimedia.org/r/497016
[00:03:01] PROBLEM - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring
[00:03:42] What's this ?
[00:03:44] it paged
[00:03:57] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[00:04:16] * arturo paged
[00:04:55] tools thinks ldap is broken? or the tools checker itself is broken?
[00:05:10] ldap appears up
[00:07:12] oh, there's an alarm for ldap itself too?
[00:07:19] hm
[00:07:55] ah
[00:08:10] stuff is falling back to using the codfw ldap server I think
[00:08:12] eqiad one is broken?
[00:08:22] `ldapsearch -H ldap://ldap-labs.codfw.wikimedia.org:389 -x uid=krenair` is fine
[00:08:33] krenair@krenair-clientpackages-py3-jessie:~$ ldapsearch -H ldap://ldap-labs.eqiad.wikimedia.org:389 -x uid=krenair
[00:08:34] ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
[00:11:11] (PS4) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992)
[00:11:24] (CR) CRusnov: "compile looks good" [puppet] - https://gerrit.wikimedia.org/r/497016 (owner: CRusnov)
[00:11:41] arturo, I know there's been issues with LDAP recently, is this anything like that?
[00:17:32] (CR) CDanis: [C: +1] Specify the actual correct parameters to the uwsgi::app (1 comment) [puppet] - https://gerrit.wikimedia.org/r/497016 (owner: CRusnov)
[00:18:09] (CR) CRusnov: [C: +2] Specify the actual correct parameters to the uwsgi::app [puppet] - https://gerrit.wikimedia.org/r/497016 (owner: CRusnov)
[00:19:31] !log restarting slapd on seaborgium
[00:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:51] RECOVERY - toolschecker: Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.438 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring
[00:20:47] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.117 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting
[00:27:45] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1048 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:30:47] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1122 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:33:13] RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 1106 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:33:31] RECOVERY - High lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1143 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:41:15] (PS3) CRusnov: Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526)
[00:41:44] (CR) jerkins-bot: [V: -1] Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526) (owner: CRusnov)
[00:45:04] (PS4) CRusnov: Add report which checks against puppetdb and compares serial numbers [software/netbox-reports] - https://gerrit.wikimedia.org/r/495267 (https://phabricator.wikimedia.org/T212526)
[01:02:09] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1156 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:04:13] RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 987 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:04:27] RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 988 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:50:41] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1156 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:15:23] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational
[04:29:45] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[04:33:04] Operations, VisualEditor, Readers-Web-Backlog (Tracking), Wikimedia-production-error: [Bug] Sporadic 503 errors when editing - https://phabricator.wikimedia.org/T218252 (Krinkle) @Jdlrobson What work / by whom is this tracking, and are they aware of that?
[05:55:23] (PS1) MaxSem: Disable display_errors in FPM mode [puppet] - https://gerrit.wikimedia.org/r/497024 (https://phabricator.wikimedia.org/T218005)
[06:15:29] (PS2) Krinkle: Disable display_errors in FPM mode [puppet] - https://gerrit.wikimedia.org/r/497024 (https://phabricator.wikimedia.org/T218005) (owner: MaxSem)
[06:25:29] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is CRITICAL: 287.8 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[06:28:11] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:28:59] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:30:35] PROBLEM - puppet last run on ms-be1049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/wmf_ca_2017_2020.crt]
[06:36:51] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.666 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:37:39] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:37:49] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:42:11] PROBLEM - HTTP availability for Varnish at codfw on icinga2001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:42:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:47:03] RECOVERY - HTTP availability for Varnish at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[06:47:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[06:56:37] RECOVERY - puppet last run on ms-be1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:23] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga2001 is CRITICAL: 58.55 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:08:59] RECOVERY - puppet last run on druid1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:13:47] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga2001 is OK: (C)60 le (W)70 le 70.74 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:08:08] Operations, serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (jijiki)
[08:15:23] Operations, serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (jijiki)
[08:16:14] Operations, Scap, serviceops, User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (jijiki)
[08:16:40] Operations, Scap, serviceops, Goal, User-jijiki: SRE FY2019 Q3:TEC6: First steps towards Canary Deployments - https://phabricator.wikimedia.org/T213156 (jijiki)
[08:16:42] Operations, serviceops: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (jijiki)
[08:54:41] RECOVERY - Disk space on prometheus2003 is OK: DISK OK
[10:00:27] !log stop apache on cobalt for maintenance
[10:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:25] PROBLEM - HTTPS on cobalt is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/330/
[10:03:13] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: connect to address gerrit.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[10:03:29] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: connect to address gerrit.wikimedia.org and port 443: Connection refused https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[10:03:45] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:05:09] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:05:15] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[10:05:23] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:06:47] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:06:57] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars]
[10:08:21] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[10:08:59] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:08:59] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[10:08:59] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:08:59] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[10:09:51] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:09:59] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:10:09] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org]
[10:10:51] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:10:51] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[10:11:47] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:12:35] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[10:12:39] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[10:13:15] PROBLEM - SSH access on cobalt is CRITICAL: connect to address 208.80.154.85 and port 29418: Connection refused https://wikitech.wikimedia.org/wiki/Gerrit
[10:13:19] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports]
[10:13:57] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki]
[10:13:57] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas]
[10:14:41] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[10:14:41] PROBLEM - Check systemd state on cobalt is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:17:03] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot]
[10:17:39] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot]
[10:17:49] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[10:18:31] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 6 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater]
[10:18:59] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[10:20:07] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:22:25] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui]
[10:24:19] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas]
[10:24:59] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:25:05] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 7 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas]
[10:26:13] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:29:25] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_operations/mediawiki-config]
[10:30:23] PROBLEM - puppet last run on db2094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config]
[10:32:41] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks]
[10:33:29] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[10:34:01] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:34:21] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[10:35:41] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 6 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[10:38:17] chasemp: how long will the maintenance on cobalt be?
[10:44:29] <_joe_> p858snake: we don't know at the moment
[10:47:52] _joe_: I cannot set topic here, can you?
[10:48:03] <_joe_> chasemp: sure
[10:49:29] tx _joe_
[11:30:38] Any idea when Gerrit will work again?
[11:39:48] MGC: unfortunately no
[11:39:56] we are working on it
[12:07:49] Hmm, what is the maintenance on cobalt for? Is there something wrong with gerrit?
[12:10:46] Something like that
[12:13:51] There’s a problem with gerrit?
[12:47:50] Operations, Gerrit, Release-Engineering-Team: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472 (Paladox)
[12:50:26] paladox: it's known, we're having some issues
[12:50:36] ok
[12:50:44] Nikerabbit: are you here by any chance?
[12:54:00] Hi, when is gerrit expected to work again?
[12:54:21] Zoranzoki21: there isn't an ETA at this stage
[13:00:39] p858snake: Ok, thanks! I found T218472
[13:01:10] T218472: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472
[13:02:21] if there's anything I can do, let me know. I'm around
[13:21:10] Poke it with a stick.
[13:32:39] Operations, Gerrit, Release-Engineering-Team: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472 (matmarex) SRE (Operations) know about the problem and are working on it right now, a few folks commented about it on IRC. It's unknown yet when it will be back up.
[14:26:17] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga2001 is OK: (C)130 ge (W)110 ge 96.65 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[14:28:45] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (dbarratt) Do we know how much of a burden it will be to vary the cache based on the user-...
[14:47:57] It seems a little bit serious, do you want to send an email to wikitech-l?
[14:50:43] Amir1: I'm asking folks, but we are actively figuring things out now
[14:53:10] Thanks
[15:07:22] uh oh
[16:50:43] Hi..
[16:50:58] Were you guys doing any planned updates today?
[16:51:12] ShakespeareFan00: Is that related to gerrit?
[16:51:25] Well, I seem to not be seeing images
[16:51:26] https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/ShakespeareFan00&ilshowall=1
[16:51:41] I suspect what's happened is a local config/filter issue
[16:51:47] my end..
[16:52:11] ShakespeareFan00: Works for me
[16:52:13] but wanted to check if there was a probable external cause before digging into config files
[16:52:20] Okay I will assume it's local then
[16:55:54] works for me too.
[16:56:08] I will assume it's a local problem then...
[16:56:12] works for me too.
[16:56:13] ShakespeareFan00: check your web browser's developer tools in the network section
[16:56:19] See https://www.mediawiki.org/wiki/Help:Locating_broken_scripts
[16:57:42] Very odd
[16:57:53] Re-loaded the page and it worked
[17:58:25] Operations, Gerrit, Release-Engineering-Team: Deploy multi-site plugin to cobalt and gerrit2001 - https://phabricator.wikimedia.org/T217174 (Paladox)
[17:58:47] RECOVERY - Memory correctable errors -EDAC- on wtp2013 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops
[18:00:48] Operations, Gerrit, Release-Engineering-Team: Deploy multi-site plugin to cobalt and gerrit2001 - https://phabricator.wikimedia.org/T217174 (Paladox)
[18:06:37] What's going on with maintenance of gerrit?
[18:08:38] Zoranzoki21: see wikitech-l
[18:08:58] paravoid: I am not subscribed there
[18:09:16] Zoranzoki21: https://lists.wikimedia.org/pipermail/wikitech-l/
[18:09:28] https://lists.wikimedia.org/pipermail/wikitech-l/2019-March/091744.html
[18:10:20] Oh, found. Thanks!
[18:26:07] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
[18:26:11] RECOVERY - Check systemd state on cobalt is OK: OK - running: The system is fully operational
[18:26:47] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.11-18-gd3ca89353d (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit
[18:27:55] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27311 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[18:27:57] RECOVERY - HTTPS on cobalt is OK: SSL OK - Certificate gerrit.wikimedia.org valid until 2019-04-27 14:00:20 +0000 (expires in 41 days) https://phabricator.wikimedia.org/project/view/330/
[18:28:11] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.240 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus
[18:42:12] Operations, Gerrit, Release-Engineering-Team: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472 (LucasWerkmeister) (The error appears to have changed from “connection refused” to “connection timed out” now, though that’s probably not very significant.)
[18:42:49] Operations, Gerrit, Release-Engineering-Team: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472 (Paladox) Please see https://lists.wikimedia.org/pipermail/wikitech-l/2019-March/091744.html
[19:18:01] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:18:58] Operations, Gerrit, Release-Engineering-Team, User-greg: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472 (greg) Open→Resolved a:greg Gerrit is back, sorry for the interruption.
[19:19:03] gerrit is up and running again
[19:19:06] Operations, Gerrit, Release-Engineering-Team, User-greg: gerrit.wikimedia.org is down - https://phabricator.wikimedia.org/T218472 (greg) a:greg→None
[19:19:59] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:20:33] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:21:05] (PS1) Andrew Bogott: labpuppetmasters: disable automatic syncing of the puppet repo [puppet] - https://gerrit.wikimedia.org/r/497068
[19:21:07] (PS1) Andrew Bogott: puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069
[19:22:07] (CR) jerkins-bot: [V: -1] puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069 (owner: Andrew Bogott)
[19:22:26] (CR) Andrew Bogott: [C: +2] labpuppetmasters: disable automatic syncing of the puppet repo [puppet] - https://gerrit.wikimedia.org/r/497068 (owner: Andrew Bogott)
[19:24:59] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:25:43] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:26:01] RECOVERY - puppet last run on db2094 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:26:53] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:28:01] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:29:29] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:29:44] thanks akosiaris
[19:29:47] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:29:49] (CR) Andrew Bogott: "This introduces a linting issue but I'm not sure I want to do a big role/profile refactor for this temporary change." [puppet] - https://gerrit.wikimedia.org/r/497069 (owner: Andrew Bogott)
[19:30:27] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[19:31:09] (PS2) Andrew Bogott: puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069
[19:31:13] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:31:49] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:32:03] (CR) jerkins-bot: [V: -1] puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069 (owner: Andrew Bogott)
[19:33:37] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:33:37] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:33:49] andrewbogott: not sure if it's allowed according to the linting rules, but you could duplicate the profile::openstack::main::puppetmaster::servers hiera key under a different name puppetmaster::extra_servers, or something along those lines
[19:34:15] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:34:22] that does spread the hotfix to two locations, though
[19:34:57] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:35:18] valhallasw`cloud: I don't think that helps; the existing code is /already/ in violation of the lint, it's just that the linter only docks me for additional violations.
[19:35:25] (maybe I'm misunderstanding)
[19:35:26] ohhhh
[19:35:39] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:35:39] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:35:39] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:35:39] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:35:53] my interpretation of the error was that hiera is only allowed if it does a lookup in the same namespace
[19:36:09] I think hiera lookups aren't allowed in that code, full stop
[19:36:13] so puppetmaster::scripts can lookup puppetmaster::* but not anythingelse::*
[19:36:16] that makes more sense
[19:36:18] Since it isn't a profile
[19:36:29] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[19:36:35] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:36:47] if the linter only checks new lines that's actually pretty neat
[19:36:57] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:37:13] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[19:37:39] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:37:39] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:39:15] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:39:19] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:40:03] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:40:47] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:41:59] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:42:28] the topic needs to be updated, marostegui revi, moritzm if you're around
[19:43:09] ack, I'll take care of it
[19:43:37] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:43:37] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:43:58] thanks
[19:44:19] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:44:33] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:45:17] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[19:45:45] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[19:45:59] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:46:51] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[20:05:06] Operations, MediaWiki-ResourceLoader, Performance-Team, Traffic, Patch-For-Review: Expires header for load.php should be relative to request time instead of cache time - https://phabricator.wikimedia.org/T105657 (Krinkle) I've re-run the analysis from 2 years ago (T105657#2918703) to see what...
[20:12:05] (CR) Krinkle: [C: +1] varnish: set /w/load.php Age to 0 [puppet] - https://gerrit.wikimedia.org/r/496497 (https://phabricator.wikimedia.org/T105657) (owner: Ema)
[21:36:40] (PS3) MarcoAurelio: maintain-views: Note explicit exclusion of `oathauth_users` from replicas [puppet] - https://gerrit.wikimedia.org/r/496063 (https://phabricator.wikimedia.org/T218165)
[22:15:35] (PS1) Jforrester: [Wikitech] Enable VisualEditor in extra namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/497081
[22:17:30] (CR) Krinkle: [C: +1] [Wikitech] Enable VisualEditor in extra namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/497081 (owner: Jforrester)
[22:27:15] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:27:19] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:27:57] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:28:27] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.120 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:28:31] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[22:29:09] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 75083 bytes in 0.307 second response time https://wikitech.wikimedia.org/wiki/Application_servers