[00:00:29] yep:) [00:02:29] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:53] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/13256/thumbor1003.eqiad.wmnet/ https://puppet-compiler.wmflabs.org/compiler1001/13256/" [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:07:04] (03PS3) 10Dzahn: icinga: logging optimizations [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) [00:07:39] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [00:08:26] (03PS2) 10Niedzielski: Update: add Wikimedia logo for SEO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) [00:09:01] (03CR) 10BBlack: [C: 031] "I think this is actually pretty much ready to go and would work fine as-is, it's just fallen of our tiny radars! But while I'm reminded o" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/434055 (https://phabricator.wikimedia.org/T27611) (owner: 10Gilles) [00:18:25] (03PS1) 10Huji: Changing the language of votewiki to Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470526 (https://phabricator.wikimedia.org/T207560) [00:18:57] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:22:01] well.. that's odd but must be temp.. i just ran that. but double checking [00:23:19] (03PS3) 10Niedzielski: Update: add Wikimedia logo for SEO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) [00:23:40] (03CR) 10Niedzielski: "@krinkle, thank you. Revised." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) (owner: 10Niedzielski) [00:26:46] 10Operations, 10Analytics, 10Analytics-EventLogging, 10MediaWiki-extensions-NavigationTiming, and 2 others: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002 (10Krinkle) [00:28:58] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:32:01] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) It appears to be working. Tested in {T207887} [00:34:42] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) So how do we want to roll this out? Do it on a per-project basis while moving a project across regions? Just flip the big switch in hieradata... [00:36:14] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) [00:45:29] I need to test a config patch, will mess around on tin & mwdebug1002 a bit [00:45:43] please do not sync mediawiki without pinging me first [00:51:05] tgr, tin? 
[00:52:01] mwdeploy, or whatever the current name is
[00:52:17] setting SSH aliases and using the old name is easier than trying to keep up
[00:54:46] it's currently deploy1001.eqiad.wmnet
[00:55:02] if you have a local copy of puppet repo: grep deployment_server hieradata/common.yaml
[00:55:08] Lol, rip tin
[00:57:04] Operations, Analytics, Analytics-EventLogging, Performance-Team, Traffic: Increase EventLogging limit from 2K to 4K - https://phabricator.wikimedia.org/T208282 (Krinkle)
[00:58:24] Operations, Analytics, Analytics-EventLogging, Performance-Team, Traffic: Increase EventLogging limit from 2K to 5K - https://phabricator.wikimedia.org/T208282 (Krinkle)
[01:10:17] (PS1) Dzahn: create bienvenida.wikimedia.org for Mexico awareness campaign [dns] - https://gerrit.wikimedia.org/r/470531 (https://phabricator.wikimedia.org/T207816)
[01:10:32] (CR) jerkins-bot: [V: -1] create bienvenida.wikimedia.org for Mexico awareness campaign [dns] - https://gerrit.wikimedia.org/r/470531 (https://phabricator.wikimedia.org/T207816) (owner: Dzahn)
[01:12:19] (PS2) Dzahn: create bienvenida.wikimedia.org for Mexico awareness campaign [dns] - https://gerrit.wikimedia.org/r/470531 (https://phabricator.wikimedia.org/T207816)
[01:21:13] Operations, Release-Engineering-Team, monitoring, Performance-Team (Radar), goodfirstbug: Increase "check_legal_html" coverage to group0 wikis - https://phabricator.wikimedia.org/T208284 (Krinkle)
[01:28:17] Operations, Release-Engineering-Team, monitoring, Performance-Team (Radar), goodfirstbug: Increase "check_legal_html" coverage to group0 wikis - https://phabricator.wikimedia.org/T208284 (Dzahn) Ensure legal html en.wb On Host en.wikibooks.org is broken since 2018-07-18 [[ https://icinga.wiki...
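The 00:55:02 tip (grep deployment_server hieradata/common.yaml) can also be done from a script. The following is only a minimal sketch: the log does not show the file's contents, so the flat `key: value` layout, the sample text, and the helper name `find_hiera_value` are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical stand-in for a fragment of hieradata/common.yaml. The real
# file's layout is not shown in the log, so a flat "key: value" shape is assumed.
SAMPLE_HIERADATA = """\
# illustrative content only
some_other_key: example
deployment_server: deploy1001.eqiad.wmnet
"""


def find_hiera_value(text: str, key: str) -> Optional[str]:
    """Return the scalar value of a top-level `key: value` line, or None."""
    pattern = re.compile(r"^" + re.escape(key) + r":\s*(\S+)\s*$")
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            # Drop optional surrounding quotes around the value.
            return match.group(1).strip("'\"")
    return None


if __name__ == "__main__":
    print(find_hiera_value(SAMPLE_HIERADATA, "deployment_server"))
```

With a real checkout, the text would presumably be read from hieradata/common.yaml instead of the sample string. A proper YAML parser would be more robust if the key turns out to be nested.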
[01:29:53] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Krinkle) I'll add that if this url is meant to be typed by humans and visually transmitted in text form on social media or in images/posters, a Wikimedia subdomain may not... [01:39:44] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Dzahn) I think we should avoid adding more subdomains to the wikipedia.org name space that are not actual wikis. One of the concerns when doing 15.wikipedia.org was that it... [01:39:45] < tgr> I need to test a config patch, will mess around on tin & mwdebug1002 a bit [01:39:49] ^^^ done [01:42:09] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Dzahn) If it _is_ important because it is used for printed materials i get the point though. In that case i am thinking "i wish the w.wiki URL shortener would be enabled".... [01:45:46] (03CR) 10Dzahn: "this should be independent of the URL discussion on ticket. it seems we agree to create a micro site at something in .wikimedia.org either" [dns] - 10https://gerrit.wikimedia.org/r/470531 (https://phabricator.wikimedia.org/T207816) (owner: 10Dzahn) [01:50:53] (03PS4) 10Dzahn: icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) [01:51:39] (03CR) 10jerkins-bot: [V: 04-1] icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:52:29] (03CR) 10Dzahn: "does anyone get the jerkins issue here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:13:43] (03PS2) 10Aklapper: Order list of extensions by alphabet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 [02:28:17] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [02:30:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [02:34:47] 10Operations, 10Community-Tech, 10MediaWiki-Parser, 10Thumbor, and 5 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Samwilson) I think this just needs rebasing. [02:38:25] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack/setup cr2-eqord - https://phabricator.wikimedia.org/T204170 (10Papaul) 05Open>03Resolved Double checked, it looks like https://netbox.wikimedia.org/dcim/devices/1954/ is complete so I deleted the first one [02:41:18] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 3.311 second response time [02:44:48] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:56:04] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10atgo) This will be linked from a video campaign - we won't be showing the URL that much for folks to type, but will expect people to click through to it (and from it). 
[03:20:47] (03PS1) 10Andrew Bogott: nova-api: Allow everyone to view the hypervisor for a given VM [puppet] - 10https://gerrit.wikimedia.org/r/470540 (https://phabricator.wikimedia.org/T208099) [03:24:40] (03PS2) 10Andrew Bogott: nova-api: Allow everyone to view the hypervisor for a given VM [puppet] - 10https://gerrit.wikimedia.org/r/470540 (https://phabricator.wikimedia.org/T208099) [03:25:59] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) >>! In T41785#4704964, @Krenair wrote: > So how do we want to roll this out? Do it on a per-project basis while moving a project across region... [03:31:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 961.64 seconds [03:59:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 235.04 seconds [04:21:21] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (10Liuxinyu970226) [04:26:06] (03PS7) 10Mathew.onipe: elasticsearch: cookbook for multi-cluster services rolling restart [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) [04:28:20] (03PS9) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) [04:29:25] (03CR) 10Mathew.onipe: elasticsearch_cluster: multi-cluster/multi-instance support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [06:04:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:05:08] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 64, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:16:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:23:37] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:30:27] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt] [06:30:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 49 probes of 324 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:31:49] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py] [06:32:28] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ImageMagick-6/policy.xml] [06:35:37] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 35 probes of 324 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:52:28] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [06:57:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:50] !log graphite1001: Remove Graphite data from corrupted names under media_* and ve_* (T189530) [06:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:54] T189530: Possible statsv corruption? - https://phabricator.wikimedia.org/T189530 [06:57:58] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:58:38] !log graphite1004: Remove Graphite data from corrupted names under media_* and ve_* (T189530) [06:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:57] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:01:38] !log graphite2001: Remove Graphite data from corrupted names under media_* and ve_* (T189530) [07:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:07:28] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [07:10:27] RECOVERY - Memory correctable errors -EDAC- on 
thumbor1004 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [07:51:21] Krinkle: graphite2003 also should get the rm, planning to decom 2001 soon tho [07:54:58] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:59:17] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:57] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 66, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:37] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:02:50] (03CR) 10Filippo Giunchedi: "See nit inline, not in love with adding $::realm checks but looks like the least intrusive change." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244) (owner: 10Andrew Bogott) [08:04:26] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/470445 (https://phabricator.wikimedia.org/T208244) (owner: 10Andrew Bogott) [08:06:00] (03CR) 10Filippo Giunchedi: [C: 031] diamond: remove nagios collector [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:06:38] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 57 probes of 324 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:09:55] (03CR) 10Muehlenhoff: [C: 04-1] "The server names are wrong." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244) (owner: 10Andrew Bogott) [08:10:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:16:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 14 probes of 324 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:17:17] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:17:59] (03CR) 10Muehlenhoff: [C: 04-1] "The ntp profile configures $our_network_acls to the production networks, you're missing the 172* network for eqiad1-r. 
For a WMCS-specific" [puppet] - 10https://gerrit.wikimedia.org/r/470445 (https://phabricator.wikimedia.org/T208244) (owner: 10Andrew Bogott) [08:18:57] (03CR) 10Filippo Giunchedi: "Looks like a good start to me, see inline" (034 comments) [debs/statsd-proxy] (wmf_v0.0.10) - 10https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [08:24:41] !log starting to delete moved to s5, s3 wikis T184805 [08:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:45] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [08:32:21] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @ayounsi gave me a better link to check interface exceptions... [08:33:06] (03CR) 10Muehlenhoff: "New host list no longer contains 127.0.0.1 (but not sure if that's an issue in practice or not)" [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [08:34:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 64, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:35:27] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:35:48] ACKNOWLEDGEMENT - Device not healthy -SMART- on labsdb1005 is CRITICAL: cluster=mysql device=megaraid,8 instance=labsdb1005:9100 job=node site=eqiad Banyek predictive failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1005&var-datasource=eqiad%2520prometheus%252Fops [08:36:55] (03CR) 10Thiemo Kreuz 
(WMDE): [C: 031] Order list of extensions by alphabet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455188 (owner: 10Aklapper) [08:37:13] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) @Krinkle, @aaron - I am wondering if https://gerrit.wikimedi... [08:37:17] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad Banyek predictive failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [08:55:06] (03PS4) 10Giuseppe Lavagetto: httpd: add httpd::env [puppet] - 10https://gerrit.wikimedia.org/r/470347 [08:57:01] (03CR) 10Giuseppe Lavagetto: [C: 032] httpd: add httpd::env [puppet] - 10https://gerrit.wikimedia.org/r/470347 (owner: 10Giuseppe Lavagetto) [08:57:44] (03CR) 10Gehel: [C: 031] "LGTM, waiting for volans to give his feedback before merging." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [09:19:30] (03PS8) 10Giuseppe Lavagetto: mediawiki: add httpd class, alternative to mediawiki::web [puppet] - 10https://gerrit.wikimedia.org/r/467643 [09:19:31] (03PS15) 10Giuseppe Lavagetto: mediawiki::webserver: introduce profile, use it on mwdebug* [puppet] - 10https://gerrit.wikimedia.org/r/467644 [09:24:51] !log installing paramiko security updates [09:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:12] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Banyek) All those wikis sanitarium set up on db2094 too [09:28:36] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) > All those wikis sanitarium set up on db2094 too Thanks, I can see the triggers there now. Did you run the check script successfully, too? [09:29:44] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Banyek) yes, of course - I didn't saved the output. 
:( [09:32:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 66, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:32:37] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:33:02] my deletion process, as logged, is reaching high traffic databases, please report any issue or mediawiki errors you find [09:34:12] we are removing close to 10 TB of data [09:36:06] (03PS3) 10Ema: check_vcl_reload: no unknowns if reload-vcl still has to run [puppet] - 10https://gerrit.wikimedia.org/r/470353 (https://phabricator.wikimedia.org/T206950) [09:37:58] (03CR) 10Ema: [C: 032] check_vcl_reload: no unknowns if reload-vcl still has to run [puppet] - 10https://gerrit.wikimedia.org/r/470353 (https://phabricator.wikimedia.org/T206950) (owner: 10Ema) [09:40:06] 10Operations, 10ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T208245 (10fgiunchedi) [09:40:09] 10Operations, 10ops-codfw: Degraded RAID on ms-be2021 - https://phabricator.wikimedia.org/T208096 (10fgiunchedi) [09:40:34] !log installing libmspack security updates [09:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:36] (03PS1) 10Filippo Giunchedi: lower TTL for graphite CNAMEs before failover [dns] - 10https://gerrit.wikimedia.org/r/470557 (https://phabricator.wikimedia.org/T196484) [09:44:08] (03CR) 10Filippo Giunchedi: [C: 032] lower TTL for graphite CNAMEs before failover [dns] - 10https://gerrit.wikimedia.org/r/470557 (https://phabricator.wikimedia.org/T196484) (owner: 10Filippo Giunchedi) [09:46:28] (03PS2) 10Filippo Giunchedi: update graphite-in to use graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/470410 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) 
[09:52:39] !log installing mysql-5.5 security updates on trusty/jessie (only clients as packaged in Debian/Ubuntu)
[09:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:04] jouncebot: now
[09:55:05] No deployments scheduled for the next 1 hour(s) and 4 minute(s)
[09:55:07] jouncebot: next
[09:55:08] In 1 hour(s) and 4 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1100)
[09:57:58] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational
[10:01:32] that was me ^ the session-.scope unit getting stuck
[10:07:31] !log installing exiv2 security updates
[10:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:55] Operations, Certcentral, Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (Vgutierrez) Let's Encrypt intentionally backdates the issued certificates 1 hour. ```name=cercentral logs Oct 30 10:02:36 certcentral1001 cert...
[10:39:49] (CR) Filippo Giunchedi: rsyslog: add prometheus-rsyslog-exporter support (3 comments) [puppet] - https://gerrit.wikimedia.org/r/470345 (https://phabricator.wikimedia.org/T205862) (owner: Filippo Giunchedi)
[10:40:51] !log installing ghostscript security updates
[10:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1100).
[11:00:05] No GERRIT patches in the queue for this window AFAICS.
[11:01:03] o/
[11:04:09] Hi hashar - Would you have a minute to help me troubleshoot a jenkins issue?
[11:07:54] joal: yes!
[11:08:43] Thanks hashar - I have experienced failing builds (https://integration.wikimedia.org/ci/job/analytics-refinery-release/140/console) but local mvn clean package works :( [11:13:39] (03PS1) 10Muehlenhoff: Script to generate service principals/keytabs (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/470566 [11:15:37] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 7 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) [11:15:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:17:13] joal: Error: Could not find or load main class org.apache.maven.surefire.booter.ForkedBooter [11:17:28] hmm [11:17:36] oh that job runs on some old slave [11:18:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:18:08] hashar: Anything we should do about that? [11:19:00] joal: moritzm told me about it. 
Seems to be related to java update https://docs.oracle.com/cd/E15289_01/JRRLN/newchanged.htm [11:19:01] the jenkins instance that job runs automatically upgrade packages over night [11:19:21] gotta check [11:19:34] Start-Date: 2018-10-26 06:41:09 [11:19:34] Commandline: /usr/bin/unattended-upgrade [11:19:35] Upgrade: openjdk-8-jre-headless:amd64 (8u171-b11-1~bpo8+1, 8u181-b13-2), openjdk-8-jdk:amd64 (8u171-b11-1~bpo8+1, 8u181-b13-2), openjdk-8-jre:amd64 (8u171-b11-1~bpo8+1, 8u181-b13-2), openjdk-8-jdk-headless:amd64 (8u171-b11-1~bpo8+1, 8u181-b13-2) [11:19:36] yeah, we either set jdk.net.URLClassPath.disableClassPathURLCheck=true or deploy a fixed version of Surefire [11:19:43] (not sure if that exists yet) [11:20:02] and that might affects production later on [11:21:18] PROBLEM - Ensure trafficserver_exporter is running on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/trafficserver_exporter [11:21:57] joal: if you are curious, the jenkins job definition is at https://github.com/wikimedia/integration-config/blob/master/jjb/analytics.yaml#L37-L50 [11:22:10] joal: seems Surefire needs to be pinned? [11:24:35] the trafficserver_exporter alert on cp1071 is me testing the new upstream, sorry for the spam! [11:24:44] hashar: by pinned, you mean version pinned in a pom for instance? [11:25:33] hashar: would it be easier to use the setting moritzm passed on? [11:26:07] joal: version pinning, I have no idea how that works :-( [11:26:07] zeljkof: no swat patches so I'm going to use this time to merge a beta config change or 2 [11:26:30] joal: but yeah maybe for that job we can set the system settings. 
Not sure whether the job supports system properties but it should [11:27:04] addshore: ok [11:27:18] ACKNOWLEDGEMENT - Ensure trafficserver_exporter is running on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/python3 /usr/bin/trafficserver_exporter Ema Testing prometheus-trafficserver-exporter 0.2.0-1, binary renamed [11:27:28] we can only ping stuff that's fixed :-) let me check whether that is addressed in Surefire upstream [11:27:28] sorry for the mess hashar :( [11:27:57] joal: I am used to those madness. Don't worry. Maybe try to update Surefire / pin it somehow to a version that is not affected [11:28:17] joal: I gotta lunch, but once I am back I can look at adding the jdk.net.URLClassPath.disableClassPathURLCheck=true system property to that job [11:29:27] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:30:05] ok hasharLunch - Will try to see if a new version of Surefire is available [11:31:52] (03CR) 10Faidon Liambotis: "(As we're discussing in the task, I don't think we should be using hosts in prod, i.e {lab,cloud}services for this, but rather VPS instanc" [puppet] - 10https://gerrit.wikimedia.org/r/470446 (https://phabricator.wikimedia.org/T208244) (owner: 10Andrew Bogott) [11:31:59] (03PS1) 10Addshore: BETA, dont use WB_NS_PROPERTY in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470570 [11:32:03] joal: not seeing anything in their git repo [11:32:05] (03PS2) 10Addshore: BETA, dont use WB_NS_PROPERTY in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470570 [11:33:41] (03PS3) 10Addshore: BETA, dont use WB_NS_PROPERTY in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470570 (https://phabricator.wikimedia.org/T208306) [11:33:57] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: 
OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:35:09] 10Operations, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10ema) [11:35:12] 10Operations, 10Traffic, 10monitoring, 10Patch-For-Review: Icinga: check_confd_vcl_reload unknown when file is missing - https://phabricator.wikimedia.org/T206950 (10ema) 05Open>03Resolved a:03ema Fixed: https://gerrit.wikimedia.org/r/470353 [11:35:29] Thanks moritzm for checking [11:35:47] moritzm: I guess it'll need to be fixed through system-prop [11:37:19] I'm not seeing anything reported upstream either: https://issues.apache.org/jira/browse/SUREFIRE-1587?jql=project%20%3D%20SUREFIRE%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC%2C%20updated%20DESC [11:37:32] (03PS1) 10Ema: prometheus-trafficserver-exporter: executable renamed [puppet] - 10https://gerrit.wikimedia.org/r/470573 (https://phabricator.wikimedia.org/T204232) [11:38:40] (03CR) 10Addshore: [C: 032] BETA, dont use WB_NS_PROPERTY in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470570 (https://phabricator.wikimedia.org/T208306) (owner: 10Addshore) [11:39:47] (03Merged) 10jenkins-bot: BETA, dont use WB_NS_PROPERTY in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470570 (https://phabricator.wikimedia.org/T208306) (owner: 10Addshore) [11:41:23] !log addshore@deploy1001 Synchronized wmf-config: 2x beta only patches, 66b8d5b70a16e51d 8fd16149b1808a58c (duration: 00m 53s) [11:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:16] (03CR) 10jenkins-bot: BETA, dont use WB_NS_PROPERTY in IS-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470570 (https://phabricator.wikimedia.org/T208306) (owner: 10Addshore) [11:46:29] zeljkof: is it okay if I do a SWAT deploy now? 
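The workaround discussed above (setting jdk.net.URLClassPath.disableClassPathURLCheck=true for the failing Surefire run) could be applied through the build environment. This is a hedged sketch, not the fix that was actually deployed: it assumes the job can be given extra environment variables, and it uses JAVA_TOOL_OPTIONS, which JVMs read at startup, so that the forked test JVM would also see the property. The helper name `with_classpath_check_disabled` is hypothetical.

```python
import os
from typing import Dict, Mapping

# Property name quoted in the log; everything else here is an assumption.
WORKAROUND_FLAG = "-Djdk.net.URLClassPath.disableClassPathURLCheck=true"


def with_classpath_check_disabled(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env whose JAVA_TOOL_OPTIONS carries the workaround flag."""
    new_env = dict(env)
    options = new_env.get("JAVA_TOOL_OPTIONS", "").split()
    if WORKAROUND_FLAG not in options:
        # Keep any options that were already set and append the flag once.
        options.append(WORKAROUND_FLAG)
    new_env["JAVA_TOOL_OPTIONS"] = " ".join(options)
    return new_env


if __name__ == "__main__":
    build_env = with_classpath_check_disabled(os.environ)
    print(build_env["JAVA_TOOL_OPTIONS"])
    # A build would then be launched with this environment, for example:
    # subprocess.run(["mvn", "clean", "package"], env=build_env, check=True)
```

Whether the Jenkins job in question accepts such a setting is left open in the conversation ("Not sure whether the job supports system properties"), so this only illustrates the shape of the change.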
[11:46:55] Amir1: addshore is doing something [11:47:01] fine by me if he's done [11:49:44] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10faidon) Netbox does have a piece of functionality called [[ https://netbox.readthedocs.io/en/latest/core-functionality/secrets/ | "secrets" ]], but we're not curre... [11:49:51] (03PS1) 10Addshore: Wikibase.php, wrap $wmgWikibaseClientPropertyOrderUrl use in condition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470575 [11:49:59] Amir1: this is my last one to go out ^^ [11:50:12] (03CR) 10Addshore: [C: 032] Wikibase.php, wrap $wmgWikibaseClientPropertyOrderUrl use in condition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470575 (owner: 10Addshore) [11:51:14] (03Merged) 10jenkins-bot: Wikibase.php, wrap $wmgWikibaseClientPropertyOrderUrl use in condition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470575 (owner: 10Addshore) [11:51:50] syncing [11:52:26] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase.php, wrap $wmgWikibaseClientPropertyOrderUrl use in condition (duration: 00m 46s) [11:52:27] Amir1: it is all yours! 
[11:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:08] addshore: thanks [11:53:28] PROBLEM - HHVM rendering on mw2262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:54:28] RECOVERY - HHVM rendering on mw2262 is OK: HTTP OK: HTTP/1.1 200 OK - 73824 bytes in 0.334 second response time [11:54:28] (03PS2) 10Ladsgroup: Changing the language of votewiki to Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470526 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [11:55:20] (03CR) 10Alex Monk: "T137160" [puppet] - 10https://gerrit.wikimedia.org/r/293057 (owner: 10Faidon Liambotis) [11:55:32] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470526 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [11:55:35] (03PS1) 10Addshore: commonswiki, Wikibase, clientDbList as empty array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470576 [11:56:23] Amir1: I actually have another one to push out after youu! [11:56:31] (03Merged) 10jenkins-bot: Changing the language of votewiki to Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470526 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [11:56:37] oh okay, I'll be quick [11:56:42] (03PS2) 10Addshore: commonswiki, Wikibase, clientDbList as empty array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470576 [11:56:45] thanks :) [11:56:46] jouncebot: next [11:56:47] In 0 hour(s) and 3 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1200) [11:58:05] works fine in mwdebug, moving everywhere [11:59:26] (03CR) 10Volans: [C: 04-1] "The structure looks solid, most of my comments are [nitpicks] but few are [important] things to fix/bugs, see inline." 
(0333 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [11:59:52] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:470526|Changing the language of votewiki to Persian (fa) (T207560)]] (duration: 00m 48s) [11:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:56] T207560: Set up VoteWiki for the 2018 fawiki elections - https://phabricator.wikimedia.org/T207560 [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1200) [12:00:13] I'm done, addshore it's yours [12:00:19] Amir1: thanks [12:00:44] (03CR) 10Addshore: [C: 032] commonswiki, Wikibase, clientDbList as empty array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470576 (owner: 10Addshore) [12:01:40] (03Merged) 10jenkins-bot: commonswiki, Wikibase, clientDbList as empty array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470576 (owner: 10Addshore) [12:01:42] !log finishing deleting moved to s5, s3 wikis T184805 [12:01:44] (03CR) 10jenkins-bot: Wikibase.php, wrap $wmgWikibaseClientPropertyOrderUrl use in condition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470575 (owner: 10Addshore) [12:01:46] (03CR) 10jenkins-bot: Changing the language of votewiki to Persian (fa) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470526 (https://phabricator.wikimedia.org/T207560) (owner: 10Huji) [12:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:48] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [12:01:50] (03CR) 10jenkins-bot: commonswiki, Wikibase, clientDbList as empty array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470576 (owner: 10Addshore) [12:02:14] 10Operations, 10Certcentral, 10Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - 
https://phabricator.wikimedia.org/T204997 (10BBlack) So, with regard to the potential staging delays in this and T207295 , the reason they're not urgent or required for conversion of the... [12:02:55] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: commonswiki, Wikibase, clientDbList as empty array (duration: 00m 47s) [12:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:18] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) In theory the drops finished, but I need to do an additonal pass to check for missing hosts/dbs as well as check/remove filters. [12:06:40] !log removing s3 replication filters on labsdb1009/10/11 T184805 [12:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:14] !log removing s3 replication filters on dbstore1002 T184805 [12:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:18] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [12:10:26] (03PS1) 10BBlack: Drop support for *.zero.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/470577 [12:10:31] (03PS1) 10Addshore: Wikibase.php, check for 2 more wmg vars before using them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470578 [12:10:35] where should I contact if I need ops for private matters? 
[12:10:44] (not a serious stuff but) [12:10:56] revi: if private == privacy, security [12:10:58] serious(ly important) [12:11:05] security at wikimedia .org [12:11:22] or you can file a security ticket, too [12:11:27] (03CR) 10jerkins-bot: [V: 04-1] Wikibase.php, check for 2 more wmg vars before using them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470578 (owner: 10Addshore) [12:11:32] (03PS2) 10BBlack: interface::rps: strict single CPU core per queue [puppet] - 10https://gerrit.wikimedia.org/r/468313 [12:11:34] (03PS4) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [12:11:36] (03PS4) 10BBlack: graphite: add interface::rps settings to graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [12:11:38] (03PS1) 10BBlack: Drop support for *.zero.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/470579 [12:11:39] well, just ops in non-public way [12:11:39] if it is not urgent/important [12:11:42] https://phabricator.wikimedia.org/maniphest/task/edit/form/2/ [12:11:52] oops, forgot I had other patches queued in my branch! [12:12:17] (03PS2) 10Addshore: Wikibase.php, check for 2 more wmg vars before using them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470578 [12:12:23] :) [12:12:24] (03PS2) 10BBlack: Drop support for *.zero.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/470579 [12:13:33] !log start memkeys on mc1035 to periodically dump the status of the most used keys - memkeys will use a bit of resources, please stop it if needed (root tmux) - T203786 [12:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:37] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [12:14:40] * addshore has one last one... 
[12:14:50] (03CR) 10Addshore: [C: 032] Wikibase.php, check for 2 more wmg vars before using them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470578 (owner: 10Addshore) [12:15:09] (03PS1) 10BBlack: Remove langlist import where not needed [dns] - 10https://gerrit.wikimedia.org/r/470581 [12:15:55] (03Merged) 10jenkins-bot: Wikibase.php, check for 2 more wmg vars before using them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470578 (owner: 10Addshore) [12:16:29] (03CR) 10jenkins-bot: Wikibase.php, check for 2 more wmg vars before using them [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470578 (owner: 10Addshore) [12:16:37] revi: it depends on the nature of your request I guess. You could also send a private message here on IRC to get further clarification! [12:16:56] ppl: see -staff [12:17:28] !log addshore@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase.php, check for 2 more wmg vars before using them (duration: 00m 47s) [12:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:31] (03CR) 10BBlack: [C: 032] "Confirmed that this is a legacy thing, and not necessary for remaining wind-down time with remaining partners. 
We're pulling it now so we" [dns] - 10https://gerrit.wikimedia.org/r/470577 (owner: 10BBlack) [12:20:04] (03CR) 10BBlack: [C: 032] Remove langlist import where not needed [dns] - 10https://gerrit.wikimedia.org/r/470581 (owner: 10BBlack) [12:21:37] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [12:22:47] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:28:06] (03PS3) 10BBlack: Drop support for *.zero.wikipedia.org in SAN check [puppet] - 10https://gerrit.wikimedia.org/r/470579 [12:29:36] (03CR) 10BBlack: [C: 032] Drop support for *.zero.wikipedia.org in SAN check [puppet] - 10https://gerrit.wikimedia.org/r/470579 (owner: 10BBlack) [12:29:44] (03PS4) 10BBlack: Drop support for *.zero.wikipedia.org in SAN check [puppet] - 10https://gerrit.wikimedia.org/r/470579 [12:37:07] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) No filters left that I can see: ``` ./software/dbtools/section s5 | while read host; do echo $host; mysql.py -h $host -e "SHOW ALL SLAVES STATUS\... [12:39:13] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10jcrespo) [12:39:20] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10jcrespo) 05Open>03Resolved Everthing at T184805#4654953 done, except the GTID handling, which has to be checked separately for other reasons. 
[12:41:51] (03PS3) 10BBlack: interface::rps: strict single CPU core per queue [puppet] - 10https://gerrit.wikimedia.org/r/468313 [12:41:53] (03PS5) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [12:41:55] (03PS5) 10BBlack: graphite: add interface::rps settings to graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [12:41:57] (03PS1) 10BBlack: drop SAN check for *.m.wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/470583 [12:42:15] heh again [12:42:18] (03PS4) 10BBlack: interface::rps: strict single CPU core per queue [puppet] - 10https://gerrit.wikimedia.org/r/468313 [12:42:20] (03PS6) 10BBlack: interface::rps: always be NUMA aware [puppet] - 10https://gerrit.wikimedia.org/r/467469 [12:42:22] I need to clear out my production branch! :) [12:42:22] (03PS6) 10BBlack: graphite: add interface::rps settings to graphite hosts [puppet] - 10https://gerrit.wikimedia.org/r/468388 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [12:42:24] (03PS2) 10BBlack: drop SAN check for *.m.wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/470583 [12:43:09] (03PS3) 10BBlack: drop SAN check for *.m.wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/470583 [12:45:12] (03PS1) 10Dereckson: Find bash in environment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470585 [12:45:30] jouncebot: next [12:45:30] In 0 hour(s) and 14 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1300) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1300) [13:10:14] (03PS1) 10Addshore: Revert "Wikibase, move namespace config to IS.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470591 [13:10:37] (03CR) 10jerkins-bot: [V: 04-1] Revert "Wikibase, move 
namespace config to IS.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470591 (owner: 10Addshore) [13:14:22] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, not related to this review but I thought the check_procs checks would be covered by the generic systemd alert when units fail?" [puppet] - 10https://gerrit.wikimedia.org/r/470573 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [13:16:25] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (10fgiunchedi) Somewhat related, Grafana upstream has this issue for feedback on dashboard provisioning workflows https://github.com/grafana/grafana/... [13:18:22] (03PS1) 10Fdans: Add change_tag to list of tables to sqoop [puppet] - 10https://gerrit.wikimedia.org/r/470593 (https://phabricator.wikimedia.org/T205940) [13:21:54] (03PS3) 10Filippo Giunchedi: rsyslog: add prometheus-rsyslog-exporter support [puppet] - 10https://gerrit.wikimedia.org/r/470345 (https://phabricator.wikimedia.org/T205862) [13:22:27] jouncebot: next [13:22:27] In 2 hour(s) and 37 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1600) [13:22:47] FYI I'm going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/470382 shortly, no impact expected [13:27:40] !log temporarily disable puppet in eqiad to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/470382 [13:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] (03PS2) 10Filippo Giunchedi: hieradata: enable syslog-tls in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/470382 (https://phabricator.wikimedia.org/T136312) [13:28:14] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable syslog-tls in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/470382 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [13:31:11] !log 
prometheus-trafficserver-exporter 0.2.0-1 uploaded to stretch-wikimedia T204232 [13:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:14] T204232: Package and deploy ATS v8.x - https://phabricator.wikimedia.org/T204232 [13:35:26] (03PS1) 10Addshore: Wikibase IS.php use 120 for property namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470595 [13:36:38] (03CR) 10Addshore: [C: 032] Wikibase IS.php use 120 for property namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470595 (owner: 10Addshore) [13:36:45] ping hashar - any news on the jenkins side? [13:36:48] going to push ^^ out of the door (should be a noop) [13:37:46] (03Merged) 10jenkins-bot: Wikibase IS.php use 120 for property namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470595 (owner: 10Addshore) [13:41:43] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Wikibase IS.php use 120 for property namespace (duration: 00m 47s) [13:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:12] (03PS1) 10Andrew Bogott: horizon: enable eqiad1 for puppet-diffs, fastcci, cyberbot [puppet] - 10https://gerrit.wikimedia.org/r/470596 (https://phabricator.wikimedia.org/T204745) [13:46:17] (03CR) 10jenkins-bot: Wikibase IS.php use 120 for property namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470595 (owner: 10Addshore) [13:52:24] (03PS1) 10Addshore: Add item & property to wmgWikibaseClientRepoNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470599 (https://phabricator.wikimedia.org/T208293) [13:53:01] * addshore has another config issue to fix .... 
[13:53:06] (03CR) 10Andrew Bogott: [C: 032] horizon: enable eqiad1 for puppet-diffs, fastcci, cyberbot [puppet] - 10https://gerrit.wikimedia.org/r/470596 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [13:53:22] (03CR) 10Addshore: [C: 032] Add item & property to wmgWikibaseClientRepoNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470599 (https://phabricator.wikimedia.org/T208293) (owner: 10Addshore) [13:53:26] (03CR) 10Gehel: [C: 031] "Thanks for the great (long?) review! Some answers of my own." (0313 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [13:53:35] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:53:58] ah! [13:54:23] !log reenable puppet in eqiad [13:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:46] (03Merged) 10jenkins-bot: Add item & property to wmgWikibaseClientRepoNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470599 (https://phabricator.wikimedia.org/T208293) (owner: 10Addshore) [13:55:54] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:56:05] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add item & property to wmgWikibaseClientRepoNamespaces for wiktionaries T208293 (duration: 00m 48s) [13:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:09] T208293: The interwikis are not displayed anymore on the French Wiktionary - https://phabricator.wikimedia.org/T208293 [13:58:29] (03PS2) 10Ema: 
prometheus-trafficserver-exporter: executable renamed [puppet] - 10https://gerrit.wikimedia.org/r/470573 (https://phabricator.wikimedia.org/T204232) [13:59:21] (03CR) 10Ema: [C: 032] prometheus-trafficserver-exporter: executable renamed [puppet] - 10https://gerrit.wikimedia.org/r/470573 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [14:01:07] (03CR) 10jenkins-bot: Add item & property to wmgWikibaseClientRepoNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470599 (https://phabricator.wikimedia.org/T208293) (owner: 10Addshore) [14:01:32] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) Here is the full list of hosts in that row. No outages expected, but brief (5s) connectivity interruption for some racks is possible. CCing services owners, to know if it's an acceptab... [14:07:25] RECOVERY - Ensure trafficserver_exporter is running on cp1071 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter [14:08:26] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) Everything at once sounds good to me as well. Should we test/confirm connectivity from various other projects to the new smarthosts on tcp/25... 
[14:09:10] (03PS1) 10Addshore: Add entry to wmgWikibaseClientEntityNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470600 (https://phabricator.wikimedia.org/T208293) [14:10:26] (03CR) 10Addshore: [C: 032] Add entry to wmgWikibaseClientEntityNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470600 (https://phabricator.wikimedia.org/T208293) (owner: 10Addshore) [14:11:29] (03Merged) 10jenkins-bot: Add entry to wmgWikibaseClientEntityNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470600 (https://phabricator.wikimedia.org/T208293) (owner: 10Addshore) [14:12:40] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add entry to wmgWikibaseClientEntityNamespaces for wiktionaries T208293 (duration: 00m 47s) [14:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:44] T208293: The interwikis are not displayed anymore on the French Wiktionary - https://phabricator.wikimedia.org/T208293 [14:15:08] (03CR) 10jenkins-bot: Add entry to wmgWikibaseClientEntityNamespaces for wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470600 (https://phabricator.wikimedia.org/T208293) (owner: 10Addshore) [14:19:22] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10jcrespo) > Thursday No problem on my side, a short network outage is not a huge issue on codfw for dbs, but I cannot guarantee they will not page, and I won't be around to attend it- someone e... [14:23:56] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10fgiunchedi) >>! In T208272#4706141, @ayounsi wrote: > CCing services owners, to know if it's an acceptable risk and if it can be mitigated by depooling services. Short interruptions are ok wit... 
[14:28:20] (03CR) 10Prtksxna: [C: 031] create bienvenida.wikimedia.org for Mexico awareness campaign [dns] - 10https://gerrit.wikimedia.org/r/470531 (https://phabricator.wikimedia.org/T207816) (owner: 10Dzahn) [14:30:00] 10Operations, 10New-Readers, 10Patch-For-Review: Create URL for Mexico Awareness Campaign - https://phabricator.wikimedia.org/T207816 (10Prtksxna) >>! In T207816#4704031, @Dzahn wrote: > How about bienvenida.wikimedia.org ? This should be fine, as @atgo points out people will just be clicking through. ***... [14:32:00] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review, 10User-herron: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) >>! In T41785#4706148, @herron wrote: > Everything at once sounds good to me as well. Should we test/confirm connectivity from various other... [14:35:17] (03CR) 10Joal: "Current change looks good, but shouldn't we add the corresponding addition about data deletion?" [puppet] - 10https://gerrit.wikimedia.org/r/470593 (https://phabricator.wikimedia.org/T205940) (owner: 10Fdans) [14:36:33] (03PS3) 10Gehel: tlsproxy::localssl: allow mutliple proxies with the same certificate [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) [14:36:35] (03PS16) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [14:38:27] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10Papaul) [14:38:36] 10Operations, 10ops-eqiad, 10DBA: db1117 went away - https://phabricator.wikimedia.org/T208150 (10jcrespo) There is nothing else left for DBAs here except waiting for errors. 
[14:38:44] (03PS17) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [14:38:57] 10Operations, 10ops-eqiad, 10DBA: db1117 went away - https://phabricator.wikimedia.org/T208150 (10jcrespo) p:05High>03Normal [14:41:13] (03PS18) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) [14:41:15] (03PS1) 10Gehel: elasticsearch: all tls proxies for elasticsearch share the same cert [puppet] - 10https://gerrit.wikimedia.org/r/470606 [14:41:53] 10Operations, 10Epic: Encrypt all the things - https://phabricator.wikimedia.org/T111653 (10fgiunchedi) [14:42:00] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312 (10fgiunchedi) 05Open>03Resolved This has been deployed today, we're running TLS connections everywhere in the fleet. Exceptions being hosts that had puppet disabled whe... 
[14:42:06] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312 (10fgiunchedi) [14:43:05] 10Operations, 10DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Banyek) [14:50:21] (03PS3) 10Ema: ATS: check HTTP responses from prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/469875 (https://phabricator.wikimedia.org/T204232) [14:50:33] (03Abandoned) 10Addshore: Revert "Wikibase, move namespace config to IS.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470591 (owner: 10Addshore) [14:51:56] (03CR) 10Ema: [C: 032] ATS: check HTTP responses from prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/469875 (https://phabricator.wikimedia.org/T204232) (owner: 10Ema) [14:54:46] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10MoritzMuehlenhoff) The contact data with the phone numbers of our data centre reps contains data like phone numbers with phone extension code, that seems in fact l... [14:56:38] !log gradually upgrade rsyslog to 8.38 on jessie hosts - T206633 [14:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:42] T206633: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 [14:57:14] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:58:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:59:50] (03CR) 10Gehel: "tested on deployment-prep (multi instance on deployment-elastic05) and seems to work just fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:02:16] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Banyek) [15:02:45] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Banyek) p:05Triage>03Low [15:07:43] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2050 is CRITICAL: cluster=mysql device=cciss,6 instance=db2050:9100 job=node site=codfw Banyek T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2050&var-datasource=codfw%2520prometheus%252Fops [15:09:08] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[rsyslog-gnutls] [15:10:40] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2061 is CRITICAL: cluster=mysql device=cciss,1 instance=db2061:9100 job=node site=codfw Banyek T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2061&var-datasource=codfw%2520prometheus%252Fops [15:12:52] phab2001 is me, taking a look [15:13:33] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad Banyek T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [15:14:08] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:14:58] 10Operations, 10WMDE-Analytics-Engineering: Regularly & Automatically backup WMDE metrics stored in graphite - https://phabricator.wikimedia.org/T125408 (10Addshore) Tagging with operations so that we can try to get an answer. 
[15:15:16] (03PS3) 10Andrew Bogott: nova-api: Allow everyone to view the hypervisor for a given VM [puppet] - 10https://gerrit.wikimedia.org/r/470540 (https://phabricator.wikimedia.org/T208099) [15:16:02] (03CR) 10Andrew Bogott: [C: 032] nova-api: Allow everyone to view the hypervisor for a given VM [puppet] - 10https://gerrit.wikimedia.org/r/470540 (https://phabricator.wikimedia.org/T208099) (owner: 10Andrew Bogott) [15:16:33] !log Running migrateImageCommentTemp.php on test wikis and mediawikiwiki for T188132 [15:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:37] T188132: Merge image_comment_temp table into the image table - https://phabricator.wikimedia.org/T188132 [15:17:24] !log Running migrateComments.php on test wikis and mediawikiwiki for T166733 [15:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:32] T166733: Deploy refactored comment storage - https://phabricator.wikimedia.org/T166733 [15:22:02] (03PS1) 10Muehlenhoff: Remove obsolete rsync::repo [puppet] - 10https://gerrit.wikimedia.org/r/470611 [15:27:26] (03PS9) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:28:11] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:28:50] (03PS1) 10Muehlenhoff: Switch rsync::quickdatacopy to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/470612 [15:30:25] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:32:12] (03PS10) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 
(https://phabricator.wikimedia.org/T207253) [15:32:34] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:32:45] PROBLEM - Check systemd state on ms-be1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:33:09] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:33:33] jouncebot: now [15:33:33] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [15:34:25] (03CR) 10Jforrester: [C: 032] [Beta Cluster] Enable wgMediaInfoEnable on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [15:34:44] (03CR) 10Ema: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:36:27] (03PS1) 10Effie Mouzeli: deployment-prep: added hieradata for deployment-rd3 host [puppet] - 10https://gerrit.wikimedia.org/r/470615 (https://phabricator.wikimedia.org/T206450) [15:37:32] (03PS8) 10Jforrester: [Beta Cluster] Enable wgMediaInfoEnable on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) [15:37:40] (03CR) 10Jforrester: [C: 032] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [15:38:33] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) [15:39:21] (03Merged) 10jenkins-bot: [Beta Cluster] Enable wgMediaInfoEnable on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 
10Jforrester) [15:40:10] (03PS11) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:40:28] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10Papaul) [15:40:33] (03CR) 10jenkins-bot: [Beta Cluster] Enable wgMediaInfoEnable on Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [15:40:51] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:40:56] (03PS1) 10Alex Monk: horizon policy: Allow non-cloudadmins to view external server attributes [puppet] - 10https://gerrit.wikimedia.org/r/470617 [15:41:21] (03CR) 10Alex Monk: "entirely untested" [puppet] - 10https://gerrit.wikimedia.org/r/470617 (owner: 10Alex Monk) [15:41:23] (03CR) 10jerkins-bot: [V: 04-1] horizon policy: Allow non-cloudadmins to view external server attributes [puppet] - 10https://gerrit.wikimedia.org/r/470617 (owner: 10Alex Monk) [15:42:08] (03CR) 10BBlack: [C: 032] drop SAN check for *.m.wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/470583 (owner: 10BBlack) [15:42:17] (03PS4) 10BBlack: drop SAN check for *.m.wikimediafoundation.org [puppet] - 10https://gerrit.wikimedia.org/r/470583 [15:42:44] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:43:36] (03PS4) 10Gehel: tlsproxy::localssl: allow mutliple proxies with the same certificate [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) [15:43:58] (03CR) 10Volans: [C: 031] "LGTM as for the location of the 
file, for the content I'll leave it to you ;)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470615 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:44:54] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:45:06] (03PS2) 10Alex Monk: horizon policy: Allow non-cloudadmins to view external server attributes [puppet] - 10https://gerrit.wikimedia.org/r/470617 [15:45:10] (03CR) 10C. Scott Ananian: [C: 04-1] "Maybe clarify in the commit message (so it's accessible from `git log`) that this image is referenced by the system message for T207790 / " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) (owner: 10Niedzielski) [15:46:09] (03CR) 10Banyek: "- How to make sure the cron will run only on one of the hosts from the cluster management role?" 
[puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:47:54] (03CR) 10Gehel: [C: 032] tlsproxy::localssl: allow mutliple proxies with the same certificate [puppet] - 10https://gerrit.wikimedia.org/r/468320 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel) [15:49:05] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:50:13] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/13260/" [puppet] - 10https://gerrit.wikimedia.org/r/470612 (owner: 10Muehlenhoff) [15:50:41] (03PS2) 10Effie Mouzeli: deployment-prep: added hieradata for deployment-rd3 host [puppet] - 10https://gerrit.wikimedia.org/r/470615 (https://phabricator.wikimedia.org/T206450) [15:52:23] (03CR) 10Volans: [C: 04-1] "Replied to @gehel comments, thanks for the info." 
(033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [15:52:25] (03CR) 10Effie Mouzeli: deployment-prep: added hieradata for deployment-rd3 host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470615 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:55:14] (03CR) 10Effie Mouzeli: [C: 032] deployment-prep: added hieradata for deployment-rd3 host [puppet] - 10https://gerrit.wikimedia.org/r/470615 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [15:55:29] (03PS3) 10Effie Mouzeli: deployment-prep: added hieradata for deployment-rd3 host [puppet] - 10https://gerrit.wikimedia.org/r/470615 (https://phabricator.wikimedia.org/T206450) [15:56:34] (03PS12) 10Banyek: mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) [15:56:51] (03CR) 10Mathew.onipe: [C: 031] elasticsearch: all tls proxies for elasticsearch share the same cert [puppet] - 10https://gerrit.wikimedia.org/r/470606 (owner: 10Gehel) [15:57:15] Hey hashar - Is there any task about the problem I experienced earlier or shall I create one? [15:57:28] (03CR) 10jerkins-bot: [V: 04-1] mariadb: table checker for monitoring data drift [puppet] - 10https://gerrit.wikimedia.org/r/469889 (https://phabricator.wikimedia.org/T207253) (owner: 10Banyek) [15:58:41] (03CR) 10Gehel: [C: 04-1] elasticsearch_cluster: multi-cluster/multi-instance support (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [16:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS.
[16:04:17] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) [16:05:14] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10ayounsi) [16:07:40] !log reboot cp-ats hosts for L1TF kernel/microcode updates T203011 [16:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:37] !log contint1001: rm -fR /srv/zuul-debug-logs # old logs from May 2018 [16:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:53] 10Operations, 10DBA, 10Growth-Team, 10StructuredDiscussions, and 2 others: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10kostajh) > I will make sure to check we have fresh backups by then. @jcrespo wanted to check if there are fresh backups. @... [16:11:56] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10thcipriani) [16:11:59] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10thcipriani) [16:12:26] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:13:35] jouncebot: next [16:13:35] In 0 hour(s) and 46 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1700) [16:15:19] (03PS3) 10Cwhite: update graphite-in to use graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/470410 
(https://phabricator.wikimedia.org/T196484) [16:17:16] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:17:51] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10thcipriani) Added T207707 as a subtask since it is about get... [16:18:25] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1089 days) [16:19:22] (03PS1) 10Effie Mouzeli: deployment-prep: fixed suffix for deployment-rd3-cptest-master01 [puppet] - 10https://gerrit.wikimedia.org/r/470623 (https://phabricator.wikimedia.org/T206450) [16:19:58] (03CR) 10jerkins-bot: [V: 04-1] deployment-prep: fixed suffix for deployment-rd3-cptest-master01 [puppet] - 10https://gerrit.wikimedia.org/r/470623 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [16:20:47] the tls listener on lithium was me [16:22:40] (03CR) 10Cwhite: [C: 032] update graphite-in to use graphite1004 [dns] - 10https://gerrit.wikimedia.org/r/470410 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [16:22:59] (03PS2) 10Effie Mouzeli: deployment-prep: fixed suffix for deployment-rd3-cptest-master01 [puppet] - 10https://gerrit.wikimedia.org/r/470623 (https://phabricator.wikimedia.org/T206450) [16:24:45] (03PS1) 10Urbanecm: Throttle lift for Wikidata event at University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470624 (https://phabricator.wikimedia.org/T208236) [16:27:11] !log updated graphite-in cname to graphite1004 - T196484 [16:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:15] T196484: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 [16:30:18] 
10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) @jkim_wikimedia any luck? [16:31:12] (03PS1) 10Cwhite: Revert "update graphite-in to use graphite1004" [dns] - 10https://gerrit.wikimedia.org/r/470626 [16:31:14] (03CR) 10Cwhite: [C: 032] Revert "update graphite-in to use graphite1004" [dns] - 10https://gerrit.wikimedia.org/r/470626 (owner: 10Cwhite) [16:31:40] !log install carbon-c-relay 3.2-1 on graphite1004 [16:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:53] PROBLEM - Check systemd state on graphite1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:35:02] RECOVERY - Check systemd state on graphite1004 is OK: OK - running: The system is fully operational [16:36:48] (03CR) 10Effie Mouzeli: [C: 032] deployment-prep: fixed suffix for deployment-rd3-cptest-master01 [puppet] - 10https://gerrit.wikimedia.org/r/470623 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [16:39:52] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (10Glrx) Gnome closed issue 319 on 23 August 2018 with commit https://gitlab.gnome.org/GNOME/librsvg/commit/3d84acca9c11482cb0d2f75d379086be21... 
[16:41:58] (03PS4) 10EBernhardson: Collect prometheus metrics from mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/454644 [16:45:23] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10mobrovac) [16:45:37] !log Starting deployment of AQS using scap [16:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:49] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Services (watching), 10User-fgiunchedi: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10mobrovac) [16:46:48] !log mforns@deploy1001 Started deploy [analytics/aqs/deploy@3a1d937]: (no justification provided) [16:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:48] !log mforns@deploy1001 Finished deploy [analytics/aqs/deploy@3a1d937]: (no justification provided) (duration: 02m 00s) [16:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:10] (03CR) 10Gehel: [C: 032] Collect prometheus metrics from mjolnir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/454644 (owner: 10EBernhardson) [16:55:56] !log Finished deployment of AQS using scap [16:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:32] (03PS2) 10Gehel: elasticsearch: all tls proxies for elasticsearch share the same cert [puppet] - 10https://gerrit.wikimedia.org/r/470606 [17:00:07] cscott, arlolra, subbu, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1700). 
[17:00:18] (03CR) 10Gehel: [C: 032] elasticsearch: all tls proxies for elasticsearch share the same cert [puppet] - 10https://gerrit.wikimedia.org/r/470606 (owner: 10Gehel) [17:03:25] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [17:03:52] known ^ :( [17:03:55] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:04:16] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:04:25] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1&panelId=21&fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1&panelId=21&fullscreen [17:04:43] oops, the prometheus failure might be me, checking [17:05:16] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1031 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:05:25] PROBLEM - Check systemd state on elastic1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:05:56] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:06:15] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1040 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:06:16] PROBLEM - Elasticsearch HTTPS for relforge.svc.eqiad.wmnet on relforge1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:06:16] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:06:16] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1031 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:06:36] PROBLEM - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:07:06] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1035 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:07:11] ouch, elastic is me as well, missing nginx reload on last change, fixing... [17:07:16] RECOVERY - Elasticsearch HTTPS for relforge.svc.eqiad.wmnet on relforge1002 is OK: SSL OK - Certificate relforge.svc.eqiad.wmnet valid until 2023-08-22 09:43:00 +0000 (expires in 1756 days) [17:07:16] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1020 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:07:26] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:55] PROBLEM - puppet last run on elastic2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:56] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:08:06] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:08:15] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1040 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:08:35] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:46] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:55] PROBLEM - puppet last run on elastic2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:56] PROBLEM - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:10:06] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:10:16] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:10:46] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1029 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:10:49] godog: could you check that prometheus puppet failure? I'm dealing with the elastic one [17:11:25] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:11:26] PROBLEM - puppet last run on bast5001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:11:26] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:11:36] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:11:43] gehel: sorry I can't now, looking at graphite [17:11:56] PROBLEM - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:12:00] godog: np, I'll look in a moment [17:12:09] or if anyone else is around [17:12:15] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1035 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:12:25] PROBLEM - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:12:26] PROBLEM - puppet last run on elastic2030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:12:26] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:12:26] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1052 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:12:45] RECOVERY - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2020 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1756 days) [17:12:55] PROBLEM - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:12:56] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:12:56] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:13:05] RECOVERY - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2008 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1756 days) [17:13:13] recoveries coming [17:13:21] * gehel is breathing again [17:13:25] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:13:36] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:13:46] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:13:55] RECOVERY - puppet last run on elastic2029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:14:16] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:14:26] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1032 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:14:26] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:14:56] RECOVERY - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2026 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1756 days) [17:15:06] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:15:15] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:15:16] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:15:16] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:15:25] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[mjolnir-kafka-bulk-daemon] [17:15:46] PROBLEM - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1049 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:15:56] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1029 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:16:15] RECOVERY - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2014 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1756 days) [17:16:16] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1039 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:16:25] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1037 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:16:25] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:16:25] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1044 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:16:26] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1027 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:16:36] RECOVERY - Elasticsearch HTTPS for search.svc.codfw.wmnet on elastic2015 is OK: SSL OK - Certificate search.svc.codfw.wmnet valid until 2023-08-22 10:03:17 +0000 (expires in 1756 days) [17:16:36] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:16:55] RECOVERY - Elasticsearch HTTPS for search.svc.eqiad.wmnet on elastic1049 is OK: SSL OK - Certificate search.svc.eqiad.wmnet valid until 2023-08-22 10:28:57 +0000 (expires in 1756 days) [17:17:35] RECOVERY - puppet last run on elastic2030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:18:26] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:18:36] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:18:55] RECOVERY - Check systemd state on elastic1031 is OK: OK - running: The system is fully operational [17:19:26] PROBLEM - Check systemd state on elastic1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:20:44] (03PS1) 10Gehel: elasticsearch / prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/470637 [17:21:13] (03PS1) 10Hoo man: Add trwiktionary to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470638 (https://phabricator.wikimedia.org/T204419) [17:21:40] (03CR) 10Gehel: [C: 032] elasticsearch / prometheus: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/470637 (owner: 10Gehel) [17:21:50] ^that's the prometheus fix [17:23:13] !log starting branch cut for MediaWiki and extensions 1.33.0-wmf.2 [17:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:36] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[mjolnir-kafka-bulk-daemon] [17:25:38] (03PS1) 10Gehel: mjolnir: fix typo in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/470639 [17:26:35] (03CR) 10Gehel: [C: 032] mjolnir: fix typo in systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/470639 (owner: 10Gehel) [17:26:40] gehel: nice, thank you! btw puppet failed but no damage done since puppet didn't deploy the new config [17:26:56] i.e [17:26:57] Oct 30 17:00:05 prometheus1003 puppet-agent[18867]: Execution of '/usr/bin/promtool check-config /srv/prometheus/ops/prometheus.yml20181030-18867-1cmjmmg' returned 1: Checking /srv/prometheus/ops/prometheus.yml20181030-18867-1cmjmmg [17:27:01] Oct 30 17:00:05 prometheus1003 puppet-agent[18867]: FAILED: unknown fields in scrape_config: schema [17:27:36] godog: I'm the one who broke it, least I can do is fix it :( [17:29:06] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:29:16] PROBLEM - puppet last run on bast4002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/prometheus/ops/prometheus.yml] [17:29:26] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:30:47] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:31:09] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) Finally coming back to this. The exact same condition is true on cloudstore1009. Will check for the GET.... [17:36:12] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), and 3 others: Cleanup Wikidata Query Service logging configuration - https://phabricator.wikimedia.org/T207834 (10Gehel) new configuration deployed, but raising some deprecations, needs some tuning. [17:36:17] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:37:46] RECOVERY - Check systemd state on elastic1023 is OK: OK - running: The system is fully operational [17:38:44] 10Operations, 10ops-codfw, 10netops: codfw row C recable and add QFX - https://phabricator.wikimedia.org/T208272 (10elukey) About the C4 switch replacement: there are 4 mw hosts in codfw that are acting as proxies for mcrouter to replicate keys from eqiad to codfw: ``` elukey@mw1347:~$ cat /etc/mcrouter/con... 
[17:38:46] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:41:36] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:41:37] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:44:13] (03PS1) 10Tim Eulitz: Prepare AdvancedSearch go-live SWAT changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470642 (https://phabricator.wikimedia.org/T207638) [17:44:47] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:48:25] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) [17:48:56] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:49:26] RECOVERY - puppet last run on bast4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:58:34] (03PS2) 10Cwhite: remove graphite and carbon-relay cnames [dns] - 10https://gerrit.wikimedia.org/r/470626 [18:10:42] !log andrew@deploy1001 Started deploy [horizon/deploy@ce0b9b4]: Rolling out fix for T208099 [18:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:46] T208099: nova: can we expose the creator and virt host of VMs to the public? 
- https://phabricator.wikimedia.org/T208099 [18:14:01] !log andrew@deploy1001 Finished deploy [horizon/deploy@ce0b9b4]: Rolling out fix for T208099 (duration: 03m 19s) [18:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:20] 10Operations, 10SRE-Access-Requests, 10User-jijiki: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10Dzahn) re-added to LDAP group wmf [18:31:09] (03CR) 10Dzahn: [C: 04-1] "i should edit stretch-icinga.cfg and not icinga.cfg .. i guess.." [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [18:34:14] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@14dd09e]: (no justification provided) [18:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:40] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@14dd09e]: (no justification provided) (duration: 00m 26s) [18:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:36] (03CR) 10Dzahn: servermon: Add gunicorn.service systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [18:40:07] (03CR) 10Dzahn: "@paladox, see above, do you know what happens if these options are applied before it's 2.16 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [18:40:33] (03PS14) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [18:40:46] (03CR) 10Paladox: servermon: Add gunicorn.service systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [18:41:17] (03CR) 10jerkins-bot: [V: 04-1] servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [18:41:29] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@14dd09e]: adjust kafka bulk daemon timeouts [18:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:32] (03CR) 10Dzahn: [C: 031] "ready to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:41:34] (03PS15) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [18:42:11] (03CR) 10jerkins-bot: [V: 04-1] servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 (owner: 10Paladox) [18:42:17] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@14dd09e]: adjust kafka bulk daemon timeouts (duration: 00m 48s) [18:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:24] (03PS16) 10Paladox: servermon: Add gunicorn.service systemd script [puppet] - 10https://gerrit.wikimedia.org/r/362455 [18:42:26] (03CR) 10Volans: [C: 04-2] "No actually we've decided to move this directly to swift, so I will amend it once it's all ready. 
Sorry for not have updated it" [puppet] - 10https://gerrit.wikimedia.org/r/463820 (https://phabricator.wikimedia.org/T190184) (owner: 10Volans) [18:42:33] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@14dd09e]: adjust kafka bulk daemon timeouts [18:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:20] (03CR) 10Paladox: [C: 031] "Yep, works on http://gerrit-test.wmflabs.org/gerrit/q/status:open" [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [18:46:15] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@14dd09e]: adjust kafka bulk daemon timeouts (duration: 03m 42s) [18:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:35] (03CR) 10Dzahn: [C: 031] "confirmed all of these don't point to WMF name servers anymore. They all use ns.active24.cz" [dns] - 10https://gerrit.wikimedia.org/r/467087 (https://phabricator.wikimedia.org/T206923) (owner: 10Urbanecm) [18:51:46] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.20 (duration: 07m 55s) [18:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:37] !log restarted mjolnir-kafka-bulk-daemon on all elastic hosts [18:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:49] (03CR) 10Dzahn: [C: 031] "yea, looks right, but after the DNS change i would say" [puppet] - 10https://gerrit.wikimedia.org/r/467088 (https://phabricator.wikimedia.org/T206923) (owner: 10Urbanecm) [18:55:46] (03PS4) 10Niedzielski: Update: add Wikimedia logo for SEO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) [18:56:09] (03CR) 10Niedzielski: "Thanks! Commit message improved." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) (owner: 10Niedzielski) [18:59:54] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.22 (duration: 03m 10s) [18:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T1900). [19:00:13] * thcipriani working on it [19:07:25] (03PS1) 10Dzahn: delete mediawiki_singlenode module and mediawiki:::install role [puppet] - 10https://gerrit.wikimedia.org/r/470658 (https://phabricator.wikimedia.org/T162070) [19:08:08] (03PS1) 10Cwhite: graphite: add queue_depth and batch_size options to carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484) [19:08:11] (03PS1) 10EBernhardson: Correct mjolnir class reference in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/470660 [19:09:09] (03PS2) 10Cwhite: graphite: add queue_depth and batch_size options to carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484) [19:09:11] (03CR) 10Alex Monk: [C: 031] "I'm not aware of any uses of this." 
[puppet] - 10https://gerrit.wikimedia.org/r/470658 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [19:15:08] (03CR) 10Cwhite: add socket_bufsize option to make SO_RCVBUF tunable (032 comments) [debs/statsd-proxy] (wmf_v0.0.10) - 10https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [19:20:31] !log thcipriani@deploy1001 Started scap: testwiki to 1.33.0-wmf.2 and rebuild l10n cache [19:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:49] (03PS2) 10Cwhite: add socket_bufsize option to make SO_RCVBUF tunable [debs/statsd-proxy] (wmf_v0.0.10) - 10https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) [19:31:14] (03PS3) 10Cwhite: add socket_bufsize option to make SO_RCVBUF tunable [debs/statsd-proxy] (wmf_v0.0.10) - 10https://gerrit.wikimedia.org/r/470512 (https://phabricator.wikimedia.org/T196484) [19:35:14] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Andrew) *bump* -- I'm interested on if anyone is working on fixing these issues. 
If not, that's fine but I'll put some more time into ensuring that we don't get pages for them :) [19:36:06] (03PS5) 10Cwhite: diamond: remove nagios collector [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) [19:37:41] 10Operations, 10cloud-services-team: Sporadic puppet failures on labvirt hosts - https://phabricator.wikimedia.org/T201247 (10Krenair) [19:38:20] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Krenair) [19:39:16] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.37 seconds [19:40:07] (03CR) 10Cwhite: "https://puppet-compiler.wmflabs.org/compiler1002/13264/" [puppet] - 10https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [19:40:16] !log thcipriani@deploy1001 Finished scap: testwiki to 1.33.0-wmf.2 and rebuild l10n cache (duration: 19m 45s) [19:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:19] (03CR) 10Cwhite: "https://puppet-compiler.wmflabs.org/compiler1002/13263/" [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:41:24] (03CR) 10Cwhite: [C: 032] diamond: remove nagios collector [puppet] - 10https://gerrit.wikimedia.org/r/468480 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [19:45:26] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: librsvg misinterpret quoted font family names that contain whitespaces - https://phabricator.wikimedia.org/T64987 (10Aklapper) It's included in 2.44.0 according to https://gitlab.gnome.org/GNOME/librsvg/commit/18a4f166c4faf590988823c472bd0333fcf7d1e7 [19:48:36] 10Operations, 10ops-codfw: scb2001: Power supply failure - https://phabricator.wikimedia.org/T207629 (10Papaul) Dell_Ent_Triage@dell.com 2:41 PM (5 minutes ago) to me Dell Customer Communication Your Service Request Contact Us | 
Support Library | Download Center | SupportAssist | Community Forums *... [19:51:20] 10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) {F26987983} [19:51:47] 10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) If it's easier to walk through via hangouts, let me know. Sorry :( [19:52:17] (03PS1) 10BBlack: temporary verifications for GlobalSign renewal [dns] - 10https://gerrit.wikimedia.org/r/470668 [19:52:28] (03CR) 10jerkins-bot: [V: 04-1] temporary verifications for GlobalSign renewal [dns] - 10https://gerrit.wikimedia.org/r/470668 (owner: 10BBlack) [19:55:05] (03PS1) 10Thcipriani: Group0 to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470669 [19:55:44] (03CR) 10Thcipriani: [C: 032] Group0 to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470669 (owner: 10Thcipriani) [19:56:24] (03PS2) 10BBlack: temporary verifications for GlobalSign renewal [dns] - 10https://gerrit.wikimedia.org/r/470668 [19:56:50] (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470669 (owner: 10Thcipriani) [19:57:05] (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470669 (owner: 10Thcipriani) [19:57:29] (03PS2) 10Gehel: Correct mjolnir class reference in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/470660 (owner: 10EBernhardson) [19:58:16] (03CR) 10Gehel: [C: 032] Correct mjolnir class reference in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/470660 (owner: 10EBernhardson) [19:59:13] (03CR) 10BBlack: [C: 032] temporary verifications for GlobalSign renewal [dns] - 10https://gerrit.wikimedia.org/r/470668 (owner: 10BBlack) [20:10:07] (03PS1) 10Papaul: DNS: ADD production and 
mgmt DNS entries for pc200[7-9] and pc2010 [dns] - 10https://gerrit.wikimedia.org/r/470674 (https://phabricator.wikimedia.org/T207259) [20:13:35] (03PS1) 10BBlack: Revert "temporary verifications for GlobalSign renewal" [dns] - 10https://gerrit.wikimedia.org/r/470677 [20:14:52] 10Operations, 10Cloud-VPS, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): ntp broken in new region - https://phabricator.wikimedia.org/T208244 (10Andrew) [20:15:00] (03CR) 10BBlack: [C: 032] Revert "temporary verifications for GlobalSign renewal" [dns] - 10https://gerrit.wikimedia.org/r/470677 (owner: 10BBlack) [20:15:40] 10Operations, 10Cloud-VPS, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): ntp broken in new region - https://phabricator.wikimedia.org/T208244 (10Andrew) a:03Andrew [20:15:59] andrewbogott: cool! let me know if we can help! [20:18:58] 10Operations, 10Fundraising-Backlog, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) @jkim_wikimedia No worries, screenshots like this are helpful and we can use them to make better docs for the next time. So you made... 
[20:19:36] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:22:14] !log hotfixing T208254 (restarting apache2 on phab1001) [20:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:19] T208254: Legalpad access controls are confusing and seemingly broken - https://phabricator.wikimedia.org/T208254 [20:26:08] (03PS1) 10Ladsgroup: Do not load WikibaseQuality [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470679 (https://phabricator.wikimedia.org/T205064) [20:30:46] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:31:00] (03PS1) 10Bstorm: sonofgridengine: remove puppet types for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/470680 (https://phabricator.wikimedia.org/T200557) [20:32:02] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove puppet types for gridengine [puppet] - 10https://gerrit.wikimedia.org/r/470680 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:32:12] (03CR) 10Dzahn: [C: 032] "matches data in netbox, forward/reverse match, ping checked" [dns] - 10https://gerrit.wikimedia.org/r/470674 (https://phabricator.wikimedia.org/T207259) (owner: 10Papaul) [20:32:22] (03PS2) 10Dzahn: DNS: ADD production and mgmt DNS entries for pc200[7-9] and pc2010 [dns] - 10https://gerrit.wikimedia.org/r/470674 (https://phabricator.wikimedia.org/T207259) (owner: 10Papaul) [20:40:38] (03PS1) 10Bstorm: sonofgridengine: infrastructure restrictions on the master profile [puppet] - 10https://gerrit.wikimedia.org/r/470681 (https://phabricator.wikimedia.org/T200557) [20:41:54] (03CR) 10Bstorm: [C: 032] sonofgridengine: infrastructure restrictions on the master profile 
[puppet] - 10https://gerrit.wikimedia.org/r/470681 (https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [20:42:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Dzahn) [authdns1001:~] $ host pc2007.codfw.wmnet pc2007.codfw.wmnet has address 10.192.0.104 [authdns1001:~] $ host pc2008.codfw.wmnet pc2008.codfw.wmnet h... [20:45:33] !log thcipriani@deploy1001 Synchronized php-1.33.0-wmf.2/extensions/VisualEditor/lib/ve/src/ui/actions/ve.ui.WindowAction.js: [[gerrit:470672|ve.ui.WindowAction: Fix exception when opening windows]] T208347 (duration: 00m 54s) [20:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:38] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Dzahn) [20:45:38] T208347: Exception when opening any dialog/inspector in VE - https://phabricator.wikimedia.org/T208347 [20:53:46] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.2 [20:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:20] (03PS1) 10GTirloni: tools-services: Add updatetools_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) [21:09:44] (03CR) 10Cwhite: [C: 04-1] icinga: logging optimizations [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [21:16:37] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) [21:23:12] (03PS2) 10GTirloni: tools-services: Add updatetools_enabled key [puppet] - 10https://gerrit.wikimedia.org/r/470683 (https://phabricator.wikimedia.org/T207591) [22:02:10] (03PS1) 10Papaul: DHCP: Add MAC address entries for pc200[7-9] and 
pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/470716 (https://phabricator.wikimedia.org/T207259) [22:04:24] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Papaul) [22:09:58] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/470659 (https://phabricator.wikimedia.org/T196484) (owner: 10Cwhite) [22:12:45] (03CR) 10C. Scott Ananian: [C: 031] "C+1 from me, but I think this needs to be SWATed?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/469214 (https://phabricator.wikimedia.org/T198946) (owner: 10Niedzielski) [22:15:30] /14/8 [22:15:37] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.30 seconds [22:17:56] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [22:38:16] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.2/extensions/VisualEditor/: Hot-deploy UBN train blocker VisualEditor bug T208366 (duration: 00m 56s) [22:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:20] T208366: JS fatal in VE from ve.ui.DesktopContext.prototype.updateDimensions - https://phabricator.wikimedia.org/T208366 [22:38:27] OK, train fixed. [22:39:04] (Jinx.) [22:45:25] 10Operations, 10Community-Tech, 10MediaWiki-Parser, 10Thumbor, and 5 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10Samwilson) I was waiting for others to weigh in. They haven't. I've +2'd it. 
:) [22:46:00] (03PS2) 10Jforrester: [Beta Cluster] UploadWizard: Enable Structured Data captions when WBMI is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470458 (https://phabricator.wikimedia.org/T180981) [22:46:17] (03CR) 10Jforrester: [C: 032] [Beta Cluster] UploadWizard: Enable Structured Data captions when WBMI is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470458 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [22:47:23] (03Merged) 10jenkins-bot: [Beta Cluster] UploadWizard: Enable Structured Data captions when WBMI is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470458 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [22:53:42] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: [Beta Cluster] UploadWizard: Enable Structured Data captions when WBMI is enabled (duration: 00m 53s) [22:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:56] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address entries for pc200[7-9] and pc2010 [puppet] - 10https://gerrit.wikimedia.org/r/470716 (https://phabricator.wikimedia.org/T207259) (owner: 10Papaul) [22:57:41] (03CR) 10jenkins-bot: [Beta Cluster] UploadWizard: Enable Structured Data captions when WBMI is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470458 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181030T2300). [23:00:04] Urbanecm: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:07:23] (03PS1) 10Dzahn: hieradata/labs: remove mysql::server::use_apparmor: false [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) [23:08:11] * thcipriani does SWAT [23:12:28] (03PS2) 10Thcipriani: Throttle lift for Wikidata event at University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470624 (https://phabricator.wikimedia.org/T208236) (owner: 10Urbanecm) [23:13:12] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470624 (https://phabricator.wikimedia.org/T208236) (owner: 10Urbanecm) [23:14:23] (03Merged) 10jenkins-bot: Throttle lift for Wikidata event at University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470624 (https://phabricator.wikimedia.org/T208236) (owner: 10Urbanecm) [23:14:27] (03PS4) 10Dzahn: icinga: logging optimizations [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) [23:20:30] (03CR) 10Dzahn: [C: 032] "not changing settings of production icinga https://puppet-compiler.wmflabs.org/compiler1002/13265/" [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:20:45] (03PS5) 10Dzahn: icinga: logging optimizations [puppet] - 10https://gerrit.wikimedia.org/r/469320 (https://phabricator.wikimedia.org/T202782) [23:23:07] !log thcipriani@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:470624|Throttle lift for Wikidata event at University of Edinburgh]] T208236 (duration: 00m 54s) [23:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:11] T208236: IP unblock requested for 20 new accounts being created at University of Edinburgh Wikidata event. - https://phabricator.wikimedia.org/T208236 [23:26:03] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10Volans) @Andrew did it reoccurred during last week? 
do you have a list of hostnames+time by any chance? [23:26:18] (03CR) 10jenkins-bot: Throttle lift for Wikidata event at University of Edinburgh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470624 (https://phabricator.wikimedia.org/T208236) (owner: 10Urbanecm) [23:32:31] (03PS1) 10Dzahn: microsites: create bienvenida.wikimedia.org apache static site [puppet] - 10https://gerrit.wikimedia.org/r/470728 (https://phabricator.wikimedia.org/T207816) [23:34:26] (03PS2) 10Dzahn: microsites: create bienvenida.wikimedia.org apache static site [puppet] - 10https://gerrit.wikimedia.org/r/470728 (https://phabricator.wikimedia.org/T207816) [23:39:08] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13266/" [dns] - 10https://gerrit.wikimedia.org/r/470531 (https://phabricator.wikimedia.org/T207816) (owner: 10Dzahn) [23:40:06] 10Operations, 10cloud-services-team: Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10faidon) >>! In T201247#4688838, @Andrew wrote: > Spoke too soon, got another failure overnight. > > > ``` > Oct 23 06:25:20 labvirt1017 puppet-agent[161569]: (/Stage[main]/Openstack::Nova::Common::Bas... 
[23:40:19] (03CR) 10Dzahn: [C: 032] "just preparing the apache setup on the "misc" webservers, not cloning content yet, not in DNS yet" [puppet] - 10https://gerrit.wikimedia.org/r/470728 (https://phabricator.wikimedia.org/T207816) (owner: 10Dzahn) [23:44:46] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13266/" [puppet] - 10https://gerrit.wikimedia.org/r/470728 (https://phabricator.wikimedia.org/T207816) (owner: 10Dzahn) [23:47:19] (03PS5) 10Dzahn: icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) [23:48:08] (03CR) 10jerkins-bot: [V: 04-1] icinga: use fping instead of ping for faster host checks [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:51:35] (03CR) 10Prtksxna: [C: 031] microsites: create bienvenida.wikimedia.org apache static site [puppet] - 10https://gerrit.wikimedia.org/r/470728 (https://phabricator.wikimedia.org/T207816) (owner: 10Dzahn) [23:51:51] (03CR) 10Dzahn: [C: 04-1] "it's still "ping" for some hosts on icinga1001 but "fping" for others? https://puppet-compiler.wmflabs.org/compiler1002/13267/icinga1001." [puppet] - 10https://gerrit.wikimedia.org/r/469333 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:54:05] (03PS1) 10Dzahn: microsites::bienvenida: enable content cloning [puppet] - 10https://gerrit.wikimedia.org/r/470732 (https://phabricator.wikimedia.org/T207816)